snipar.scripts.impute module

This script imputes missing parental genotypes from the observed genotypes in a family. It can impute missing parents for families with no genotyped parents but at least two genotyped siblings, or with one genotyped parent and one or more genotyped offspring. To specify the siblings, one can either provide a pedigree file (--pedigree option) or the relatedness inference output from KING, run with the --related --degree 1 options, along with age and sex information.

The pedigree file is a plain text file with a header and columns: FID (family ID), IID (individual ID), FATHER_ID (ID of father), MOTHER_ID (ID of mother). Note that individuals are assumed to have unique individual IDs (IID). Siblings are identified as individuals that have the same FID and the same FATHER_ID and MOTHER_ID.
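The sibling rule above can be illustrated with a minimal sketch using only the standard library. The pedigree rows, the helper name `sibling_groups`, and the choice to skip individuals whose parent is recorded as the NaN value are all assumptions made for illustration; this is not snipar's own implementation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical pedigree contents; all IDs are made up for illustration.
PEDIGREE_TEXT = "\n".join([
    "FID IID FATHER_ID MOTHER_ID",
    "fam1 sib1 dad1 mum1",
    "fam1 sib2 dad1 mum1",
    "fam1 dad1 0 0",
    "fam1 mum1 0 0",
    "fam2 kid1 dad2 mum2",
])


def sibling_groups(pedigree_text, nan_value="0"):
    """Group IIDs that share the same FID, FATHER_ID and MOTHER_ID."""
    reader = csv.DictReader(io.StringIO(pedigree_text), delimiter=" ")
    groups = defaultdict(list)
    for row in reader:
        # Simplifying assumption: rows with an unknown parent are skipped.
        if nan_value in (row["FATHER_ID"], row["MOTHER_ID"]):
            continue
        key = (row["FID"], row["FATHER_ID"], row["MOTHER_ID"])
        groups[key].append(row["IID"])
    # Keep only groups with at least two members, i.e. actual sibships.
    return {key: iids for key, iids in groups.items() if len(iids) >= 2}


groups = sibling_groups(PEDIGREE_TEXT)
```

Here `groups` holds a single sibship for fam1; kid1 in fam2 has no sibling and is dropped.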

Use the --king option to provide the KING relatedness inference output (usually has suffix .kin0) and the --agesex option to provide the age and sex information. The script constructs a pedigree from this information and outputs it in the HDF5 output.
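As a rough sketch of how a parent-offspring pair from KING might be oriented with the age information, the snippet below applies the 12-year rule stated in the description of the agesex option. The function name `parent_of`, the sample ages, and the symmetric check of both directions are assumptions made for illustration, not snipar's actual code.

```python
MIN_PARENT_AGE_GAP = 12  # years, as stated in the agesex option description


def parent_of(id1, id2, inf_type, ages):
    """Return (parent, child) for a 'PO' pair, or None if it cannot be decided."""
    if inf_type != "PO":
        return None
    if ages[id1] - ages[id2] >= MIN_PARENT_AGE_GAP:
        return (id1, id2)
    # Assumption: the rule is applied in the other direction as well.
    if ages[id2] - ages[id1] >= MIN_PARENT_AGE_GAP:
        return (id2, id1)
    return None


# Hypothetical ages and kinship rows (ID1, ID2, InfType).
ages = {"a": 52, "b": 25, "c": 23}
kin_rows = [("a", "b", "PO"), ("c", "a", "PO"), ("b", "c", "FS")]
resolved = [parent_of(id1, id2, inf_type, ages) for id1, id2, inf_type in kin_rows]
```

Full-sibling rows ('FS') are left unresolved here because they carry no parent-child direction.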

Args:

‘-h’, ‘--help’: Show this help message and exit.
‘-c’: Duplicates the offspring of families with more than one offspring and both parents, and adds ‘_’ to the start of their FIDs. These can be used for testing the imputation; tests.test_imputation.imputation_test uses them.
‘-silent_progress’: Hides the progress percentage from the log.
‘-use_backup’: Whether to use backup imputation where no IBD information is available.
‘--ibd’ str: Address of the IBD file without suffix. If there is an @ in the address, @ is replaced by the chromosome numbers in chr_range for each chromosome (chr_range is an optional parameter of this script).
‘--ibd_is_king’: If not provided, the IBD input is assumed to be in snipar format. Otherwise, it is in KING format with an allsegs file.
‘--bgen’ str: Address of the phased genotypes in .bgen format. If there is an @ in the address, @ is replaced by the chromosome numbers in chr_range for each chromosome (chr_range is an optional parameter of this script).
‘--bed’ str: Address of the unphased genotypes in .bed format. If there is an @ in the address, @ is replaced by the chromosome numbers in chr_range for each chromosome (chr_range is an optional parameter of this script).
‘--chr_range’: Numbers of the chromosomes to be imputed. Should be a series of ranges in x-y format or integers.
‘--bim’ str: Address of a bim file containing positions of SNPs, if the address is different from the bim file of the genotypes.
‘--fam’ str: Address of a fam file containing the sample information (family and individual IDs), if the address is different from the fam file of the genotypes.
‘--out’ str, default=parent_imputed: Writes the result of imputation for chromosome i to outprefix{i}.
‘--start’ int: The script can do the imputation on a slice of each chromosome. This is the start of that slice (inclusive).
‘--end’ int: The script can do the imputation on a slice of each chromosome. This is the end of that slice (inclusive).
‘--pedigree’ str: Address of the pedigree file. The pedigree file is a space-separated CSV with columns ‘FID’, ‘IID’, ‘FATHER_ID’, ‘MOTHER_ID’. The default NaN value of the pedigree file is ‘0’. If your NaN value is something else, be sure to specify it with the --pedigree_nan option.
‘--king’ str: Address of a kinship file in KING format. The kinship file is a space-separated CSV with columns “FID1”, “ID1”, “FID2”, “ID2”, “InfType”.

Each row represents a relationship between two individuals. InfType column states the relationship between two individuals. The only relationships that matter for this script are full sibling and parent-offspring which are shown by ‘FS’ and ‘PO’ respectively. This file is used in creating a pedigree file and can be generated using KING.
‘--agesex’ str: Address of the agesex file. This is a space-separated CSV with columns “FID”, “IID”, “FATHER_ID”, “MOTHER_ID”, “sex”, “age”.

Each row contains the age and sex of one individual. Male and female sex should be represented by ‘M’ and ‘F’. The age column is used to distinguish parent from child in a parent-offspring relationship inferred from the kinship file: ID1 is a parent of ID2 if there is a ‘PO’ relationship between them and ID1 is at least 12 years older than ID2.
‘--pcs’ str: Address of the PCs file with header “FID IID PC1 PC2 …”. If provided, MAFs will be estimated from the PCs.
‘--pc_num’ int: Number of PCs to consider.
‘-find_optimal_pc’: Uses the Akaike information criterion to find the optimal number of PCs to use for MAF estimation.
‘--threads’ int, default=1: Number of threads to use. This should not exceed the number of available cores.
‘--processes’ int, default=1: Number of processes used for imputing chromosomes. Each chromosome is handled by one process.
‘--chunks’ int, default=1: Number of chunks in which to load the data (per process).
‘--output_compression’ str: Optional compression algorithm used when writing the output as an HDF5 file. It can be either gzip or lzf.
‘--output_compression_opts’ int: Additional settings for the optional compression algorithm. See the create_dataset function of the h5py library for more information.
‘--pedigree_nan’ str, default=0: The value representing NaN in the pedigree.
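The @ substitution described for the ibd, bgen and bed options can be pictured with the following sketch. The helper names `parse_chr_range` and `expand_address`, the example path, and the reading of x-y as an inclusive range are assumptions made for illustration; snipar's own parsing may differ.

```python
def parse_chr_range(tokens):
    """Expand tokens such as ['1-3', '22'] into [1, 2, 3, 22]."""
    chromosomes = []
    for token in tokens:
        if "-" in token:
            start, end = token.split("-")
            # Assumption: both ends of an x-y range are included.
            chromosomes.extend(range(int(start), int(end) + 1))
        else:
            chromosomes.append(int(token))
    return chromosomes


def expand_address(address, chromosomes):
    """Replace '@' by each chromosome number; without '@' there is one file."""
    if "@" not in address:
        return [address]
    return [address.replace("@", str(chrom)) for chrom in chromosomes]


chromosomes = parse_chr_range(["1-3", "22"])
paths = expand_address("genotypes/chr_@", chromosomes)  # hypothetical prefix
```

With these inputs, `paths` lists one genotype prefix per chromosome, which is how a single address with @ can cover several files.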

Results:

HDF5 files: For each chromosome i, an HDF5 file is created at outprefix{i}. This file contains the imputed genotypes, the positions of the SNPs, the columns of the resulting bim file, the contents of the resulting bim file, the pedigree table, and the family IDs of the imputed parents under the keys ‘imputed_par_gts’, ‘pos’, ‘bim_columns’, ‘bim_values’, ‘pedigree’, and ‘families’ respectively. The file also has a ‘parental_status’ key and other details of the imputation. For more explanation, see the documentation of snipar.imputation.impute_from_sibs.impute.
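One way to inspect an output file is to check that the documented keys are present. The sketch below assumes h5py is installed for the file-reading part; the helper names `missing_keys` and `check_output` are hypothetical and not part of snipar.

```python
# Keys listed in the description of the HDF5 output.
DOCUMENTED_KEYS = (
    "imputed_par_gts",
    "pos",
    "bim_columns",
    "bim_values",
    "pedigree",
    "families",
    "parental_status",
)


def missing_keys(store, expected=DOCUMENTED_KEYS):
    """Return the documented keys that are absent from a mapping-like store."""
    return [key for key in expected if key not in store]


def check_output(path):
    """Open an HDF5 file read-only and report which documented keys are missing."""
    import h5py  # imported lazily so missing_keys works without h5py installed

    with h5py.File(path, "r") as hdf5_file:
        return missing_keys(hdf5_file)


# A plain dict stands in for an open file in this illustration.
example_store = {"imputed_par_gts": [], "pos": [], "pedigree": []}
absent = missing_keys(example_store)
```

Calling `check_output` on a real output path would return an empty list if every documented key is present.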

snipar.scripts.impute.main(args)[source]

Calling this function with args is equivalent to running this script from the command line with the same arguments.

Args:

args: list
List of all the desired options and arguments. The possible values are all the values you can pass to this script from the command line.
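A minimal sketch of assembling such a list is shown below. The helper `build_args` is hypothetical, the file paths are made up, and the assumption that chr_range accepts several space-separated values follows from its description as a series of ranges. The option set is illustrative only, so the call to main is left commented out.

```python
def build_args(options, flags=()):
    """Flatten an options mapping into a command-line style list of strings."""
    args = list(flags)
    for name, value in options.items():
        args.append(name)
        if isinstance(value, (list, tuple)):
            # Options that take several values contribute one item per value.
            args.extend(str(item) for item in value)
        else:
            args.append(str(value))
    return args


args = build_args(
    {
        "--bed": "genotypes/chr_@",      # hypothetical genotype prefix
        "--pedigree": "pedigree.txt",    # hypothetical pedigree file
        "--chr_range": ["1-3", "22"],
        "--out": "parent_imputed",
    },
    flags=["-silent_progress"],
)

# With snipar installed, the list could then be passed on:
# from snipar.scripts.impute import main
# main(args)
```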

snipar.scripts.impute.run_imputation(data)[source]

Runs the imputation and returns the time consumed.

Args:

data: dict
A dictionary with the following keys and values:

pedigree: pd.DataFrame
The standard pedigree table

control: bool
Duplicates the offspring of families with more than one offspring and both parents, and adds ‘_’ to the start of their FIDs. These can be used for testing the imputation; tests.test_imputation.imputation_test uses them.

use_backup: bool, optional
Whether to use backup imputation where no IBD information is available.

phased_address: str, optional
Address of the bgen file (does not include ‘.bgen’). Only one of unphased_address and phased_address is necessary.

unphased_address: str, optional
Address of the bed file (does not include ‘.bed’). Only one of unphased_address and phased_address is necessary.

ibd_address: str
Address of the IBD file. A KING segments file should be accompanied by an allsegs file.

ibd_is_king: bool
Whether the ibd segments are in king format or snipar format.

pcs: np.array[float], optional
A two-dimensional array containing PC scores for all individuals and SNPs respectively.

pc_ids: set, optional
Set of IDs of individuals in the PCs.

find_optimal_pc: bool, optional
Uses the Akaike information criterion to find the optimal number of PCs to use for MAF estimation.

output_address: str
The address to write the result of the imputation to. The default value for output_address is ‘parent_imputed_chr’.

start: int, optional
This function can do the imputation on a slice of each chromosome. If specified, this is the start of that slice (inclusive).

end: int, optional
This function can do the imputation on a slice of each chromosome. If specified, this is the end of that slice (inclusive).

bim: str, optional
Address of a bim file containing positions of SNPs, if the address is different from the bim file of the genotypes.

fam: str, optional
Address of a fam file containing the sample information (family and individual IDs), if the address is different from the fam file of the genotypes.

threads: int, optional
Number of threads to use. This should not exceed the number of available cores. The default is one.

chunks: int
Number of chunks in which to load the data (per process).

output_compression: str, optional
Optional compression algorithm used in writing the output as an hdf5 file. It can be either gzip or lzf. None means no compression.

output_compression_opts: int, optional
Additional settings for the optional compression algorithm. Take a look at the create_dataset function of h5py library for more information. None means no compression setting.

chromosome: str
Name of the chromosome.

pedigree_nan: str
The value representing NaN in the pedigree.

silent_progress: bool
Hides the percentage of progress from logging

Returns:

float: time consumed by the imputation.
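The shape of the data dictionary can be sketched as follows, together with a small check of the keys that are not marked optional above. The helper `validate`, the example paths, and the None placeholder for the pedigree are assumptions made for illustration; whether run_imputation itself enforces these requirements is not stated in this documentation.

```python
# Keys described above without the "optional" marker.
REQUIRED_KEYS = (
    "pedigree",
    "control",
    "ibd_address",
    "ibd_is_king",
    "output_address",
    "chunks",
    "chromosome",
    "pedigree_nan",
    "silent_progress",
)


def validate(data):
    """Return a list of problems found in a run_imputation-style dictionary."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    # At least one of the two genotype addresses has to be given.
    if data.get("phased_address") is None and data.get("unphased_address") is None:
        problems.append("need phased_address or unphased_address")
    return problems


data = {
    "pedigree": None,  # placeholder; the docs expect a pandas DataFrame here
    "control": False,
    "unphased_address": "genotypes/chr_22",  # hypothetical path, no '.bed'
    "ibd_address": "ibd/chr_22",             # hypothetical path
    "ibd_is_king": False,
    "output_address": "parent_imputed22",
    "chunks": 1,
    "chromosome": "22",
    "pedigree_nan": "0",
    "silent_progress": True,
}

problems = validate(data)
```

Running such a check before starting a long imputation gives an early hint about a mistyped or forgotten key.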