snipar.scripts.impute module

This script imputes missing parental genotypes from the observed genotypes in a family. It can impute missing parents for families with no genotyped parents but at least two genotyped siblings, or with one genotyped parent and one or more genotyped offspring. To specify the siblings, one can either provide a pedigree file (--pedigree option) or the relatedness inference output from KING, produced with the --related --degree 1 options, along with age and sex information.

The pedigree file is a plain text file with a header and the columns FID (family ID), IID (individual ID), FATHER_ID (ID of father), MOTHER_ID (ID of mother). Note that individuals are assumed to have unique individual IDs (IID). Siblings are identified as individuals that share the same FID, FATHER_ID and MOTHER_ID.
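As a minimal sketch of the sibling rule above (not snipar's own code), the following groups individuals by FID, FATHER_ID and MOTHER_ID using only the standard library. The sample pedigree and the helper name `find_sibships` are made up for illustration. The file is assumed to be space-separated with ‘0’ marking a missing parent, and skipping rows with a missing parent is an assumption of this sketch.

```python
import csv
import io
from collections import defaultdict

# Hypothetical pedigree: family f1 has two siblings, family f2 has one child.
PEDIGREE_TEXT = """FID IID FATHER_ID MOTHER_ID
f1 s1 p1 m1
f1 s2 p1 m1
f1 p1 0 0
f1 m1 0 0
f2 s3 p2 m2
"""


def find_sibships(handle, nan_value="0"):
    """Group individuals that share FID, FATHER_ID and MOTHER_ID."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=" "):
        # Assumption of this sketch: rows with a missing parent are not grouped.
        if nan_value in (row["FATHER_ID"], row["MOTHER_ID"]):
            continue
        key = (row["FID"], row["FATHER_ID"], row["MOTHER_ID"])
        groups[key].append(row["IID"])
    # Keep only groups with at least two members, i.e. actual sibships.
    return {key: iids for key, iids in groups.items() if len(iids) >= 2}


sibships = find_sibships(io.StringIO(PEDIGREE_TEXT))
print(sibships)
```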

Use the --king option to provide the KING relatedness inference output (usually has suffix .kin0) and the --agesex option to provide the age and sex information. The script constructs a pedigree from this information and stores it in the HDF5 output.
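The following sketch illustrates the age rule given in the agesex option description: a ‘PO’ pair is oriented so that the parent is at least 12 years older than the child. It is not snipar's implementation. The helper name `orient_parent_offspring`, the sample rows, and the choice to test the rule in both directions and to skip undecidable pairs are assumptions made for illustration.

```python
MIN_PARENT_AGE_GAP = 12  # years, as stated in the agesex option description


def orient_parent_offspring(kin_rows, ages):
    """Return (parent, child) pairs from KING-style rows with InfType 'PO'."""
    pairs = []
    for row in kin_rows:
        if row["InfType"] != "PO":
            continue  # only parent-offspring rows need a direction
        a, b = row["ID1"], row["ID2"]
        if ages[a] - ages[b] >= MIN_PARENT_AGE_GAP:
            pairs.append((a, b))
        elif ages[b] - ages[a] >= MIN_PARENT_AGE_GAP:
            # Assumption: the same rule is applied with the roles swapped.
            pairs.append((b, a))
        # Otherwise the direction cannot be decided from age; the pair is skipped.
    return pairs


# Hypothetical kinship rows and ages.
kin_rows = [
    {"FID1": "f1", "ID1": "m1", "FID2": "f1", "ID2": "s1", "InfType": "PO"},
    {"FID1": "f1", "ID1": "s1", "FID2": "f1", "ID2": "s2", "InfType": "FS"},
    {"FID1": "f1", "ID1": "s2", "FID2": "f1", "ID2": "m1", "InfType": "PO"},
]
ages = {"m1": 52, "s1": 25, "s2": 22}
po_pairs = orient_parent_offspring(kin_rows, ages)
print(po_pairs)
```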

Args:
‘-h’, ‘--help’

Show this help message and exit.

‘-c’

Duplicates the offspring of families with more than one offspring and both parents, and adds ‘_’ to the start of their FIDs. These can be used for testing the imputation; tests.test_imputation.imputation_test uses them.

‘-silent_progress’

Hides the percentage of progress from logging

‘-use_backup’

Whether to use backup imputation where no IBD information is available.

‘--ibd’: str

Address of the IBD file without suffix. If there is a @ in the address, @ is replaced by the chromosome numbers in chr_range, one per chromosome (chr_range is an optional parameter of this script).

‘--ibd_is_king’

If not provided, the IBD input is assumed to be in snipar format. Otherwise, it is in KING format with an allsegs file.

‘--bgen’: str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in chr_range, one per chromosome (chr_range is an optional parameter of this script).

‘--bed’: str

Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in chr_range, one per chromosome (chr_range is an optional parameter of this script).

‘--chr_range’

Numbers of the chromosomes to be imputed. Should be a series of integers or ranges in x-y format.

‘--bim’: str

Address of a bim file containing the positions of the SNPs, if the address is different from the bim file of the genotypes.

‘--fam’: str

Address of a fam file containing the sample information, if the address is different from the fam file of the genotypes.

‘--out’: str, default=parent_imputed

Writes the result of imputation for chromosome i to outprefix{i}.

‘--start’: int

The script can do the imputation on a slice of each chromosome. This is the start of that slice (inclusive).

‘--end’: int

The script can do the imputation on a slice of each chromosome. This is the end of that slice (inclusive).

‘--pedigree’: str

Address of the pedigree file. The pedigree file is a space-separated file with columns ‘FID’, ‘IID’, ‘FATHER_ID’, ‘MOTHER_ID’. The default NaN value of the pedigree file is ‘0’. If your NaN value is something else, be sure to specify it with the --pedigree_nan option.

‘--king’: str

Address of a kinship file in KING format. The kinship file is a space-separated file with columns “FID1”, “ID1”, “FID2”, “ID2”, “InfType”.

Each row represents a relationship between two individuals, and the InfType column states what that relationship is. The only relationships that matter for this script are full sibling and parent-offspring, which are shown by ‘FS’ and ‘PO’ respectively. This file is used to create a pedigree and can be generated using KING.

‘--agesex’: str

Address of the agesex file. This is a space-separated file with columns “FID”, “IID”, “FATHER_ID”, “MOTHER_ID”, “sex”, “age”.

Each row contains the age and sex of one individual. Male and female sex should be represented by ‘M’ and ‘F’. The age column is used to distinguish between parent and child in a parent-offspring relationship inferred from the kinship file. ID1 is a parent of ID2 if there is a ‘PO’ relationship between them and ID1 is at least 12 years older than ID2.

‘--pcs’: str

Address of the PCs file with header “FID IID PC1 PC2 …”. If provided, MAFs will be estimated from the PCs.

‘--pc_num’: int

Number of PCs to consider

‘-find_optimal_pc’

Uses the Akaike information criterion to find the optimal number of PCs to use for MAF estimation.

‘--threads’: int, default=1

Number of threads to be used. This should not exceed the number of available cores. The default is one thread.

‘--processes’: int, default=1

Number of processes used for imputing chromosomes. Each chromosome is done on one process.

‘--chunks’: int, default=1

Number of chunks to load the data in (per process).

‘--output_compression’: str

Optional compression algorithm used in writing the output as an HDF5 file. It can be either gzip or lzf.

‘--output_compression_opts’: int

Additional settings for the optional compression algorithm. Take a look at the create_dataset function of h5py library for more information.

‘--pedigree_nan’: str, default=0

The value representing NaN in the pedigree.
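The interplay between chr_range and the @ placeholder in the ibd, bgen and bed addresses can be sketched as follows. This is an illustration, not snipar's parsing code. The helper names `parse_chr_range` and `expand_template`, the example address, and the assumption that an x-y range includes both endpoints are all hypothetical.

```python
def parse_chr_range(tokens):
    """Expand tokens such as ['1-3', '5'] into [1, 2, 3, 5]."""
    chromosomes = []
    for token in tokens:
        if "-" in token:
            start, end = token.split("-")
            # Assumption: x-y ranges include both x and y.
            chromosomes.extend(range(int(start), int(end) + 1))
        else:
            chromosomes.append(int(token))
    return chromosomes


def expand_template(address, chromosomes):
    """Replace '@' by each chromosome number; an address without '@' is used once."""
    if "@" not in address:
        return [address]
    return [address.replace("@", str(c)) for c in chromosomes]


chroms = parse_chr_range(["1-3", "5"])
paths = expand_template("genotypes/chr_@", chroms)
print(chroms)
print(paths)
```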

Results:
HDF5 files

For each chromosome i, an HDF5 file is created at outprefix{i}. This file contains the imputed genotypes, the positions of the SNPs, the columns of the resulting bim file, the contents of the resulting bim file, the pedigree table, and the family IDs of the imputed parents, under the keys ‘imputed_par_gts’, ‘pos’, ‘bim_columns’, ‘bim_values’, ‘pedigree’, and ‘families’, respectively. The file also has a ‘parental_status’ key and other details of the imputation. For more explanation, see the documentation of snipar.imputation.impute_from_sibs.impute.
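A quick way to confirm that an output file has the keys listed above is sketched below. The helper names `missing_keys` and `check_output` are hypothetical, and the output path is left to the caller because the exact file name depends on the output prefix. The h5py import is done lazily, so the key check itself can be tried on any mapping.

```python
EXPECTED_KEYS = (
    "imputed_par_gts",
    "pos",
    "bim_columns",
    "bim_values",
    "pedigree",
    "families",
    "parental_status",
)


def missing_keys(container):
    """Return the documented keys that are absent from a mapping-like object."""
    return [key for key in EXPECTED_KEYS if key not in container]


def check_output(path):
    """Open an imputation output file read-only and report missing keys."""
    import h5py  # third-party; imported here so missing_keys works without it

    with h5py.File(path, "r") as hf:
        return missing_keys(hf)


# Stand-in for an output file: a plain dict lacking one documented key.
fake_output = {key: None for key in EXPECTED_KEYS if key != "parental_status"}
absent = missing_keys(fake_output)
print(absent)
```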

snipar.scripts.impute.main(args)[source]

Calling this function with args is equivalent to running this script from the command line with the same arguments.

Args:
args: list

List of all the desired options and arguments. The possible values are all the values you can pass to this script from the command line.
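Taking the description above at face value (args is a list of command-line tokens), a call could be prepared as in the sketch below. Whether the installed snipar version accepts a list of strings or expects parsed arguments should be checked before relying on it. The helper names `build_impute_args` and `run_impute`, the example addresses, and passing chr_range values as separate tokens are assumptions made for illustration.

```python
def build_impute_args(ibd, bed, out="parent_imputed", chr_range=None, threads=1):
    """Assemble command-line style tokens from the options documented above."""
    args = ["--ibd", ibd, "--bed", bed, "--out", out, "--threads", str(threads)]
    if chr_range:
        # Assumption: each range or integer is passed as its own token.
        args += ["--chr_range", *chr_range]
    return args


def run_impute(args):
    """Call the documented entry point; requires snipar to be installed."""
    from snipar.scripts.impute import main  # imported lazily on purpose

    return main(args)


# Hypothetical addresses using the @ placeholder for the chromosome number.
args = build_impute_args("ibd/chr_@", "genotypes/chr_@", chr_range=["1-22"])
print(args)
```

The list is only built and printed here; `run_impute(args)` would start the actual imputation.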

snipar.scripts.impute.run_imputation(data)[source]

Runs the imputation and returns the consumed time.

Args:
data: dict

A dictionary with these keys and values.

Keys:

pedigree: pd.DataFrame

The standard pedigree table

control: bool

Duplicates the offspring of families with more than one offspring and both parents, and adds ‘_’ to the start of their FIDs. These can be used for testing the imputation; tests.test_imputation.imputation_test uses them.

use_backup: bool, optional

Whether to use backup imputation where no IBD information is available.

phased_address: str, optional

Address of the bgen file (does not include ‘.bgen’). Only one of unphased_address and phased_address is necessary.

unphased_address: str, optional

Address of the bed file (does not include ‘.bed’). Only one of unphased_address and phased_address is necessary.

ibd_address: str

Address of the IBD file. A KING segments file should be accompanied by an allsegs file.

ibd_is_king: bool

Whether the IBD segments are in KING format or snipar format.

pcs: np.array[float], optional

A two-dimensional array containing PC scores, with individuals and PCs along the two dimensions respectively.

pc_ids: set, optional

Set of IDs of the individuals in pcs.

find_optimal_pc: bool, optional

Uses the Akaike information criterion to find the optimal number of PCs to use for MAF estimation.

output_address: str

The address to write the result of imputation to. The default value for output_address is ‘parent_imputed_chr’.

start: int, optional

This function can do the imputation on a slice of each chromosome. If specified, this is the start of that slice (inclusive).

end: int, optional

This function can do the imputation on a slice of each chromosome. If specified, this is the end of that slice (inclusive).

bim: str, optional

Address of a bim file containing the positions of the SNPs, if the address is different from the bim file of the genotypes.

fam: str, optional

Address of a fam file containing the sample information, if the address is different from the fam file of the genotypes.

threads: int, optional

Number of threads to be used. This should not exceed the number of available cores. The default is one thread.

chunks: int

Number of chunks to load the data in (per process).

output_compression: str, optional

Optional compression algorithm used in writing the output as an hdf5 file. It can be either gzip or lzf. None means no compression.

output_compression_opts: int, optional

Additional settings for the optional compression algorithm. Take a look at the create_dataset function of h5py library for more information. None means no compression setting.

chromosome: str

name of the chromosome

pedigree_nan: str

The value representing NaN in the pedigree.

silent_progress: bool

Hides the percentage of progress from logging

Returns:
float

time consumed by the imputation.
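The data dictionary described above can be assembled as in the following sketch. The helper name `make_imputation_data`, the example addresses, and the choice of None or False for the optional entries are assumptions of this sketch, not documented defaults. A real call needs a pandas DataFrame as the pedigree; a placeholder is used here so that the example runs without pandas or snipar.

```python
def make_imputation_data(pedigree, unphased_address, ibd_address, chromosome,
                         output_address="parent_imputed_chr"):
    """Build a dict with the keys listed in the run_imputation documentation."""
    return {
        "pedigree": pedigree,
        "control": False,
        "use_backup": False,
        "phased_address": None,  # only one of the two genotype addresses is needed
        "unphased_address": unphased_address,
        "ibd_address": ibd_address,
        "ibd_is_king": False,
        "pcs": None,
        "pc_ids": None,
        "find_optimal_pc": False,
        "output_address": output_address,
        "start": None,
        "end": None,
        "bim": None,
        "fam": None,
        "threads": 1,
        "chunks": 1,
        "output_compression": None,  # None means no compression
        "output_compression_opts": None,
        "chromosome": chromosome,
        "pedigree_nan": "0",
        "silent_progress": False,
    }


# Placeholder pedigree; replace it with the standard pedigree table (a DataFrame).
placeholder_pedigree = None
data = make_imputation_data(placeholder_pedigree, "genotypes/chr_1", "ibd/chr_1", "1")
print(sorted(data))
```

With snipar installed and a real pedigree table, the resulting dictionary would be passed as `run_imputation(data)`.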