snipar.imputation.impute_from_sibs module

Contains Cython functions for imputing the parental genotype sum from offspring and from parents (if they are observed).

Functions

get_probability_of_both_parents_conditioned_on_offsprings
get_probability_of_one_parent_conditioned_on_offsprings_and_parent
get_IBD
get_hap_index
is_possible_child
dict_to_cmap
impute_snp_from_offsprings
impute_snp_from_parent_offsprings
get_IBD_type
impute

snipar.imputation.impute_from_sibs.impute()

Does the parent sum imputation for families in sibships and all the SNPs in unphased_gts and returns the results.

Inputs and outputs of this function are ASCII bytes instead of strings. If output_address is given, the result of the imputation is written to it.
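The note about ASCII bytes can be illustrated with a small sketch. It assumes that the identifiers (such as the IIDs used as keys of iid_to_bed_index) are the values that need to be bytes; the IID values below are made up, and nothing here calls snipar itself.

```python
# Hypothetical IIDs, listed in the order they appear in the bed file.
iids = ["1_0", "1_1", "2_0"]

# Encode each IID to ASCII bytes and map it to its row index in the bed file.
iid_to_bed_index = {iid.encode("ascii"): i for i, iid in enumerate(iids)}

# Decoding goes the other way, e.g. for ids that come back as bytes.
decoded = [key.decode("ascii") for key in iid_to_bed_index]
```

Encoding with "ascii" rather than "utf-8" raises an error early if an identifier contains a non-ASCII character.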

Args:
sibships: pandas.DataFrame

A pandas DataFrame with columns [‘FID’, ‘FATHER_ID’, ‘MOTHER_ID’, ‘IID’], where the IID column is a list of the IIDs of the individuals in that family. It only contains families with more than one child. The parental sum is computed for all these families.

iid_to_bed_index: str->int

A dictionary mapping IIDs of people to their location in the bed file.

phased_gts: numpy.array[signed char]

Numpy array containing the phased genotype data. Axes are individuals and SNPs respectively. Its elements should be 0 or 1, except for NaN values, which should be equal to the nan_integer specified in the config.

unphased_gts: numpy.array[signed char]

Numpy array containing the unphased genotype data from a bed file. Axes are individuals, SNPs and haplotype number respectively. Its elements should be 0 or 1, except for NaN values, which should be equal to the nan_integer specified in the config.

ibd: pandas.DataFrame

A pandas DataFrame with columns ‘ID1’, ‘ID2’, ‘segment’. The segment column is a list of IBD segments between ID1 and ID2. Each segment consists of a start, an end, and an IBD status. The segment list is flattened, meaning it looks like [start0, end0, ibd_status0, start1, end1, ibd_status1, …]

pos: numpy.array

A numpy array with the position of each SNP in the order of appearance in phased and unphased gts.

hdf5_output_dict: dict

Other key values to be added to the HDF5 output. Usually contains:

‘bim_columns’ : Columns of the resulting bim file
‘bim_values’ : Contents of the resulting bim file
‘pedigree’ : pedigree table. Its columns are has_father, has_mother, single_parent respectively.
‘non_duplicates’ : Indexes of the unique SNPs. Imputation is restricted to them.
‘standard_f’ : Whether the allele frequencies are just population average instead of MAFs estimated using PCs
‘MAF_*’ : info about the MAF estimator if MAF estimator is used.

chromosome: str

Name of the chromosome(s) to be imputed. Only used for logging purposes.

freqs: list[float]

A two-dimensional array containing the estimated allele frequencies (fs). Axes are individuals and SNPs respectively.

output_address: str, optional

If provided, the results are written to this address in HDF5 format. Aside from all the key, value pairs inside hdf5_output_dict, the following are also written to the file:

‘imputed_par_gts’ : imputed parental genotypes. It is the imputed missing parent if only one parent is missing, and the imputed average of both parents if both are missing.
‘pos’ : the position of SNPs (in the order of appearance in genotypes)
‘families’ : family ids of the imputed parents (in the order of appearance in genotypes)
‘parental_status’ : a numpy array where each row shows the family status of the family of the corresponding row in families. Columns are has_father, has_mother, and single_parent.
‘sib_ratio_backup’ : An array whose size is the number of SNPs. Shows the ratio of backup imputation among offspring imputations for each SNP.
‘parent_ratio_backup’ : An array whose size is the number of SNPs. Shows the ratio of backup imputation among parent-offspring imputations for each SNP.
‘mendelian_error_ratio’ : Ratio of Mendelian errors among parent-offspring pairs for each SNP
‘estimated_genotyping_error’ : estimated for each SNP using mendelian_error_ratio and MAF
‘ratio_ibd0’ : ratio of families with offspring in ibd0 to all the families.

threads: int, optional

Specifies the number of threads to be used. If None, only one thread is used.

output_compression: str

Optional compression algorithm used in writing the output as an HDF5 file. It can be either gzip or lzf. None means no compression.

output_compression_opts: int

Additional settings for the optional compression algorithm. See the create_dataset function of the h5py library for more information. None means no compression setting.

half_window: int, optional

For each location i, the IBD inference for the haplotypes is restricted to [i-half_window, i+half_window].

ibd_threshold: float, optional

Minimum ratio of agreement between haplotypes for declaring IBD.

silent_progress: boolean, optional

Whether to log the percentage of the imputation’s progress.

use_backup: boolean, optional

Whether to use backup imputation where no IBD information is available. It is False by default.
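The flattened segment layout described for the ibd argument can be unpacked as in the sketch below. This is only an illustration, not snipar code: the helpers unflatten_segments and ibd_status_at are hypothetical names, the segment values are made up, and the sketch assumes that segment ends are inclusive and that a position outside every segment has no IBD information.

```python
def unflatten_segments(flat):
    """Group [start0, end0, status0, start1, ...] into (start, end, status) triples."""
    if len(flat) % 3 != 0:
        raise ValueError("flattened segment list length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def ibd_status_at(flat, position):
    """Return the IBD status of the segment covering `position`, or None.

    Treats segment ends as inclusive, which is an assumption of this sketch.
    """
    for start, end, status in unflatten_segments(flat):
        if start <= position <= end:
            return status
    return None


# Made-up flattened segment list for one (ID1, ID2) pair.
flat_segments = [100, 500, 2, 501, 900, 1, 901, 1500, 0]
segments = unflatten_segments(flat_segments)
```

With these values, ibd_status_at(flat_segments, 600) falls in the second segment, and a position such as 2000 is covered by no segment, which is the situation the use_backup option refers to.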

Returns:
tuple(list, numpy.array)

The first element is the family ids of the imputed parents (in the order of appearance in the second element), and the second element is the imputed parental genotypes.
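The pairing between the two returned elements can be sketched as follows. The values are made-up stand-ins for the return value of impute(), plain nested lists are used in place of the numpy array so the example has no dependencies, and the family ids are assumed to be ASCII bytes as noted above.

```python
# Stand-in for: families, imputed_par_gts = impute(...)
families = [b"fam1", b"fam2"]                 # made-up family ids (ASCII bytes)
imputed_par_gts = [[0.5, 1.0, 1.5],           # row 0 belongs to families[0]
                   [1.0, 0.0, 2.0]]           # row 1 belongs to families[1]

# Row i of the genotypes corresponds to families[i], so the two can be zipped.
gts_by_family = {
    fid.decode("ascii"): row for fid, row in zip(families, imputed_par_gts)
}
```

Keying by the decoded family id makes it straightforward to look up the imputed row for one family without tracking row indexes.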