snipar.imputation.preprocess_data module

Contains functions for preprocessing the data for imputation.

Classes

Person

Functions

recurcive_append
create_pedigree
add_control
preprocess_king
prepare_data
compute_aics
estimate_f
prepare_gts

class snipar.imputation.preprocess_data.Person(id, fid=None, pid=None, mid=None)[source]

Bases: object

A simple data structure representing an individual.

Args:
id: str

IID of the individual.

fid: str

FID of the individual.

pid: str

IID of the father of that individual.

mid: str

IID of the mother of that individual.
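As a rough illustration, a structure with these four fields could look like the stand-in below. `PersonSketch` is a hypothetical name used only here; it is not the snipar class and may differ from it in behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonSketch:
    """Hypothetical stand-in mirroring the documented fields of Person."""
    id: str                      # IID of the individual
    fid: Optional[str] = None    # FID of the individual
    pid: Optional[str] = None    # IID of the father
    mid: Optional[str] = None    # IID of the mother


child = PersonSketch("c1", fid="fam1", pid="f1", mid="m1")
founder = PersonSketch("f1", fid="fam1")  # parents unknown, left as None
```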

snipar.imputation.preprocess_data.add_control(pedigree)[source]

Adds control families to the pedigree table for testing.

For each family that has two or more siblings and both parents, creates three new families: one with no parents, one with no mother, and one with no father. The FIDs of these families are x+original_fid, where x is “_o_”, “_p_”, or “_m_” for the cases no parents, only father, and only mother respectively. IIDs are the same in the control families as in the original family.

Args:
pedigree: pd.DataFrame

A pedigree table with ‘FID’, ‘IID’, ‘FATHER_ID’, ‘MOTHER_ID’, ‘has_father’, ‘has_mother’. Each row represents an individual. FIDs starting with “_” are reserved for control families.

Returns:
pd.DataFrame

A pedigree table with ‘FID’, ‘IID’, ‘FATHER_ID’, ‘MOTHER_ID’. Each row represents an individual. Each family with both parents and more than one offspring has corresponding control families (FIDs of control families start with ‘_’).
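The naming rule for control families can be sketched in plain Python. `control_fids` and the dictionary keys are hypothetical; only the three prefixes and the x+original_fid rule come from the description above.

```python
# Prefixes taken from the description above; the helper itself is hypothetical.
CONTROL_PREFIXES = {
    "no_parent": "_o_",
    "only_father": "_p_",
    "only_mother": "_m_",
}


def control_fids(original_fid):
    """Return the FIDs of the three control families for one original family."""
    return {case: prefix + original_fid for case, prefix in CONTROL_PREFIXES.items()}


fids = control_fids("fam1")
```

Because every control FID starts with “_”, control families can be told apart from original ones by that prefix alone.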

snipar.imputation.preprocess_data.compute_aics(unphased_gts, pc_scores, linear=True, sample_size=1000)[source]

Computes the Akaike information criterion (AIC) of linear regressions with an increasing number of PCs. Returns the number of PCs that minimizes the AIC.

Args:
unphased_gts: np.array[signed char]

A two-dimensional array containing genotypes for all individuals and SNPs respectively.

pc_scores: np.array[float]

A two-dimensional array containing PC scores for all individuals and PCs respectively.

linear: bool, optional

Whether the model is linear regression or not. GLM is not implemented yet. Default is True.

sample_size: int, optional

Number of SNPs to use for computing AICs. AICs from these SNPs are averaged. Default is 1000.

Returns:
int:

optimal number of PCs to use
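The selection step can be illustrated without fitting any regressions. The sketch below assumes the Gaussian least-squares form AIC = n ln(RSS/n) + 2k (up to an additive constant), which may differ from what snipar computes internally, and it omits the averaging over sampled SNPs. `gaussian_aic`, `best_num_pcs`, and the RSS values are hypothetical.

```python
import math


def gaussian_aic(n, rss, k):
    """AIC of a least-squares fit with k parameters, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


def best_num_pcs(n, rss_by_num_pcs):
    """Return the number of PCs whose fit has the lowest AIC.

    rss_by_num_pcs[i] is the residual sum of squares when using i PCs;
    each model also has an intercept, hence k = i + 1.
    """
    aics = [gaussian_aic(n, rss, i + 1) for i, rss in enumerate(rss_by_num_pcs)]
    return min(range(len(aics)), key=aics.__getitem__)


# Made-up RSS values: large gains from the first two PCs, negligible gains after.
optimal = best_num_pcs(100, [50.0, 40.0, 30.0, 29.9, 29.8])
```

With these made-up numbers the penalty term 2k outweighs the tiny RSS improvements beyond two PCs, so two PCs are selected.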

snipar.imputation.preprocess_data.estimate_f(unphased_gts, pc_scores, linear=True)[source]

Estimates MAF with an OLS or GLM by regressing unphased_gts on pc_scores.

Args:
unphased_gts: np.array[signed char]

A two-dimensional array containing genotypes for all individuals and SNPs respectively.

pc_scores: np.array[float]

A two-dimensional array containing PC scores for all individuals and PCs respectively.

linear: bool, optional

Whether the model is linear regression or not. Default is True.

Returns:
np.array[float], dict

A two-dimensional array containing estimated fs for all individuals and SNPs respectively, and a dictionary containing information about the model. Its keys include [‘x’, ‘coefs’, ‘TSS’, ‘RSS1’, ‘RSS2’, ‘R2_1’, ‘R2_2’, ‘larger1’, ‘less0’].
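The idea of regressing genotypes on PC scores can be sketched for a single SNP and a single PC in plain Python. This is not the snipar implementation: the 0/1/2 genotype coding, the halving of the fitted genotype to obtain a frequency, and the interpretation of ‘less0’ and ‘larger1’ as counts of out-of-range estimates are all assumptions, and `fit_simple_ols` is a hypothetical helper.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy data for one SNP: genotypes (assumed coded 0/1/2) and one PC score each.
pc1 = [-1.0, 0.0, 1.0, 2.0]
gts = [0, 1, 1, 2]

intercept, slope = fit_simple_ols(pc1, gts)
# Assumption: halve the fitted genotype to get a per-individual allele frequency.
f = [(intercept + slope * x) / 2 for x in pc1]
# A linear fit can produce frequencies outside [0, 1]; count such cases.
less0 = sum(fi < 0 for fi in f)
larger1 = sum(fi > 1 for fi in f)
```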

snipar.imputation.preprocess_data.prepare_data(pedigree, phased_address, unphased_address, ibd_address, ibd_is_king, bim_address=None, fam_address=None, control=False, chromosome=None, pedigree_nan='0')[source]

Processes the non-genotype data required for the imputation and returns it.

Outputs used for the imputation contain ASCII bytes instead of strings.

Args:
pedigree: pd.DataFrame

The pedigree table. It contains ‘FID’, ‘IID’, ‘FATHER_ID’, and ‘MOTHER_ID’ columns.

phased_address: str

Address of the phased bgen file (does not include ‘.bgen’). Only one of unphased_address and phased_address is necessary.

unphased_address: str

Address of the bed file (does not include ‘.bed’). Only one of unphased_address and phased_address is necessary.

ibd_address: str

Address of the IBD file. A KING segments file should be accompanied by an allsegs file.

ibd_is_king: boolean

Whether the IBD segments are in KING format rather than snipar format.

bim_address: str, optional

Address of the bim file if it’s different from the address of the bed file. Does not include ‘.bim’.

fam_address: str, optional

Address of the fam file if it’s different from the address of the bed file. Does not include ‘.fam’.

control: boolean, optional

If True, adds control families to the pedigree table for testing using snipar.imputation.preprocess_data.add_control.

chromosome: str, optional

Number of the chromosome that’s going to be loaded.

pedigree_nan: str, optional

Value that’s considered nan in the pedigree. The default is ‘0’

Returns:
tuple(pandas.DataFrame, dict, numpy.ndarray, pandas.DataFrame, numpy.ndarray, numpy.ndarray)
Returns the data required for the imputation. This data is a tuple of multiple objects.
sibships: pandas.DataFrame

A pandas DataFrame with columns [‘FID’, ‘FATHER_ID’, ‘MOTHER_ID’, ‘IID’, ‘has_father’, ‘has_mother’, ‘single_parent’], where the IID column is a list of the IIDs of individuals in that family. It only contains families that have more than one child or only one parent.

ibd: pandas.DataFrame

A pandas DataFrame with columns ‘ID1’, ‘ID2’, ‘segment’. The ‘segment’ column holds a list of IBD segments between ID1 and ID2. Each segment consists of a start, an end, and an IBD status. The segment list is flattened, meaning it looks like [start0, end0, ibd_status0, start1, end1, ibd_status1, …].

bim: pandas.DataFrame

A dataframe with these columns (dtype str): Chr, id, morgans, coordinate, allele1, allele2.

chromosomes: str

A string containing all the chromosomes present in the data.

ped_ids: set

Set of ids of individuals with missing parents.

pedigree_output: np.array

Pedigree with added parental status.

snipar.imputation.preprocess_data.prepare_gts(phased_address, unphased_address, bim, pedigree_output, ped_ids, chromosomes, start=None, end=None, pcs=None, pc_ids=None, find_optimal_pc=None)[source]

Processes the genotype data required for the imputation and returns it.

Outputs used for the imputation contain ASCII bytes instead of strings.

Args:
phased_address: str

Address of the phased bgen file (does not include ‘.bgen’). Only one of unphased_address and phased_address is necessary.

unphased_address: str

Address of the bed file (does not include ‘.bed’). Only one of unphased_address and phased_address is necessary.

bim: pandas.DataFrame

A dataframe with these columns (dtype str): Chr, id, morgans, coordinate, allele1, allele2.

pedigree_output: np.array

Pedigree with added parental status.

ped_ids: set

Set of ids of individuals with missing parents.

chromosomes: str

A string containing all the chromosomes present in the data.

start: int, optional

This function can be used for preparing a slice of a chromosome. This is the location of the start of the slice.

end: int, optional

This function can be used for preparing a slice of a chromosome. This is the location of the end of the slice.

pcs: np.array[float], optional

A two-dimensional array containing PC scores for all individuals and PCs respectively.

pc_ids: set, optional

Set of ids of individuals in the pcs.

find_optimal_pc: bool, optional

If set, uses the Akaike information criterion to find the optimal number of PCs to use for MAF estimation.

Returns:
tuple(np.array[signed char], np.array[signed char], str->int, np.array[int], np.array[float], dict)
phased_gts: np.array[signed char], optional

A three-dimensional array containing genotypes for all individuals, SNPs, and haplotypes respectively.

unphased_gts: np.array[signed char]

A two-dimensional array containing genotypes for all individuals and SNPs respectively.

iid_to_bed_index: str->int

A str->int dictionary mapping IIDs of people to their location in the bed file.

pos: np.array[int]

A numpy array with the position of each SNP in the order of appearance in gts.

practical_f: np.array[float]

A two-dimensional array containing estimated fs for all individuals and SNPs respectively.

hdf5_output_dict: dict
A dictionary whose values will be written in the imputation output under its keys. It contains:

‘bim_columns’ : Columns of the resulting bim file.
‘bim_values’ : Contents of the resulting bim file.
‘pedigree’ : Pedigree table. Its columns are has_father, has_mother, single_parent respectively.
‘non_duplicates’ : Indexes of the unique SNPs. Imputation is restricted to them.
‘standard_f’ : Whether the allele frequencies are just population averages instead of MAFs estimated using PCs.
‘MAF_*’ : Info about the MAF estimator, if a MAF estimator is used.

snipar.imputation.preprocess_data.preprocess_king(ibd, segs, bim, chromosomes, sibships)[source]
Converts IBD segments in KING format to the snipar format.

KING format only stores IBD1 and IBD2 segments in the IBD file. The remaining regions are IBD0 only if they are present in the segs file. This function finds the IBD0 sections and appends them to the IBD data structure.

Args:
ibd: pd.DataFrame

A pandas DataFrame with columns including [“ID1”, “ID2”, “IBDType”, “Chr”, “StartSNP”, “StopSNP”] where IDs are individual IIDs.

segs: pd.DataFrame

A pandas DataFrame with columns including [“Segment”, “Chr”, “StartSNP”, “StopSNP”]

bim: pd.DataFrame

A dataframe (dtype str) with columns including: Chr, id, coordinate.

chromosomes: list

List of chromosome numbers.

sibships: pandas.DataFrame

A pandas DataFrame with columns [‘FID’, ‘FATHER_ID’, ‘MOTHER_ID’, ‘IID’, ‘has_father’, ‘has_mother’, ‘single_parent’], where the IID column is a list of the IIDs of individuals in that family. It only contains families that have more than one child or only one parent.

Returns:
(str, str) -> list

A dictionary where the keys are pairs of individual IDs and the values are IBD segments. Each list is flattened and holds information about successive IBD segments, meaning it looks like [start0, end0, ibd_status0, start1, end1, ibd_status1, …].
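The flattened layout, and the idea of treating uncovered stretches as IBD0, can be sketched as follows. `fill_ibd0` and `unflatten` are hypothetical helpers operating on plain integer positions; the real function works on KING tables and SNP ids, and details such as boundary handling may differ.

```python
def fill_ibd0(covered_start, covered_end, ibd_segments):
    """Build a flattened [start, end, status, ...] list for one pair.

    ibd_segments: sorted, non-overlapping (start, end, status) tuples with
    status 1 or 2, all inside [covered_start, covered_end]. Gaps between
    them are filled with status 0.
    """
    flat = []
    cursor = covered_start
    for start, end, status in ibd_segments:
        if start > cursor:
            flat += [cursor, start, 0]    # gap before this segment is IBD0
        flat += [start, end, status]
        cursor = end
    if cursor < covered_end:
        flat += [cursor, covered_end, 0]  # trailing gap is IBD0
    return flat


def unflatten(flat):
    """Group a flattened list back into (start, end, status) triples."""
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


flat = fill_ibd0(0, 100, [(10, 40, 1), (40, 70, 2)])
triples = unflatten(flat)
```

Reading the flat list three items at a time recovers the individual segments, which is how a consumer of the returned dictionary would typically walk it.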

snipar.imputation.preprocess_data.recurcive_append(dictionary, index, element)[source]

Adds an element to the value of every key that can be reached from index by calling get recursively.

Args:
dictionary: dict

A dictionary mapping objects to lists.

index

The start point

element

What should be added to values
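One possible reading of this description is sketched below: each value lists further keys, and the element is added to the value of every key reachable from the starting key. `append_reachable` is a hypothetical stand-in written iteratively; it is not the snipar implementation, whose traversal details and container types may differ.

```python
def append_reachable(dictionary, index, element):
    """Append element to the list of every key reachable from index.

    Reachability follows the lists themselves: each item in dictionary[key]
    is treated as a further key if it is present in the dictionary.
    """
    seen = set()
    stack = [index]
    while stack:
        key = stack.pop()
        if key in seen or key not in dictionary:
            continue
        seen.add(key)
        stack.extend(dictionary[key])    # queue neighbours before mutating
        dictionary[key].append(element)


links = {"a": ["b"], "b": ["c"], "c": [], "d": []}
append_reachable(links, "a", "x")
```

The `seen` set keeps the traversal from looping forever if the dictionary contains a cycle; key "d" is untouched because it cannot be reached from "a".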