sibreg package
Subpackages
- sibreg.bin package
- Submodules
- sibreg.bin.impute_from_sibs module
- sibreg.bin.impute_from_sibs_hdf5 module
- sibreg.bin.impute_from_sibs_setup module
- sibreg.bin.impute_po module
- sibreg.bin.impute_runner module
- sibreg.bin.make_rdr_grms module
- sibreg.bin.pGWAS module
- sibreg.bin.poGWAS module
- sibreg.bin.preprocess_data module
- sibreg.bin.sGWAS module
- sibreg.bin.triGWAS module
- Module contents
Submodules
sibreg.sibreg module
- sibreg.sibreg.compute_pgs(par_gts_f, gts_f, pgs, sib=False, compute_controls=False)[source]
Compute a polygenic score (PGS) for the individuals with observed genotypes and observed/imputed parental genotypes.
- Args:
- par_gts_f
str
path to HDF5 file with imputed parental genotypes
- gts_f
str
path to bed file with observed genotypes
- pgs
sibreg.pgs
the PGS, defined by the weights for a set of SNPs and the alleles of those SNPs
- sib
bool
Compute the PGS for genotyped individuals with at least one genotyped sibling and observed/imputed parental genotypes. Default False.
- compute_controls
bool
Compute polygenic scores for control families (families with observed parental genotypes set to missing). Default False.
- Returns:
- pg
sibreg.gtarray
Return the polygenic score as a genotype array with columns: individual’s PGS, mean of their siblings’ PGS, observed/imputed paternal PGS, observed/imputed maternal PGS
- sibreg.sibreg.find_individuals_with_sibs(ids, ped, gts_ids, return_ids_only=False)[source]
Used in get_gts_matrix and get_fam_means to find the individuals in ids that have genotyped siblings.
- sibreg.sibreg.find_par_gts(pheno_ids, ped, fams, gts_id_dict)[source]
Used in get_gts_matrix to find whether individuals have imputed or observed parental genotypes, and to find the indices of the observed/imputed parents in the observed/imputed genotype arrays. ‘par_status’ codes whether an individual has parents that are observed or imputed or neither; ‘gt_indices’ records the relevant index of the parent in the observed/imputed genotype arrays; ‘fam_labels’ records the family of the individual based on the pedigree.
- sibreg.sibreg.fit_sibreg_model(y, X, fam_labels, add_intercept=False, tau_init=1, return_model=True, return_vcomps=True, return_fixed=True)[source]
Compute the MLE for the fixed effects in a family-based linear mixed model.
- Args:
- y
array
vector of phenotype values
- X
array
regression design matrix for fixed effects
- fam_labels
array
vector of family labels: residual correlations in y are modelled between family members (that share a fam_label)
- add_intercept
bool
whether to add an intercept to the fixed effect design matrix
- Returns:
- model
sibreg.model
the sibreg model object, if return_model=True
- vcomps
float
the MLEs for the variance parameters: sigma2 (residual variance) and tau (ratio between sigma2 and family variance), if return_vcomps=True
- alpha
array
MLE of fixed effects, if return_fixed=True
- alpha_cov
array
sampling variance-covariance matrix for MLE of fixed effects, if return_fixed=True
- sibreg.sibreg.get_fam_means(ids, ped, gts, gts_ids, remove_proband=True, return_famsizes=False)[source]
Used in get_gts_matrix to find the mean genotype in each sibship (family) for each SNP or for a PGS. The gtarray that is returned is indexed based on the subset of ids provided from sibships of size 2 or greater. If remove_proband=True, then the genotype/PGS of the index individual is removed from the fam_mean given for that individual.
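The effect of remove_proband can be illustrated with a minimal NumPy sketch. This is not the sibreg implementation: the function name sib_means_sketch is hypothetical, it works on a plain array with a vector of family labels rather than on a pedigree, and it marks singletons with NaN instead of dropping them.

```python
import numpy as np

def sib_means_sketch(gts, fams, remove_proband=True):
    """Per-individual sibship mean of genotypes/PGS (illustrative only)."""
    gts = np.asarray(gts, dtype=float)
    fams = np.asarray(fams)
    means = np.full(gts.shape, np.nan)
    for fam in np.unique(fams):
        rows = np.flatnonzero(fams == fam)
        n = rows.size
        if n < 2:
            continue  # sibships of size 1 have no sibling to average over
        total = gts[rows].sum(axis=0)
        if remove_proband:
            # leave-one-out mean: exclude the index individual's own value
            means[rows] = (total - gts[rows]) / (n - 1)
        else:
            means[rows] = total / n
    return means

gts = np.array([[0.0], [2.0], [1.0], [1.0]])
fams = np.array(["A", "A", "A", "B"])
sib_means = sib_means_sketch(gts, fams)
```

With remove_proband=True, each member of family A receives the mean of the other two siblings, so the value differs between siblings of the same family.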
- sibreg.sibreg.get_gts_matrix(par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, compute_controls=False, parsum=False, start=0, end=None, print_sample_info=False)[source]
Reads observed and imputed genotypes and constructs a family based genotype matrix for the individuals with observed/imputed parental genotypes, and if sib=True, at least one genotyped sibling.
- Args:
- par_gts_f
str
path to HDF5 file with imputed parental genotypes
- gts_f
str
path to bed file with observed genotypes
- snp_ids
numpy.ndarray
If provided, only obtains the subset of SNPs specified that are present in both imputed and observed genotypes
- ids
numpy.ndarray
If provided, only obtains the ids with observed genotypes and imputed/observed parental genotypes (and observed sibling genotypes if sib=True)
- sib
bool
Retrieve genotypes for individuals with at least one genotyped sibling along with the average of their siblings’ genotypes and observed/imputed parental genotypes. Default False.
- compute_controls
bool
Also return genotype arrays for control families (families with observed parental genotypes set to missing). Default False.
- parsum
bool
Return the sum of maternal and paternal observed/imputed genotypes rather than separate maternal/paternal genotypes. Default False.
- Returns:
- G
sibreg.gtarray
Genotype array for the subset of genotyped individuals with complete imputed/observed parental genotypes. The array is [N x k x L], where N is the number of individuals; k depends on whether sib=True and whether parsum=True; and L is the number of SNPs. If sib=False and parsum=False, then k=3 and this axis indexes individual’s genotypes, individual’s father’s imputed/observed genotypes, individual’s mother’s imputed/observed genotypes. If sib=True and parsum=False, then k=4, and this axis indexes the individual, the sibling, the paternal, and maternal genotypes in that order. If parsum=True and sib=False, then k=2, and this axis indexes the individual and sum of paternal and maternal genotypes; etc. If compute_controls=True, then a list is returned, where the first element is as above, and the following elements give equivalent genotype arrays for control families where the mother has been set to missing, the father has been set to missing, and both parents have been set to missing.
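The [N x k x L] layout can be made concrete with a small NumPy sketch. The array below is random toy data standing in for the output of get_gts_matrix with sib=False and parsum=False; it only illustrates how the second axis is indexed and how the parsum=True layout relates to it, not how sibreg builds the array.

```python
import numpy as np

N, L = 4, 5                      # toy numbers of individuals and SNPs
rng = np.random.default_rng(0)
# Stand-in for a k=3 array: axis 1 is (proband, paternal, maternal)
G = rng.integers(0, 3, size=(N, 3, L)).astype(float)

proband = G[:, 0, :]
paternal = G[:, 1, :]
maternal = G[:, 2, :]

# Equivalent of the k=2 layout: proband and sum of parental genotypes
G_parsum = np.stack([proband, paternal + maternal], axis=1)
```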
- sibreg.sibreg.get_gts_matrix_given_ped(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]
Used in get_gts_matrix: see get_gts_matrix for documentation
- sibreg.sibreg.get_gts_matrix_given_ped_bgen(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]
Used in get_gts_matrix: see get_gts_matrix for documentation
- sibreg.sibreg.get_indices_given_ped(ped, fams, gts_ids, ids=None, sib=False, verbose=False)[source]
Used in get_gts_matrix_given_ped to get the ids of individuals with observed/imputed parental genotypes and, if sib=True, at least one genotyped sibling. It returns those ids along with the indices of the relevant individuals and their first degree relatives in the observed genotypes (observed indices), and the indices of the imputed parental genotypes for those individuals.
- class sibreg.sibreg.gtarray(garray, ids, sid=None, alleles=None, pos=None, chrom=None, fams=None, par_status=None)[source]
Bases:
object
Define a genotype or PGS array that stores individual IDs, family IDs, and SNP information.
- Args:
- garray
array
2 or 3 dimensional numpy array of genotypes/PGS values. First dimension is individuals. For a 2 dimensional array, the second dimension is SNPs or PGS values. For a 3 dimensional array, the second dimension indexes the individual and his/her relatives’ genotypes (for example: proband, paternal, and maternal); and the third dimension is the SNPs.
- ids
array
vector of individual IDs
- sid
array
vector of SNP ids, equal in length to the size of the last dimension of garray
- alleles
array
[L x 2] matrix of ref and alt alleles for the SNPs. L must match size of sid
- pos
array
vector of SNP positions; must match size of sid
- chrom
array
vector of SNP chromosomes; must match size of sid
- fams
array
vector of family IDs; must match size of ids
- par_status
numpy.array
[N x 2] numpy matrix that records whether parents have observed or imputed genotypes/PGS, where N matches size of ids. The first column is for the father of that individual; the second column is for the mother of that individual. If the parent is neither observed nor imputed, the value is -1; if observed, 0; and if imputed, 1.
- Returns:
G :
sibreg.gtarray
- add(garray)[source]
Adds another gtarray of the same dimension to this array and returns the sum. It matches IDs before summing.
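Matching IDs before summing can be sketched as below. This is an illustration only: add_matched_sketch is a hypothetical helper working on plain arrays and ID lists, and it assumes that only individuals present in both inputs are kept, which the description above does not specify.

```python
import numpy as np

def add_matched_sketch(a, a_ids, b, b_ids):
    """Sum rows of a and b that share an ID (assumption: unmatched IDs are dropped)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    b_row = {i: row for row, i in enumerate(b_ids)}
    keep = [row for row, i in enumerate(a_ids) if i in b_row]
    ids = [a_ids[row] for row in keep]
    b_rows = [b_row[i] for i in ids]
    return a[keep] + b[b_rows], ids

a = [[1, 2], [3, 4], [5, 6]]
a_ids = ["i1", "i2", "i3"]
b = [[10, 10], [20, 20]]
b_ids = ["i3", "i1"]           # different order, and "i2" is absent
total, ids = add_matched_sketch(a, a_ids, b, b_ids)
```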
- diagonalise(inv_root)[source]
This will transform the genotype array based on the inverse square root of the phenotypic covariance matrix from the family based linear mixed model.
- fill_NAs()[source]
This normalises the SNP columns to have mean-zero, then fills in NA values with zero.
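A minimal sketch of this mean-centre-then-fill step, assuming a 2-dimensional float array with missing genotypes coded as NaN (fill_nas_sketch is a hypothetical name, and the real method may represent missingness differently, for example with a masked array):

```python
import numpy as np

def fill_nas_sketch(gts):
    """Centre each SNP column on its observed mean, then set missing entries to zero."""
    gts = np.asarray(gts, dtype=float)
    col_means = np.nanmean(gts, axis=0)   # mean over non-missing individuals
    centred = gts - col_means             # new array; the input is not modified
    centred[np.isnan(centred)] = 0.0      # zero is the column mean after centring
    return centred

gts = np.array([[0.0, 2.0],
                [1.0, np.nan],
                [2.0, 0.0]])
filled = fill_nas_sketch(gts)
```

Because the columns are centred first, filling with zero is equivalent to imputing each missing value with the column mean.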
- sibreg.sibreg.make_gts_matrix(gts, imp_gts, par_status, gt_indices, parsum=False)[source]
Used in get_gts_matrix to construct the family based genotype matrix given observed/imputed genotypes. ‘gt_indices’ has the indices in the observed/imputed genotype arrays; and par_status codes whether the parents are observed (0) or imputed (1).
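How par_status can select between observed and imputed parental genotypes is sketched below. The layout of gt_indices used here (one row per individual, with columns for the proband, father, and mother index) is an assumption made for illustration, as is the function name; the sketch covers only the sib=False, parsum=False case.

```python
import numpy as np

def make_gts_matrix_sketch(gts, imp_gts, par_status, gt_indices):
    """Assemble an [N x 3 x L] array (proband, father, mother); illustrative only."""
    gts = np.asarray(gts, dtype=float)
    imp_gts = np.asarray(imp_gts, dtype=float)
    par_status = np.asarray(par_status)
    gt_indices = np.asarray(gt_indices)
    N, L = gt_indices.shape[0], gts.shape[1]
    G = np.full((N, 3, L), np.nan)
    G[:, 0, :] = gts[gt_indices[:, 0]]            # proband: always observed
    for p in (0, 1):                              # 0 = father, 1 = mother
        obs = par_status[:, p] == 0               # parent observed
        imp = par_status[:, p] == 1               # parent imputed
        G[obs, 1 + p, :] = gts[gt_indices[obs, 1 + p]]
        G[imp, 1 + p, :] = imp_gts[gt_indices[imp, 1 + p]]
    return G

gts = [[0, 1], [2, 2], [1, 0]]      # observed: child 0, child 0's father, child 1
imp_gts = [[1.5, 0.5]]              # one imputed parental genotype row
par_status = [[0, 1], [1, 1]]       # child 0: father observed, mother imputed
gt_indices = [[0, 1, 0], [2, 0, 0]]
G = make_gts_matrix_sketch(gts, imp_gts, par_status, gt_indices)
```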
- sibreg.sibreg.make_id_dict(x, col=0)[source]
Make a dictionary that maps from the values in the given column (col) to their row-index in the input array
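A minimal sketch of such a mapping, written independently of sibreg (make_id_dict_sketch is a hypothetical name, and the handling of 1-dimensional input and of duplicate values is an assumption):

```python
import numpy as np

def make_id_dict_sketch(x, col=0):
    """Map each value in column `col` of `x` to its row index (illustrative only)."""
    x = np.asarray(x)
    values = x if x.ndim == 1 else x[:, col]
    # With duplicated values, the last row index wins in this sketch
    return {value: row for row, value in enumerate(values.tolist())}

ped = np.array([["fam1", "id1"],
                ["fam1", "id2"],
                ["fam2", "id3"]])
id_dict = make_id_dict_sketch(ped, col=1)
```

A dictionary like this gives constant-time lookup of the row for a given ID, which is useful when aligning pedigree, genotype, and phenotype files.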
- sibreg.sibreg.match_observed_and_imputed_snps(gts_f, par_gts_f, bim, snp_ids=None, start=0, end=None)[source]
Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.
- sibreg.sibreg.match_observed_and_imputed_snps_bgen(gts_f, par_gts_f, snp_ids=None, start=0, end=None)[source]
Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.
- sibreg.sibreg.match_phenotype(G, y, pheno_ids)[source]
Match a phenotype to a genotype array by individual IDs.
- Args:
- G
gtarray
genotype array to match phenotype to
- y
array
vector of phenotype values
- pheno_ids
array
vector of individual IDs corresponding to phenotype vector, y
- Returns:
- y
array
vector of phenotype values matched by individual IDs to the genotype array
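The ID matching can be sketched on plain arrays as follows. This is an illustration rather than the sibreg code: match_phenotype_sketch is a hypothetical name, it takes the genotype IDs directly instead of a gtarray, and it additionally returns a boolean mask showing which genotyped individuals have a phenotype.

```python
import numpy as np

def match_phenotype_sketch(geno_ids, y, pheno_ids):
    """Reorder y to follow geno_ids, keeping only individuals with a phenotype."""
    pheno_row = {i: row for row, i in enumerate(pheno_ids)}
    keep = np.array([i in pheno_row for i in geno_ids])
    matched_ids = [i for i in geno_ids if i in pheno_row]
    y_matched = np.array([y[pheno_row[i]] for i in matched_ids])
    return y_matched, keep

geno_ids = ["a", "b", "c"]
y = [1.5, 2.5, 3.5]
pheno_ids = ["c", "x", "a"]     # "b" has no phenotype; "x" has no genotype
y_matched, keep = match_phenotype_sketch(geno_ids, y, pheno_ids)
```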
- class sibreg.sibreg.model(y, X, labels, add_intercept=False)[source]
Bases:
object
Define a linear model with within-class correlations.
- Args:
- y
array
1D array of phenotype observations
- X
array
Design matrix for the fixed mean effects.
- labels
array
1D array of sample labels
- Returns:
model :
sibreg.model
- alpha_mle(tau, sigma2, compute_cov=False, xtx_out=False)[source]
Compute the MLE of alpha given variance parameters
- Args:
- sigma2
float
variance of model residuals
- tau
float
ratio of variance of model residuals to variance explained by mean differences between classes
- Returns:
- alpha
array
MLE of alpha
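For fixed variance parameters, the MLE of the fixed effects in a Gaussian model of this kind is the generalised least squares estimate. The sketch below shows that calculation under the definition of tau given above, so that the between-class variance is sigma2/tau. It builds the full covariance matrix for clarity, which is an assumption of this sketch and is only practical for small samples; alpha_gls_sketch is a hypothetical name.

```python
import numpy as np

def alpha_gls_sketch(y, X, labels, sigma2, tau):
    """GLS estimate of alpha and its covariance for given sigma2 and tau."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Covariance: sigma2 on the diagonal plus sigma2/tau shared within a class
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    Sigma = sigma2 * (np.eye(y.size) + same_class / tau)
    Sigma_inv_X = np.linalg.solve(Sigma, X)
    xtx = X.T @ Sigma_inv_X                       # X' Sigma^{-1} X
    alpha = np.linalg.solve(xtx, Sigma_inv_X.T @ y)
    alpha_cov = np.linalg.inv(xtx)                # sampling covariance of alpha
    return alpha, alpha_cov

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
labels = np.array([0, 0, 1, 1])
y = X @ np.array([1.0, 2.0])                      # noiseless toy phenotype
alpha, alpha_cov = alpha_gls_sketch(y, X, labels, sigma2=1.0, tau=2.0)
```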
- likelihood_and_gradient(sigma2, tau)[source]
Compute the loss function, which is -2 times the log-likelihood, along with its gradient
- Args:
- sigma2
float
variance of model residuals
- tau
float
ratio of variance of model residuals to variance explained by mean differences between classes
- Returns:
- L, grad
float
loss function and gradient, divided by sample size
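As a rough illustration of a loss of this kind, the sketch below evaluates -2 times the Gaussian log-likelihood divided by sample size, assuming the between-class variance is sigma2/tau. It differs from the method above in several labelled ways: it takes alpha as an explicit argument, it builds the full covariance matrix, it includes the constant term, and it does not compute a gradient. The name neg2_loglik_sketch is hypothetical.

```python
import numpy as np

def neg2_loglik_sketch(y, X, labels, alpha, sigma2, tau):
    """-2 * Gaussian log-likelihood / n for the within-class correlation model."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = y.size
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    Sigma = sigma2 * (np.eye(n) + same_class / tau)
    resid = y - X @ np.asarray(alpha, dtype=float)
    _, logdet = np.linalg.slogdet(Sigma)          # log-determinant of Sigma
    quad = resid @ np.linalg.solve(Sigma, resid)  # resid' Sigma^{-1} resid
    return (logdet + quad + n * np.log(2 * np.pi)) / n

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
alpha = np.array([1.0, 2.0])
y = X @ alpha
# All classes have size 1 and sigma2 * (1 + 1/tau) = 1, so Sigma is the identity
loss = neg2_loglik_sketch(y, X, labels=[0, 1, 2], alpha=alpha, sigma2=0.5, tau=1.0)
```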
- optimize_model(init_params)[source]
Find the parameters that minimise the loss function for a given regularisation parameter
- Args:
- init_params
array
initial values for residual variance (sigma^2_epsilon) followed by ratio of residual variance to within-class variance (tau)
- Returns:
- optim
dict
dictionary with keys: ‘success’, whether optimisation was successful (bool); ‘warnflag’, output of L-BFGS-B algorithm giving warnings; ‘sigma2’, MLE of residual variance; ‘tau’, MLE of ratio of residual variance to within-class variance; ‘likelihood’, maximum of likelihood.
- class sibreg.sibreg.pgs(snp_ids, weights, alleles)[source]
Bases:
object
Define a polygenic score based on a set of SNPs with weights and ref/alt allele pairs.
- Args:
- snp_ids
array
vector of SNP ids
- weights
array
vector of weights of equal length to snp_ids
- alleles
array
[L x 2] matrix of ref and alt alleles for the SNPs. L must match size of snp_ids
- Returns:
pgs :
sibreg.pgs
- compute(garray, cols=None)[source]
Compute polygenic score values from a given genotype array. Finds the SNPs in the genotype array that have weights in the pgs and matching alleles, and computes the PGS based on these SNPs and the weights after allele-matching.
- Args:
- garray
sibreg.gtarray
genotype array to compute PGS values for
- cols
numpy:numpy.array
names to give the columns in the output gtarray
- Returns:
- pg
sibreg.gtarray
2d gtarray with PGS values. If a 3d gtarray is input, then each column corresponds to the second dimension on the input gtarray (for example, individual, paternal, maternal PGS). If a 2d gtarray is input, then there will be only one column in the output gtarray. The names given in ‘cols’ are stored in ‘sid’ attribute of the output.
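The SNP and allele matching can be sketched for a 2-dimensional genotype matrix as follows. This is an illustration, not the sibreg implementation: compute_pgs_sketch and its arguments are hypothetical, and it assumes that a SNP whose allele pair is reversed relative to the PGS is handled by flipping the sign of its weight, which changes the score only by an additive constant. SNPs with non-matching alleles are skipped.

```python
import numpy as np

def compute_pgs_sketch(gts, gts_sid, gts_alleles, pgs_sid, weights, pgs_alleles):
    """Weighted sum of genotypes over SNPs shared with the PGS (illustrative only)."""
    gts = np.asarray(gts, dtype=float)
    pgs_row = {sid: i for i, sid in enumerate(pgs_sid)}
    score = np.zeros(gts.shape[0])
    n_used = 0
    for j, sid in enumerate(gts_sid):
        i = pgs_row.get(sid)
        if i is None:
            continue                      # SNP has no weight in the PGS
        a1, a2 = gts_alleles[j]
        b1, b2 = pgs_alleles[i]
        if (a1, a2) == (b1, b2):
            w = weights[i]
        elif (a1, a2) == (b2, b1):
            w = -weights[i]               # allele order reversed: flip the sign
        else:
            continue                      # alleles do not match: skip the SNP
        score += w * gts[:, j]
        n_used += 1
    return score, n_used

gts = [[0, 1, 2], [2, 1, 0]]
gts_sid = ["rs1", "rs2", "rs3"]
gts_alleles = [("A", "G"), ("C", "T"), ("A", "C")]
pgs_sid = ["rs1", "rs2", "rs4"]
weights = [0.5, 1.0, 2.0]
pgs_alleles = [("A", "G"), ("T", "C"), ("G", "T")]
score, n_used = compute_pgs_sketch(gts, gts_sid, gts_alleles, pgs_sid, weights, pgs_alleles)
```

For a 3-dimensional input, the same weighted sum would be applied to each slice along the second axis, giving one PGS column per family member.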
- sibreg.sibreg.read_phenotype(phenofile, missing_char='NA', phen_index=1)[source]
Read a phenotype file and remove missing values.
- Args:
- phenofile
str
path to plain text phenotype file with columns FID, IID, phenotype1, phenotype2, …
- missing_char
str
The character that denotes a missing phenotype value; ‘NA’ by default.
- phen_index
int
The index of the phenotype (counting from 1) if multiple phenotype columns present in phenofile
- Returns:
- y
array
vector of non-missing phenotype values from specified column of phenofile
- pheno_ids
array
corresponding vector of individual IDs (IID)
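The file layout and the 1-based phen_index can be illustrated with a short parser. This is a sketch under stated assumptions: read_phenotype_sketch is a hypothetical name, it reads from an open file handle rather than a path, and it assumes whitespace-separated columns with no header row.

```python
import io
import numpy as np

def read_phenotype_sketch(handle, missing_char="NA", phen_index=1):
    """Return non-missing values of one phenotype column and the matching IIDs."""
    ids, values = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue                          # skip blank lines
        value = fields[1 + phen_index]        # columns: FID, IID, phenotype1, ...
        if value == missing_char:
            continue                          # drop individuals with missing values
        ids.append(fields[1])
        values.append(float(value))
    return np.array(values), np.array(ids)

text = "f1 i1 1.0 NA\nf1 i2 NA 2.0\nf2 i3 3.0 4.0\n"
y, pheno_ids = read_phenotype_sketch(io.StringIO(text), phen_index=1)
```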
- sibreg.sibreg.simulate(n, alpha, sigma2, tau)[source]
Simulate from a linear model with correlated observations within-class. The mean for each class is drawn from a normal distribution.
- Args:
- n
int
sample size
- alpha
array
value of regression coefficients
- sigma2
float
variance of residuals
- tau
float
ratio of variance of residuals to variance of the distribution of class means
- Returns:
- model
sibreg.model
linear model with repeated observations
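A simulation of this kind can be sketched with NumPy as below. It is not the sibreg function: simulate_sketch returns plain arrays instead of a model object, the n_classes and seed parameters are hypothetical additions, and the way covariates and class labels are drawn is an assumption. Class means are drawn with variance sigma2/tau, following the definition of tau above.

```python
import numpy as np

def simulate_sketch(n, alpha, sigma2, tau, n_classes=None, seed=0):
    """Simulate y = X alpha + class mean + residual (illustrative only)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    n_classes = n // 2 if n_classes is None else n_classes
    labels = rng.integers(0, n_classes, size=n)       # class of each observation
    X = rng.normal(size=(n, alpha.size))              # covariates
    class_means = rng.normal(scale=np.sqrt(sigma2 / tau), size=n_classes)
    residuals = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = X @ alpha + class_means[labels] + residuals
    return y, X, labels

y, X, labels = simulate_sketch(200, [0.5, -1.0], sigma2=1.0, tau=2.0, n_classes=50)
```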