sibreg package

Subpackages

sibreg.bin package

Submodules

sibreg.sibreg module

sibreg.sibreg.compute_pgs(par_gts_f, gts_f, pgs, sib=False, compute_controls=False)[source]

Compute a polygenic score (PGS) for the individuals with observed genotypes and observed/imputed parental genotypes.

Args:

par_gts_f (str): path to HDF5 file with imputed parental genotypes
gts_f (str): path to bed file with observed genotypes
pgs (sibreg.pgs): the PGS, defined by the weights for a set of SNPs and the alleles of those SNPs
sib (bool): Compute the PGS for genotyped individuals with at least one genotyped sibling and observed/imputed parental genotypes. Default False.
compute_controls (bool): Compute polygenic scores for control families (families with observed parental genotypes set to missing). Default False.

Returns:

pg (sibreg.gtarray): the polygenic score as a genotype array with columns: individual’s PGS, mean of their siblings’ PGS, observed/imputed paternal PGS, observed/imputed maternal PGS

sibreg.sibreg.convert_str_array(x)[source]: Convert an ascii array to unicode array (UTF-8)

sibreg.sibreg.encode_str_array(x)[source]: Encode a unicode array as an ascii array

sibreg.sibreg.find_individuals_with_sibs(ids, ped, gts_ids, return_ids_only=False)[source]: Used in get_gts_matrix and get_fam_means to find the individuals in ids that have genotyped siblings.

sibreg.sibreg.find_par_gts(pheno_ids, ped, fams, gts_id_dict)[source]: Used in get_gts_matrix to find whether individuals have imputed or observed parental genotypes, and to find the indices of the observed/imputed parents in the observed/imputed genotype arrays. ‘par_status’ codes whether an individual has parents that are observed, imputed, or neither; ‘gt_indices’ records the relevant index of the parent in the observed/imputed genotype arrays; ‘fam_labels’ records the family of the individual based on the pedigree.
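
The coding of ‘par_status’ can be illustrated with a small sketch. The helper code_par_status and the toy ID sets are hypothetical and not part of sibreg; the codes (-1 neither, 0 observed, 1 imputed) follow the par_status description given for the gtarray class, and giving observed genotypes precedence over imputed ones is an assumption of this sketch.

```python
def code_par_status(parent_id, observed_ids, imputed_ids):
    """Return 0 if the parent is observed, 1 if imputed, -1 if neither."""
    if parent_id in observed_ids:
        return 0  # assumption: observed takes precedence over imputed
    if parent_id in imputed_ids:
        return 1
    return -1


observed_ids = {"dad1", "mum1"}
imputed_ids = {"dad2"}

# (father, mother) pairs for three probands
parents = [("dad1", "mum1"), ("dad2", "mum2"), ("dad3", "mum3")]

# One row per proband: first column father, second column mother
par_status = [
    [code_par_status(p, observed_ids, imputed_ids) for p in pair]
    for pair in parents
]
print(par_status)  # [[0, 0], [1, -1], [-1, -1]]
```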

sibreg.sibreg.fit_sibreg_model(y, X, fam_labels, add_intercept=False, tau_init=1, return_model=True, return_vcomps=True, return_fixed=True)[source]

Compute the MLE for the fixed effects in a family-based linear mixed model.

Args:

y (array): vector of phenotype values
X (array): regression design matrix for fixed effects
fam_labels (array): vector of family labels: residual correlations in y are modelled between family members (that share a fam_label)
add_intercept (bool): whether to add an intercept to the fixed effect design matrix

Returns:

model (sibreg.model): the sibreg model object, if return_model=True
vcomps (float): the MLEs for the variance parameters: sigma2 (residual variance) and tau (ratio between sigma2 and family variance), if return_vcomps=True
alpha (array): MLE of fixed effects, if return_fixed=True
alpha_cov (array): sampling variance-covariance matrix for MLE of fixed effects, if return_fixed=True
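
To make the variance parameters concrete, the following sketch works through the arithmetic implied by the definition above, taking tau to be sigma2 divided by the family variance. The function names are hypothetical and not part of sibreg; the sketch also assumes that sigma2 is the individual-level residual component, so that two members of the same family share only the family effect.

```python
def family_variance(sigma2, tau):
    """Between-family variance implied by tau = sigma2 / family variance."""
    return sigma2 / tau


def sibling_correlation(tau):
    """Residual correlation between two members of the same family.

    cov = sigma2 / tau and var = sigma2 / tau + sigma2, so corr = 1 / (1 + tau).
    """
    return 1.0 / (1.0 + tau)


sigma2, tau = 0.75, 3.0
print(family_variance(sigma2, tau))  # 0.25
print(sibling_correlation(tau))      # 0.25
```

A large tau therefore corresponds to weak residual correlation within families, and a tau close to zero to strong correlation.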

sibreg.sibreg.get_fam_means(ids, ped, gts, gts_ids, remove_proband=True, return_famsizes=False)[source]: Used in get_gts_matrix to find the mean genotype in each sibship (family) for each SNP or for a PGS. The gtarray that is returned is indexed based on the subset of ids provided from sibships of size 2 or greater. If remove_proband=True, then the genotype/PGS of the index individual is removed from the fam_mean given for that individual.
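
The effect of remove_proband can be sketched on a single value per individual (for example one SNP or a PGS). This is an illustration in plain Python, not sibreg's implementation; sibship_means, fam_of and value_of are hypothetical names, and every ID in ids is assumed to have an observed value.

```python
from collections import defaultdict


def sibship_means(ids, fam_of, value_of, remove_proband=True):
    """Mean value in each individual's sibship; sibships of size 1 are skipped."""
    members = defaultdict(list)
    for iid, fam in fam_of.items():
        if iid in value_of:  # only individuals with an observed value count
            members[fam].append(iid)
    means = {}
    for iid in ids:
        sibs = members[fam_of[iid]]
        if len(sibs) < 2:
            continue
        total = sum(value_of[s] for s in sibs)
        if remove_proband:
            # leave the index individual out of their own family mean
            means[iid] = (total - value_of[iid]) / (len(sibs) - 1)
        else:
            means[iid] = total / len(sibs)
    return means


fam_of = {"a": "f1", "b": "f1", "c": "f1", "d": "f2"}
value_of = {"a": 0, "b": 1, "c": 2, "d": 1}

loo = sibship_means(["a", "b", "d"], fam_of, value_of)
full = sibship_means(["a", "b", "d"], fam_of, value_of, remove_proband=False)
print(loo)   # {'a': 1.5, 'b': 1.0}
print(full)  # {'a': 1.0, 'b': 1.0}
```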

sibreg.sibreg.get_gts_matrix(par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, compute_controls=False, parsum=False, start=0, end=None, print_sample_info=False)[source]

Reads observed and imputed genotypes and constructs a family based genotype matrix for the individuals with observed/imputed parental genotypes, and if sib=True, at least one genotyped sibling.

Args:

par_gts_f (str): path to HDF5 file with imputed parental genotypes
gts_f (str): path to bed file with observed genotypes
snp_ids (numpy.ndarray): If provided, only obtains the subset of SNPs specified that are present in both imputed and observed genotypes
ids (numpy.ndarray): If provided, only obtains the ids with observed genotypes and imputed/observed parental genotypes (and observed sibling genotypes if sib=True)
sib (bool): Retrieve genotypes for individuals with at least one genotyped sibling along with the average of their siblings’ genotypes and observed/imputed parental genotypes. Default False.
compute_controls (bool): Compute polygenic scores for control families (families with observed parental genotypes set to missing). Default False.
parsum (bool): Return the sum of maternal and paternal observed/imputed genotypes rather than separate maternal/paternal genotypes. Default False.

Returns:

G (sibreg.gtarray): Genotype array for the subset of genotyped individuals with complete imputed/observed parental genotypes. The array is [N x k x L], where N is the number of individuals; k depends on whether sib=True and whether parsum=True; and L is the number of SNPs. If sib=False and parsum=False, then k=3 and this axis indexes the individual’s genotypes, the father’s imputed/observed genotypes, and the mother’s imputed/observed genotypes. If sib=True and parsum=False, then k=4, and this axis indexes the individual, sibling, paternal, and maternal genotypes, in that order. If parsum=True and sib=False, then k=2, and this axis indexes the individual and the sum of paternal and maternal genotypes; etc. If compute_controls=True, then a list is returned, where the first element is as above, and the following elements give the equivalent genotype arrays for control families where the mother has been set to missing, the father has been set to missing, and both parents have been set to missing.
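
The layout of the second axis can be summarised with a small helper. This is a sketch, not part of sibreg: relative_axis_labels is a hypothetical name, and the sib=True, parsum=True case (k=3) is inferred from the pattern of the three cases spelled out above rather than stated in the documentation.

```python
def relative_axis_labels(sib=False, parsum=False):
    """Labels for the second (length-k) axis of the genotype array."""
    labels = ["individual"]
    if sib:
        labels.append("sibling")
    if parsum:
        labels.append("parental_sum")
    else:
        labels.extend(["paternal", "maternal"])
    return labels


for sib in (False, True):
    for parsum in (False, True):
        labels = relative_axis_labels(sib=sib, parsum=parsum)
        print(sib, parsum, len(labels), labels)
```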

sibreg.sibreg.get_gts_matrix_given_ped(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]: Used in get_gts_matrix: see get_gts_matrix for documentation

sibreg.sibreg.get_gts_matrix_given_ped_bgen(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]: Used in get_gts_matrix: see get_gts_matrix for documentation

sibreg.sibreg.get_indices_given_ped(ped, fams, gts_ids, ids=None, sib=False, verbose=False)[source]: Used in get_gts_matrix_given_ped to get the ids of individuals with observed/imputed parental genotypes and, if sib=True, at least one genotyped sibling. It returns those ids along with the indices of the relevant individuals and their first degree relatives in the observed genotypes (observed indices), and the indices of the imputed parental genotypes for those individuals.

class sibreg.sibreg.gtarray(garray, ids, sid=None, alleles=None, pos=None, chrom=None, fams=None, par_status=None)[source]

Bases: object

Define a genotype or PGS array that stores individual IDs, family IDs, and SNP information.

Args:

garray (array): 2 or 3 dimensional numpy array of genotypes/PGS values. First dimension is individuals. For a 2 dimensional array, the second dimension is SNPs or PGS values. For a 3 dimensional array, the second dimension indexes the individual and his/her relatives’ genotypes (for example: proband, paternal, and maternal); and the third dimension is the SNPs.
ids (array): vector of individual IDs
sid (array): vector of SNP ids, equal in length to the size of the last dimension of garray
alleles (array): [L x 2] matrix of ref and alt alleles for the SNPs. L must match size of sid
pos (array): vector of SNP positions; must match size of sid
chrom (array): vector of SNP chromosomes; must match size of sid
fams (array): vector of family IDs; must match size of ids
par_status (numpy.array): [N x 2] numpy matrix that records whether parents have observed or imputed genotypes/PGS, where N matches size of ids. The first column is for the father of that individual; the second column is for the mother of that individual. If the parent is neither observed nor imputed, the value is -1; if observed, 0; and if imputed, 1.

Returns:

G : sibreg.gtarray

add(garray)[source]: Adds another gtarray of the same dimension to this array and returns the sum. It matches IDs before summing.

compute_freqs()[source]: Computes the frequencies of the SNPs. Stored in self.freqs.

compute_info()[source]

diagonalise(inv_root)[source]: This will transform the genotype array based on the inverse square root of the phenotypic covariance matrix from the family based linear mixed model.

fill_NAs()[source]: This normalises the SNP columns to have mean-zero, then fills in NA values with zero.
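
Centring a column and then setting its missing entries to zero amounts to replacing each missing value with the column mean. A minimal plain-Python sketch of that idea follows; centre_and_fill is a hypothetical helper, and representing missing values as NaN in a list of per-SNP columns is an assumption of the sketch, not a description of how gtarray stores its data.

```python
import math


def centre_and_fill(columns):
    """Mean-centre each column on its non-missing entries, then set missing entries to 0."""
    out = []
    for col in columns:
        observed = [v for v in col if not math.isnan(v)]
        mean = sum(observed) / len(observed) if observed else 0.0
        out.append([0.0 if math.isnan(v) else v - mean for v in col])
    return out


nan = float("nan")
snps = [
    [0.0, 1.0, 2.0, nan],  # SNP 1: observed mean is 1.0
    [2.0, nan, 2.0, 2.0],  # SNP 2: observed mean is 2.0
]
filled = centre_and_fill(snps)
print(filled)  # [[-1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```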

filter(filter_pass)[source]

filter_ids(keep_ids, verbose=False)[source]: Keep only individuals with ids given by keep_ids

filter_info(min_info=0.99, verbose=False)[source]

filter_maf(min_maf=0.01, verbose=False)[source]: Filter SNPs, keeping those with minor allele frequency (MAF) greater than min_maf. Filtering on the percentage of missing observations uses max_missing, which is an argument of filter_missingness rather than of this method.
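
As an illustration of the MAF criterion, the sketch below computes the minor allele frequency of one SNP from 0/1/2 allele counts and applies the threshold. The helper names are hypothetical, missing genotypes are represented as None for simplicity, and at least one genotype is assumed to be observed; none of this is taken from the sibreg source.

```python
def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts, ignoring missing (None) entries."""
    observed = [g for g in genotypes if g is not None]
    freq = sum(observed) / (2 * len(observed))  # two alleles per individual
    return min(freq, 1 - freq)


def passes_maf(genotypes, min_maf=0.01):
    """True if the SNP's MAF is strictly greater than min_maf."""
    return minor_allele_frequency(genotypes) > min_maf


common = [0, 1, 2, 1, None, 0]  # allele frequency 4 / 10
rare = [0] * 199 + [1]          # allele frequency 1 / 400
print(passes_maf(common), passes_maf(rare))  # True False
```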

filter_missingness(max_missing=5, verbose=False)[source]

mean_normalise()[source]: This normalises the SNPs/PGS columns to have mean-zero.

scale()[source]: This normalises the SNPs/PGS columns to have variance 1.

sibreg.sibreg.lik_and_grad(pars, *args)[source]

sibreg.sibreg.make_gts_matrix(gts, imp_gts, par_status, gt_indices, parsum=False)[source]: Used in get_gts_matrix to construct the family based genotype matrix given observed/imputed genotypes. ‘gt_indices’ has the indices in the observed/imputed genotype arrays; and par_status codes whether the parents are observed (0) or imputed (1).

sibreg.sibreg.make_id_dict(x, col=0)[source]: Make a dictionary that maps from the values in the given column (col) to their row-index in the input array
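
The mapping described here is a plain value-to-row-index lookup. A minimal sketch, using a hypothetical helper id_to_row on a list of rows rather than sibreg's own function: if a value is repeated in the column, this version keeps the index of the last row containing it.

```python
def id_to_row(rows, col=0):
    """Map each value in column `col` to the index of its row."""
    return {row[col]: i for i, row in enumerate(rows)}


ped = [
    ["fam1", "id1"],
    ["fam1", "id2"],
    ["fam2", "id3"],
]
by_iid = id_to_row(ped, col=1)
print(by_iid)  # {'id1': 0, 'id2': 1, 'id3': 2}
```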

sibreg.sibreg.match_observed_and_imputed_snps(gts_f, par_gts_f, bim, snp_ids=None, start=0, end=None)[source]: Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.

sibreg.sibreg.match_observed_and_imputed_snps_bgen(gts_f, par_gts_f, snp_ids=None, start=0, end=None)[source]: Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.

sibreg.sibreg.match_phenotype(G, y, pheno_ids)[source]

Match a phenotype to a genotype array by individual IDs.

Args:

G (gtarray): genotype array to match phenotype to
y (array): vector of phenotype values
pheno_ids (array): vector of individual IDs corresponding to phenotype vector, y

Returns:

y (array): vector of phenotype values matched by individual IDs to the genotype array
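
Matching by ID amounts to reordering the phenotype vector so that it follows the order of the IDs in the genotype array. The sketch below shows the idea on plain lists; align_phenotype is a hypothetical helper, and dropping genotyped individuals that have no phenotype is an assumption of this sketch rather than documented behaviour.

```python
def align_phenotype(gt_ids, y, pheno_ids):
    """Return (kept_ids, matched_y) in the order of gt_ids, for IDs that have a phenotype."""
    row_of = {iid: i for i, iid in enumerate(pheno_ids)}
    kept = [iid for iid in gt_ids if iid in row_of]
    return kept, [y[row_of[iid]] for iid in kept]


gt_ids = ["c", "a", "x"]     # order of individuals in the genotype array
pheno_ids = ["a", "b", "c"]  # order of individuals in the phenotype file
y = [1.0, 2.0, 3.0]

kept_ids, y_matched = align_phenotype(gt_ids, y, pheno_ids)
print(kept_ids, y_matched)  # ['c', 'a'] [3.0, 1.0]
```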

class sibreg.sibreg.model(y, X, labels, add_intercept=False)[source]

Bases: object

Define a linear model with within-class correlations.

Args:

y (array): 1D array of phenotype observations
X (array): Design matrix for the fixed mean effects.
labels (array): 1D array of sample labels

Returns:

model : sibreg.model

alpha_mle(tau, sigma2, compute_cov=False, xtx_out=False)[source]

Compute the MLE of alpha given variance parameters

Args:

sigma2 (float): variance of model residuals
tau (float): ratio of variance of model residuals to variance explained by mean differences between classes

Returns:

alpha (array): MLE of alpha

likelihood_and_gradient(sigma2, tau)[source]

Compute the loss function, which is -2 times the log-likelihood, along with its gradient

Args:

sigma2 (float): variance of model residuals
tau (float): ratio of variance of model residuals to variance explained by mean differences between classes

Returns:

L, grad (float): loss function and gradient, divided by sample size

optimize_model(init_params)[source]

Find the parameters that minimise the loss function for a given regularisation parameter

Args:

init_params (array): initial values for residual variance (sigma^2_epsilon) followed by ratio of residual variance to within-class variance (tau)

Returns:

optim (dict): dictionary with keys: ‘success’, whether optimisation was successful (bool); ‘warnflag’, output of L-BFGS-B algorithm giving warnings; ‘sigma2’, MLE of residual variance; ‘tau’, MLE of ratio of residual variance to within-class variance; ‘likelihood’, maximum of likelihood.

predict(X)[source]

Predict new observations based on model regression coefficients

Args:

X (array): matrix of covariates to predict from

Returns:

y (array): predicted values

set_alpha(alpha)[source]

sigma_inv_root(tau, sigma2)[source]

class sibreg.sibreg.pgs(snp_ids, weights, alleles)[source]

Bases: object

Define a polygenic score based on a set of SNPs with weights and ref/alt allele pairs.

Args:

snp_ids (array): vector of SNP ids
weights (array): vector of weights of equal length to snp_ids
alleles (array): [L x 2] matrix of ref and alt alleles for the SNPs. L must match size of snp_ids

Returns:

pgs : sibreg.pgs

compute(garray, cols=None)[source]

Compute polygenic score values from a given genotype array. Finds the SNPs in the genotype array that have weights in the pgs and matching alleles, and computes the PGS based on these SNPs and the weights after allele-matching.

Args:

garray (sibreg.gtarray): genotype array to compute PGS values for
cols (numpy.array): names to give the columns in the output gtarray

Returns:

pg (sibreg.gtarray): 2d gtarray with PGS values. If a 3d gtarray is input, then each column corresponds to the second dimension on the input gtarray (for example, individual, paternal, maternal PGS). If a 2d gtarray is input, then there will be only one column in the output gtarray. The names given in ‘cols’ are stored in ‘sid’ attribute of the output.
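
The allele-matching step can be sketched for a single individual as follows. This is an illustration under stated assumptions, not the sibreg implementation: score_individual and the dictionary inputs are hypothetical, and negating the weight when the ref/alt pair is swapped is one common convention (it differs from scoring 2 minus the genotype only by a constant per SNP); the source does not say which convention sibreg uses.

```python
def score_individual(genotypes, gt_alleles, weights, pgs_alleles):
    """Weighted sum over SNPs whose allele pair matches directly or swapped."""
    total = 0.0
    for snp, g in genotypes.items():
        if snp not in weights:
            continue  # the PGS has no weight for this SNP
        if gt_alleles[snp] == pgs_alleles[snp]:
            total += weights[snp] * g
        elif gt_alleles[snp] == pgs_alleles[snp][::-1]:
            total -= weights[snp] * g  # swapped alleles: flip the sign of the weight
        # any other allele pair does not match, so the SNP is skipped
    return total


genotypes = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}
gt_alleles = {"rs1": ("A", "G"), "rs2": ("C", "T"), "rs3": ("A", "C"), "rs4": ("G", "T")}
weights = {"rs1": 0.5, "rs2": 0.2, "rs3": 1.0}
pgs_alleles = {"rs1": ("A", "G"), "rs2": ("T", "C"), "rs3": ("A", "T")}

# rs1 matches, rs2 is swapped, rs3 mismatches, rs4 has no weight
score = score_individual(genotypes, gt_alleles, weights, pgs_alleles)
print(round(score, 6))  # 0.8
```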

sibreg.sibreg.read_covariates(covar, missing_char='NA')[source]

sibreg.sibreg.read_phenotype(phenofile, missing_char='NA', phen_index=1)[source]

Read a phenotype file and remove missing values.

Args:

phenofile (str): path to plain text phenotype file with columns FID, IID, phenotype1, phenotype2, …
missing_char (str): The character that denotes a missing phenotype value; ‘NA’ by default.
phen_index (int): The index of the phenotype (counting from 1) if multiple phenotype columns present in phenofile

Returns:

y (array): vector of non-missing phenotype values from specified column of phenofile
pheno_ids (array): corresponding vector of individual IDs (IID)
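
The file layout and the 1-based phen_index can be illustrated with a small parser. This is a sketch with a hypothetical parse_phenotype helper, not sibreg's reader; it assumes whitespace-delimited columns and no header row.

```python
import io


def parse_phenotype(handle, missing_char="NA", phen_index=1):
    """Return (y, ids) for rows whose selected phenotype is not missing."""
    y, ids = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # columns are FID, IID, phenotype1, phenotype2, ...
        value = fields[1 + phen_index]
        if value == missing_char:
            continue
        y.append(float(value))
        ids.append(fields[1])  # keep the IID
    return y, ids


text = "f1 i1 1.5 NA\nf1 i2 NA 0.3\nf2 i3 2.0 0.7\n"
y1, ids1 = parse_phenotype(io.StringIO(text), phen_index=1)
y2, ids2 = parse_phenotype(io.StringIO(text), phen_index=2)
print(y1, ids1)  # [1.5, 2.0] ['i1', 'i3']
print(y2, ids2)  # [0.3, 0.7] ['i2', 'i3']
```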

sibreg.sibreg.simulate(n, alpha, sigma2, tau)[source]

Simulate from a linear model with correlated observations within-class. The mean for each class is drawn from a normal distribution.

Args:

n (int): sample size
alpha (array): value of regression coefficients
sigma2 (float): variance of residuals
tau (float): ratio of variance of residuals to variance of distribution of between individual means

Returns:

model (regrnd.model): linear model with repeated observations
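
The generative model described here can be sketched with the standard library alone. This is not the package's simulate function: simulate_classes, n_classes, per_class and seed are hypothetical, the covariates are drawn as standard normals purely for illustration, and the class means are given variance sigma2 / tau, following the definition of tau above.

```python
import random


def simulate_classes(n_classes, per_class, alpha, sigma2, tau, seed=0):
    """Simulate y = X alpha + class mean + residual; returns (y, X, labels)."""
    rng = random.Random(seed)
    class_sd = (sigma2 / tau) ** 0.5  # sd of the class means
    resid_sd = sigma2 ** 0.5          # sd of the residuals
    y, X, labels = [], [], []
    for c in range(n_classes):
        class_mean = rng.gauss(0.0, class_sd)  # shared by all members of class c
        for _ in range(per_class):
            x = [rng.gauss(0.0, 1.0) for _ in alpha]
            fixed = sum(a * xi for a, xi in zip(alpha, x))
            y.append(fixed + class_mean + rng.gauss(0.0, resid_sd))
            X.append(x)
            labels.append(c)
    return y, X, labels


y_sim, X_sim, labels_sim = simulate_classes(3, 2, alpha=[0.5, -1.0], sigma2=1.0, tau=2.0)
print(len(y_sim), labels_sim)  # 6 [0, 0, 1, 1, 2, 2]
```

Output of this shape (a phenotype vector, a design matrix and class labels) matches the inputs that fit_sibreg_model and the model class take.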
