sibreg package

Subpackages

Submodules

sibreg.sibreg module

sibreg.sibreg.compute_pgs(par_gts_f, gts_f, pgs, sib=False, compute_controls=False)[source]

Compute a polygenic score (PGS) for the individuals with observed genotypes and observed/imputed parental genotypes.

Args:
par_gts_f : str

path to HDF5 file with imputed parental genotypes

gts_f : str

path to bed file with observed genotypes

pgs : sibreg.pgs

the PGS, defined by the weights for a set of SNPs and the alleles of those SNPs

sib : bool

Compute the PGS for genotyped individuals with at least one genotyped sibling and observed/imputed parental genotypes. Default False.

compute_controls : bool

Compute polygenic scores for control families (families with observed parental genotypes set to missing). Default False.

Returns:
pg : sibreg.gtarray

The polygenic score as a genotype array with columns: individual’s PGS, mean of their siblings’ PGS, observed/imputed paternal PGS, observed/imputed maternal PGS

sibreg.sibreg.convert_str_array(x)[source]

Convert an ASCII array to a unicode array (UTF-8)

sibreg.sibreg.encode_str_array(x)[source]

Encode a unicode array as an ASCII array
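The two conversions above can be illustrated with plain NumPy. This is only a sketch of the idea using `numpy.char.decode` and `numpy.char.encode`; it does not call sibreg, and the example IDs are made up.

```python
import numpy as np

# Byte-string (ASCII) IDs, as they might look after being read from an HDF5 file.
# The ID values are made up for illustration.
ascii_ids = np.array([b"fam1_ind1", b"fam1_ind2"], dtype="S")

# ASCII bytes -> unicode strings (decoded as UTF-8).
unicode_ids = np.char.decode(ascii_ids, "utf-8")

# Unicode strings -> ASCII bytes.
round_trip = np.char.encode(unicode_ids, "ascii")
```

The round trip leaves the IDs unchanged as long as they contain only ASCII characters.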

sibreg.sibreg.find_individuals_with_sibs(ids, ped, gts_ids, return_ids_only=False)[source]

Used in get_gts_matrix and get_fam_means to find the individuals in ids that have genotyped siblings.

sibreg.sibreg.find_par_gts(pheno_ids, ped, fams, gts_id_dict)[source]

Used in get_gts_matrix to find whether individuals have imputed or observed parental genotypes, and to find the indices of the observed/imputed parents in the observed/imputed genotype arrays. ‘par_status’ codes whether an individual has parents that are observed, imputed, or neither; ‘gt_indices’ records the relevant index of the parent in the observed/imputed genotype arrays; ‘fam_labels’ records the family of the individual based on the pedigree.

sibreg.sibreg.fit_sibreg_model(y, X, fam_labels, add_intercept=False, tau_init=1, return_model=True, return_vcomps=True, return_fixed=True)[source]

Compute the MLE for the fixed effects in a family-based linear mixed model.

Args:
y : array

vector of phenotype values

X : array

regression design matrix for fixed effects

fam_labels : array

vector of family labels: residual correlations in y are modelled between family members (that share a fam_label)

add_intercept : bool

whether to add an intercept to the fixed effect design matrix

Returns:
model : sibreg.model

the sibreg model object, if return_model=True

vcomps : float

the MLEs for the variance parameters: sigma2 (residual variance) and tau (ratio between sigma2 and family variance), if return_vcomps=True

alpha : array

MLE of fixed effects, if return_fixed=True

alpha_cov : array

sampling variance-covariance matrix for MLE of fixed effects, if return_fixed=True

sibreg.sibreg.get_fam_means(ids, ped, gts, gts_ids, remove_proband=True, return_famsizes=False)[source]

Used in get_gts_matrix to find the mean genotype in each sibship (family) for each SNP or for a PGS. The gtarray that is returned is indexed based on the subset of ids provided from sibships of size 2 or greater. If remove_proband=True, then the genotype/PGS of the index individual is removed from the fam_mean given for that individual.
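The effect of remove_proband=True can be sketched as a leave-one-out mean within each family. The function name `sibling_means` and its inputs are hypothetical and only illustrate the arithmetic; unlike the description above, this sketch returns NaN for individuals without siblings rather than dropping them.

```python
import numpy as np

def sibling_means(fam_labels, values):
    """Mean of each individual's siblings' values, excluding the individual."""
    fam_labels = np.asarray(fam_labels)
    values = np.asarray(values, dtype=float)
    # Integer family index per individual, plus the size of each family.
    _, fam_index, fam_sizes = np.unique(
        fam_labels, return_inverse=True, return_counts=True
    )
    # Sum of values within each family.
    fam_sums = np.bincount(fam_index, weights=values)
    n_sibs = fam_sizes[fam_index] - 1
    means = np.full(values.shape, np.nan)
    has_sib = n_sibs > 0
    # Leave-one-out mean: (family sum - own value) / (family size - 1).
    means[has_sib] = (fam_sums[fam_index][has_sib] - values[has_sib]) / n_sibs[has_sib]
    return means
```

For a sibship of three with values 0, 1 and 2, the first sibling's mean is (1 + 2) / 2 = 1.5.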

sibreg.sibreg.get_gts_matrix(par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, compute_controls=False, parsum=False, start=0, end=None, print_sample_info=False)[source]

Reads observed and imputed genotypes and constructs a family based genotype matrix for the individuals with observed/imputed parental genotypes, and if sib=True, at least one genotyped sibling.

Args:
par_gts_f : str

path to HDF5 file with imputed parental genotypes

gts_f : str

path to bed file with observed genotypes

snp_ids : numpy.ndarray

If provided, only obtains the subset of SNPs specified that are present in both imputed and observed genotypes

ids : numpy.ndarray

If provided, only obtains the ids with observed genotypes and imputed/observed parental genotypes (and observed sibling genotypes if sib=True)

sib : bool

Retrieve genotypes for individuals with at least one genotyped sibling along with the average of their siblings’ genotypes and observed/imputed parental genotypes. Default False.

compute_controls : bool

Compute polygenic scores for control families (families with observed parental genotypes set to missing). Default False.

parsum : bool

Return the sum of maternal and paternal observed/imputed genotypes rather than separate maternal/paternal genotypes. Default False.

Returns:
G : sibreg.gtarray

Genotype array for the subset of genotyped individuals with complete imputed/observed parental genotypes. The array is [N x k x L], where N is the number of individuals; k depends on whether sib=True and whether parsum=True; and L is the number of SNPs. If sib=False and parsum=False, then k=3 and this axis indexes individual’s genotypes, individual’s father’s imputed/observed genotypes, individual’s mother’s imputed/observed genotypes. If sib=True and parsum=False, then k=4, and this axis indexes the individual, the sibling, the paternal, and maternal genotypes in that order. If parsum=True and sib=False, then k=2, and this axis indexes the individual and sum of paternal and maternal genotypes; etc. If compute_controls=True, then a list is returned, where the first element is as above, and the following elements give equivalent genotyping arrays for control families where the mother has been set to missing, the father has been set to missing, and both parents have been set to missing.
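The [N x k x L] layout can be pictured with a small NumPy sketch. The genotypes here are random and the variable names are illustrative; the sketch only shows how the second axis changes between parsum=False and parsum=True when sib=False, not how sibreg builds the array.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 5, 4  # individuals, SNPs

# Random 0/1/2 genotypes standing in for proband and parental genotypes.
proband = rng.integers(0, 3, size=(N, L)).astype(float)
paternal = rng.integers(0, 3, size=(N, L)).astype(float)
maternal = rng.integers(0, 3, size=(N, L)).astype(float)

# sib=False, parsum=False: k = 3 (proband, father, mother).
G_sep = np.stack([proband, paternal, maternal], axis=1)

# sib=False, parsum=True: k = 2 (proband, father + mother).
G_sum = np.stack([proband, paternal + maternal], axis=1)
```

Here `G_sep` has shape (5, 3, 4) and `G_sum` has shape (5, 2, 4).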

sibreg.sibreg.get_gts_matrix_given_ped(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]

Used in get_gts_matrix: see get_gts_matrix for documentation

sibreg.sibreg.get_gts_matrix_given_ped_bgen(ped, par_gts_f, gts_f, snp_ids=None, ids=None, sib=False, parsum=False, start=0, end=None, verbose=False, print_sample_info=False)[source]

Used in get_gts_matrix: see get_gts_matrix for documentation

sibreg.sibreg.get_indices_given_ped(ped, fams, gts_ids, ids=None, sib=False, verbose=False)[source]

Used in get_gts_matrix_given_ped to get the ids of individuals with observed/imputed parental genotypes and, if sib=True, at least one genotyped sibling. It returns those ids along with the indices of the relevant individuals and their first degree relatives in the observed genotypes (observed indices), and the indices of the imputed parental genotypes for those individuals.

class sibreg.sibreg.gtarray(garray, ids, sid=None, alleles=None, pos=None, chrom=None, fams=None, par_status=None)[source]

Bases: object

Define a genotype or PGS array that stores individual IDs, family IDs, and SNP information.

Args:
garray : array

2 or 3 dimensional numpy array of genotypes/PGS values. First dimension is individuals. For a 2 dimensional array, the second dimension is SNPs or PGS values. For a 3 dimensional array, the second dimension indexes the individual and his/her relatives’ genotypes (for example: proband, paternal, and maternal); and the third dimension is the SNPs.

ids : array

vector of individual IDs

sid : array

vector of SNP ids, equal in length to the size of the last dimension of garray

alleles : array

[L x 2] matrix of ref and alt alleles for the SNPs. L must match size of sid

pos : array

vector of SNP positions; must match size of sid

chrom : array

vector of SNP chromosomes; must match size of sid

fams : array

vector of family IDs; must match size of ids

par_status : numpy.array

[N x 2] numpy matrix that records whether parents have observed or imputed genotypes/PGS, where N matches size of ids. The first column is for the father of that individual; the second column is for the mother of that individual. If the parent is neither observed nor imputed, the value is -1; if observed, 0; and if imputed, 1.

Returns:

G : sibreg.gtarray

add(garray)[source]

Adds another gtarray of the same dimension to this array and returns the sum. It matches IDs before summing.

compute_freqs()[source]

Computes the frequencies of the SNPs. Stored in self.freqs.

compute_info()[source]

diagonalise(inv_root)[source]

This will transform the genotype array based on the inverse square root of the phenotypic covariance matrix from the family based linear mixed model.

fill_NAs()[source]

This normalises the SNP columns to have mean-zero, then fills in NA values with zero.
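A minimal sketch of mean-centring followed by zero-filling, assuming missing genotypes are stored as NaN in a 2 dimensional [individuals x SNPs] array. The function name `center_and_fill` is hypothetical and this is not the package's implementation.

```python
import numpy as np

def center_and_fill(g):
    """Centre each SNP column on its observed mean, then set missing values to 0."""
    g = np.asarray(g, dtype=float)
    # Column means computed over non-missing entries only.
    col_means = np.nanmean(g, axis=0)
    centered = g - col_means
    # After centring, 0 is the column mean, so this amounts to mean imputation.
    centered[np.isnan(centered)] = 0.0
    return centered
```

Because the columns are centred first, filling with zero is equivalent to imputing each missing genotype with the column mean.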

filter(filter_pass)[source]

filter_ids(keep_ids, verbose=False)[source]

Keep only individuals with ids given by keep_ids

filter_info(min_info=0.99, verbose=False)[source]

filter_maf(min_maf=0.01, verbose=False)[source]

Keep SNPs with minor allele frequency (MAF) greater than min_maf and with a percentage of missing observations less than max_missing.
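The MAF part of this filter can be sketched as follows, assuming genotypes are coded as 0/1/2 allele counts with NaN for missing values. The function name `maf_filter` is hypothetical, and the missingness check is left out.

```python
import numpy as np

def maf_filter(g, min_maf=0.01):
    """Return the SNP columns whose minor allele frequency exceeds min_maf."""
    g = np.asarray(g, dtype=float)
    # Frequency of the counted allele: mean allele count divided by 2.
    freqs = np.nanmean(g, axis=0) / 2.0
    # The minor allele is whichever of the two alleles is rarer.
    maf = np.minimum(freqs, 1.0 - freqs)
    keep = maf > min_maf
    return g[:, keep], keep
```

A monomorphic SNP has MAF 0 and is always removed, whatever the threshold.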

filter_missingness(max_missing=5, verbose=False)[source]

mean_normalise()[source]

This normalises the SNPs/PGS columns to have mean-zero.

scale()[source]

This normalises the SNPs/PGS columns to have variance 1.

sibreg.sibreg.lik_and_grad(pars, *args)[source]

sibreg.sibreg.make_gts_matrix(gts, imp_gts, par_status, gt_indices, parsum=False)[source]

Used in get_gts_matrix to construct the family based genotype matrix given observed/imputed genotypes. ‘gt_indices’ has the indices in the observed/imputed genotype arrays; and par_status codes whether the parents are observed (0) or imputed (1).
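The role of the par_status codes can be sketched for a single parent. In this simplified, hypothetical layout, `par_status` and `gt_indices` are length-N vectors for one parent (the package's own arrays may be shaped differently), and a parent that is neither observed nor imputed (-1) is left as NaN.

```python
import numpy as np

def choose_parent_genotypes(gts, imp_gts, par_status, gt_indices):
    """Pick each parent's genotypes from the observed or the imputed array."""
    gts = np.asarray(gts, dtype=float)          # [n_observed x L]
    imp_gts = np.asarray(imp_gts, dtype=float)  # [n_imputed x L]
    par_status = np.asarray(par_status)
    gt_indices = np.asarray(gt_indices)
    out = np.full((par_status.shape[0], gts.shape[1]), np.nan)
    observed = par_status == 0
    imputed = par_status == 1
    # Observed parents are read from gts, imputed parents from imp_gts.
    out[observed] = gts[gt_indices[observed]]
    out[imputed] = imp_gts[gt_indices[imputed]]
    return out
```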

sibreg.sibreg.make_id_dict(x, col=0)[source]

Make a dictionary that maps from the values in the given column (col) to their row-index in the input array
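The mapping can be sketched with a dictionary comprehension. The function name `id_to_row` and the example pedigree rows are illustrative; in this sketch, a value that appears more than once maps to its last row.

```python
def id_to_row(rows, col=0):
    """Map the value in column `col` of each row to that row's index."""
    return {row[col]: i for i, row in enumerate(rows)}

# Made-up rows of (family ID, individual ID).
ped = [["f1", "i1"], ["f1", "i2"], ["f2", "i3"]]
iid_to_row = id_to_row(ped, col=1)
```

Looking up an ID in the dictionary is then a constant-time operation, which is cheaper than searching the array for every ID.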

sibreg.sibreg.match_observed_and_imputed_snps(gts_f, par_gts_f, bim, snp_ids=None, start=0, end=None)[source]

Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.

sibreg.sibreg.match_observed_and_imputed_snps_bgen(gts_f, par_gts_f, snp_ids=None, start=0, end=None)[source]

Used in get_gts_matrix_given_ped to match observed and imputed SNPs and return SNP information on shared SNPs. Removes SNPs that have duplicated SNP ids. in_obs_sid contains the SNPs in the imputed genotypes that are present in the observed SNPs; obs_sid_index contains the index in the observed SNPs of the common SNPs.

sibreg.sibreg.match_phenotype(G, y, pheno_ids)[source]

Match a phenotype to a genotype array by individual IDs.

Args:
G : gtarray

genotype array to match phenotype to

y : array

vector of phenotype values

pheno_ids : array

vector of individual IDs corresponding to phenotype vector, y

Returns:
y : array

vector of phenotype values matched by individual IDs to the genotype array
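Matching by ID can be sketched with a lookup dictionary. The function name `match_by_id` is hypothetical, and returning NaN for individuals without a phenotype is an assumption of this sketch, not documented behaviour of match_phenotype.

```python
import numpy as np

def match_by_id(target_ids, y, pheno_ids):
    """Reorder phenotype values y (indexed by pheno_ids) to follow target_ids."""
    y = np.asarray(y, dtype=float)
    row_of = {pid: i for i, pid in enumerate(pheno_ids)}
    # NaN marks individuals in target_ids that have no phenotype (assumption).
    return np.array([y[row_of[i]] if i in row_of else np.nan for i in target_ids])
```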

class sibreg.sibreg.model(y, X, labels, add_intercept=False)[source]

Bases: object

Define a linear model with within-class correlations.

Args:
y : array

1D array of phenotype observations

X : array

Design matrix for the fixed mean effects.

labels : array

1D array of sample labels

Returns:

model : sibreg.model

alpha_mle(tau, sigma2, compute_cov=False, xtx_out=False)[source]

Compute the MLE of alpha given variance parameters

Args:
sigma2 : float

variance of model residuals

tau : float

ratio of variance of model residuals to variance explained by mean differences between classes

Returns:
alpha : array

MLE of alpha
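For fixed variance parameters, the MLE of the fixed effects in a linear model with within-class correlation is the generalised least squares (GLS) estimate. The sketch below builds the full covariance matrix, assuming a between-class variance of sigma2 / tau (following the definition of tau above). The function name `gls_alpha` is hypothetical, and forming the dense matrix is for illustration only; it is not how the package computes the estimate.

```python
import numpy as np

def gls_alpha(y, X, labels, sigma2, tau):
    """GLS estimate of fixed effects and its sampling covariance."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # 1 where two observations share a class label, 0 otherwise.
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    # Residual variance on the diagonal plus shared class-mean variance sigma2 / tau.
    Sigma = sigma2 * np.eye(len(y)) + (sigma2 / tau) * same_class
    Sigma_inv_X = np.linalg.solve(Sigma, X)
    xtx = X.T @ Sigma_inv_X  # X' Sigma^{-1} X
    alpha = np.linalg.solve(xtx, Sigma_inv_X.T @ y)
    alpha_cov = np.linalg.inv(xtx)
    return alpha, alpha_cov
```

Predicted values for new covariates then follow as `X_new @ alpha`.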

likelihood_and_gradient(sigma2, tau)[source]

Compute the loss function, which is -2 times the log-likelihood, along with its gradient

Args:
sigma2 : float

variance of model residuals

tau : float

ratio of variance of model residuals to variance explained by mean differences between classes

Returns:
L, grad : float

loss function and gradient, divided by sample size
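A loss of this kind can be sketched under the assumption that it is -2 times the Gaussian log-likelihood divided by the sample size. The function name `neg2_loglik_per_sample` is hypothetical; the sketch keeps the constant n log(2π) term, which an implementation may drop, and it omits the gradient.

```python
import numpy as np

def neg2_loglik_per_sample(y, X, labels, alpha, sigma2, tau):
    """-2 * Gaussian log-likelihood divided by sample size (gradient omitted)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(y)
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    # Covariance: residual variance plus shared class-mean variance sigma2 / tau.
    Sigma = sigma2 * np.eye(n) + (sigma2 / tau) * same_class
    resid = y - X @ np.asarray(alpha, dtype=float)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = resid @ np.linalg.solve(Sigma, resid)
    return (logdet + quad + n * np.log(2.0 * np.pi)) / n
```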

optimize_model(init_params)[source]

Find the parameters that minimise the loss function for a given regularisation parameter

Args:
init_params : array

initial values for residual variance (sigma^2_epsilon) followed by ratio of residual variance to within-class variance (tau)

Returns:
optim : dict

dictionary with keys: ‘success’, whether optimisation was successful (bool); ‘warnflag’, output of L-BFGS-B algorithm giving warnings; ‘sigma2’, MLE of residual variance; ‘tau’, MLE of ratio of residual variance to within-class variance; ‘likelihood’, maximum of likelihood.

predict(X)[source]

Predict new observations based on model regression coefficients

Args:
X : array

matrix of covariates to predict from

Returns:
y : array

predicted values

set_alpha(alpha)[source]

sigma_inv_root(tau, sigma2)[source]

class sibreg.sibreg.pgs(snp_ids, weights, alleles)[source]

Bases: object

Define a polygenic score based on a set of SNPs with weights and ref/alt allele pairs.

Args:
snp_ids : array

vector of SNP ids

weights : array

vector of weights of equal length to snp_ids

alleles : array

[L x 2] matrix of ref and alt alleles for the SNPs. L must match size of snp_ids

Returns:

pgs : sibreg.pgs

compute(garray, cols=None)[source]

Compute polygenic score values from a given genotype array. Finds the SNPs in the genotype array that have weights in the pgs and matching alleles, and computes the PGS based on these SNPs and the weights after allele-matching.

Args:
garray : sibreg.gtarray

genotype array to compute PGS values for

cols : numpy.array

names to give the columns in the output gtarray

Returns:
pg : sibreg.gtarray

2d gtarray with PGS values. If a 3d gtarray is input, then each column corresponds to the second dimension on the input gtarray (for example, individual, paternal, maternal PGS). If a 2d gtarray is input, then there will be only one column in the output gtarray. The names given in ‘cols’ are stored in ‘sid’ attribute of the output.
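The relationship between input and output shapes can be sketched as a weighted sum over the SNP axis. The function name `pgs_from_genotypes` is hypothetical, and the sketch assumes that SNPs and alleles have already been matched to the weights, so the allele-matching step described above is left out.

```python
import numpy as np

def pgs_from_genotypes(G, weights):
    """Weighted sum over the last (SNP) axis of a 2d or 3d genotype array."""
    G = np.asarray(G, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # [N x L] @ [L] -> [N];  [N x k x L] @ [L] -> [N x k]
    scores = G @ weights
    if scores.ndim == 1:
        # A 2d input yields a single PGS column.
        scores = scores[:, None]
    return scores
```

For a 3d input ordered as (individual, paternal, maternal), the output columns are the individual, paternal and maternal PGS in that order.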

sibreg.sibreg.read_covariates(covar, missing_char='NA')[source]

sibreg.sibreg.read_phenotype(phenofile, missing_char='NA', phen_index=1)[source]

Read a phenotype file and remove missing values.

Args:
phenofile : str

path to plain text phenotype file with columns FID, IID, phenotype1, phenotype2, …

missing_char : str

The character that denotes a missing phenotype value; ‘NA’ by default.

phen_index : int

The index of the phenotype (counting from 1) if multiple phenotype columns present in phenofile

Returns:
y : array

vector of non-missing phenotype values from specified column of phenofile

pheno_ids : array

corresponding vector of individual IDs (IID)
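The file layout and the 1-based phen_index can be illustrated with a small parser. The function name `parse_phenotype` is hypothetical; the sketch assumes whitespace-delimited columns and no header line, neither of which is stated above.

```python
import io

def parse_phenotype(lines, missing_char="NA", phen_index=1):
    """Return non-missing phenotype values and the matching individual IDs."""
    y, pheno_ids = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # Columns: FID, IID, phenotype1, phenotype2, ... (phen_index counts from 1).
        value = fields[1 + phen_index]
        if value == missing_char:
            continue
        y.append(float(value))
        pheno_ids.append(fields[1])
    return y, pheno_ids

# Made-up file contents with two phenotype columns.
example = io.StringIO("f1 i1 1.5 NA\nf1 i2 NA 0.2\nf2 i3 -0.3 1.1\n")
y, pheno_ids = parse_phenotype(example, phen_index=1)
```

With phen_index=1 the individual i2 is dropped because its first phenotype is missing; with phen_index=2 it would be i1 instead.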

sibreg.sibreg.simulate(n, alpha, sigma2, tau)[source]

Simulate from a linear model with correlated observations within-class. The mean for each class is drawn from a normal distribution.

Args:
n : int

sample size

alpha : array

value of regression coefficients

sigma2 : float

variance of residuals

tau : float

ratio of variance of residuals to variance of distribution of between individual means

Returns:
model : regrnd.model

linear model with repeated observations
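The generative model can be sketched as follows. The function name `simulate_classes`, the equal class sizes, the standard normal covariates and the `seed` argument are all assumptions of this sketch; it returns raw arrays rather than a model object.

```python
import numpy as np

def simulate_classes(n_classes, class_size, alpha, sigma2, tau, seed=0):
    """Simulate y = X alpha + class mean + residual with equal-sized classes."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    n = n_classes * class_size
    labels = np.repeat(np.arange(n_classes), class_size)
    # Covariates drawn as standard normal (assumption).
    X = rng.normal(size=(n, alpha.shape[0]))
    # tau = sigma2 / (class-mean variance), so class means have variance sigma2 / tau.
    class_means = rng.normal(scale=np.sqrt(sigma2 / tau), size=n_classes)
    residuals = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = X @ alpha + class_means[labels] + residuals
    return y, X, labels
```

Observations that share a label share a class mean, which is what induces the within-class correlation.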

Module contents