Tutorial

Tutorial on inferring IBD between siblings, imputing missing parental genotypes, and performing family based GWAS and polygenic score analyses. Before working through the tutorial, please first install the package and read the guide.

Test data

If snipar has been installed succesfully, the command line scripts should be accessible as executables in your terminal. A script that should be accessible loads the tutorial example data into a specified directory. To create a directory called ‘example_data/’ in the current directory and load the example data into it, use the command:

snipar_example_data.py --dest example_data

You can create the example data directory elsewhere by changing the –dest argument. Please change your working directory to example_data/:

cd example_data

In this directory, there is some example data. The file phenotype.txt is a phenotype file containing a simulated phenotype with direct, paternal, and maternal effects, where 80% of the phenotypic variance is explained by the combined direct, paternal and maternal effects of the SNPs; and the pairwise correlations between the direct, paternal, and maternal effects are 0.5.

The genotype data has been simulated so that there are 3000 independent families, where 1000 have two siblings but no parents genotyped, 1000 have one parent genotyped and a 50% chance of having a genotyped sibling, and the final 1000 have both parents genotyped and a 50% chance of having a genotyped sibling. The example data includes observed genotype data formatted in both PLINK .bed format (chr_1.bed) and phased genotype data in .bgen format (chr_1.bgen with associated sample file chr_1.sample).

Inferring IBD between siblings

The first step is to infer the identity-by-descent (IBD) segments shared between siblings. snipar contains a script, ibd.py, that employs a Hidden Markov Model (HMM) to infer the IBD segments for the sibling pairs. The per-SNP genotyping error probability will be inferred from parent-offspring pairs when available; alternatively, a genotyping error probability can be provided using the –p_error option. By default, SNPs with genotyping error rates greater than 0.01 will be filtered out, but this threshold can be changed with the –max_error argument. To infer the IBD segments from the genotype data in chr_1.bed,use the following command

ibd.py --bed chr_@ --king king.kin0 --agesex agesex.txt --out chr_@ --threads 4 --ld_out

This will output the IBD segments to a gzipped text file chr_1.ibd.segments.gz. Genotype files split over multiple chromosomes can be specified using ‘@’ as a numerical wildcard character: see here. In this example, –bed chr_@ instructs ibd.py to search for .bed files chr_1.bed, chr_2.bed, …, chr_22.bed, where each bed file contains SNPs from the numbered chromosome. In this case, only one bed file is in example_data/, chr_1.bed. If bed files for multiple chromosomes are found, IBD will be inferred separately for each chromosome, with one output file per chromosome, with the chromosome number filling in the numerical wildcard in the –out argument.

The –king argument requires the address of the KING kinship file, and the –agesex argument requires the address of the agesex file.

The algorithm requires a genetic map to compute the probabilities of transitioning between different IBD states. If the genetic map positions (in cM) are provided in .bim file, the script will use these. Alternatively, the –map argument allows the user to specify a genetic map in the same format as used by SHAPEIT (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#formats) an example of which is provided in genetic_map.txt.

If no genetic map is provided, then the deCODE sex-averaged map on GRCh38 coordinates (Halldorsson, Bjarni V., et al. “Characterizing mutagenic effects of recombination through a sequence-level genetic map.” Science 363.6425 (2019).), which is distributed as part of snipar, will be used.

The algorithm computes LD scores of SNPs in order to account for correlations between SNPs. The ‘–ld_out’ argument writes the LD scores to file in the same format as LDSC (https://github.com/bulik/ldsc).

The user can also input a phased .bgen file. For example, to infer IBD from chr_1.bgen using the genetic map in genetic_map.txt, use this command:

ibd.py --bgen chr_@ --king king.kin0 --agesex agesex.txt --out chr_@ --threads 4 --ld_out --map genetic_map.txt

If the user has a pedigree file, they can input that instead of the –king and –agesex arguments. Siblings are inferred as individuals in the pedigree that share both parents. Using the example pedigree in pedigree.txt, you can infer IBD using this command:

ibd.py --bed chr_@ --pedigree pedigree.txt --map genetic_map.txt --out chr_@ --threads 4 --ld_out

Imputing missing parental genotypes

This is performed using the impute.py script. To impute the missing parental genotypes without using phase information, use this command:

impute.py --ibd chr_@.ibd --bed chr_@ --king king.kin0 --agesex agesex.txt --out chr_@ --threads 4

The script constructs a pedigree from the output of KING’s relatedness inference (king.kin0), and age and sex information (agesex.txt). The pedigree along with the IBD segments shared between siblings recorded in chr_1.ibd.segments.gz are used to impute missing parental genotypes from the observed sibling and parental genotypes in chr_1.bed. The imputed parental genotypes are output to a HDF5 file, chr_1.hdf5.

If phased haplotypes are available in .bgen format, the imputation can use these as input, which improves the accuracy of the imputation. To perform imputation from the phased .bgen file in example_data/, use the following command:

impute.py --ibd chr_@.ibd --bgen chr_@ --king king.kin0 --agesex agesex.txt --out chr_@ --threads 4

As with the ibd.py script, the impute_runner.py script can use a user input pedigree file (with the –pedigree argument) rather than the –king and –agesex arguments.

Family based GWAS

This is performed using the gwas.py script. To compute summary statistics for direct effects, non-transmitted coefficients (NTCs), and population effects for the SNPs in the .bed file, use this command:

gwas.py phenotype.txt --bed chr_@ --imp chr_@ --threads 4

This takes the observed genotypes in chr_1.bed and the imputed parental genotypes in chr_1.hdf5 and uses them to perform, for each SNP, a joint regression onto the proband’s genotype, the father’s (imputed/observed) genotype, and the mother’s (imputed/observed) genotype. This is done using a linear mixed model that models phenotypic correlations between siblings, where sibling relations are stored in the output of the imputation script. The ‘family variance estimate’ output is the phenotypic variance explained by mean differences between sibships, and the residual variance is the remaining phenotypic variance.

To use the .bgen file instead, use this command:

gwas.py phenotype.txt --bgen chr_@ --imp chr_@ --threads 4

The script outputs summary statistics in a gzipped text file: chr_1.sumstats.gz. In addition to the text summary statistics, HDF5 format summary statistics are also output to chr_1.sumstats.hdf5

Now we have estimated SNP effects. To compare to the true effects, run

python estimate_sim_effects.py chr_1.sumstats.hdf5 phenotype.effects.txt

This should print estimates of the bias of the effect estimates.

The bias estimates for direct, paternal NTCs, maternal NTCs, and average NTCs should not be statistically significantly different from zero (with high probability). Population effects (as estimated by standard GWAS) are biased estimates of direct effects for this simulated phenotype because they also include indirect genetic effects.

GWAS can also be performed without imputed parental genotypes. In this case, only probands with genotypes for both parents available will be used. In order to do this, one must provide a pedigree to gwas.py, as in:

gwas.py phenotype.txt --out trios_ --bgen chr_@ --pedigree pedigree.txt --threads 4

Correlations between effects

snipar provides a script (correlate.py) to compute correlations between direct and population effects and between direct effects and average NTCs. To compute these correlations from the effects estimated in this tutorial (output by gwas.py to chr_1.sumstats.gz) using the LD scores computed by ibd.py (and output to chr_1.l2.ldscore.gz), use the following command:

correlate.py chr_@ effect --ldscores chr_@

This should give a correlation between direct effects and average NTCs of close to 0.5. The estimated correlations and their standard errors, estimated by block-jacknife, are output to effect_corrs.txt.

The method is similar to LDSC, but correlates the marginal effects (not joint-fit effects adjusted for population stratification, as LDSC attempts to use), adjusting for the known sampling variance-covariance matrix of the effects. The LD scores are used for weighting. LD scores output by LDSC can be input. If LD scores are not available, they can be computed from .bed files by providing them through the –bed argument to correlate.py.

Polygenic score analyses

For an exercise involving polygenic score analysis, please see the Simulation Exercse.