Command Line Scripts

ibd.py

Infers identity-by-descent (IBD) segments shared between full-siblings.

Minimally: the script requires observed sibling genotypes in either .bed or .bgen format, along with information on the relations present in the dataset, which can be provided using a pedigree file or the results of KING kinship inference along with age and sex information (from which a pedigree can be constructed).

Args:

‘-h’, ‘--help’: show this help message and exit
‘--bgen’ str: Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).
‘--bed’ str: Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).
‘--chr_range’: Chromosomes to be analysed. Should be a series of ranges in x-y format or integers.
‘--king’ str: Address of the KING file
‘--agesex’ str: Address of file with age and sex information
‘--pedigree’ str: Address of pedigree file
‘--map’ str: None
‘--out’ str, default=ibd: The IBD segments will be output to this path, one file for each chromosome. If the path contains ‘#’, the ‘#’ will be replaced with the chromosome number. Otherwise, the segments will be output to the given path with file names chr_1.ibd.segments.gz, chr_2.ibd.segments.gz, etc.
‘--p_error’ float: Probability of genotyping error. By default, this is estimated from genotyped parent-offspring pairs.
‘--min_length’ float, default=0.01: Smooth segments with length less than min_length (cM)
‘--threads’ int: Number of threads to use for IBD inference. Uses all available by default.
‘--min_maf’ float, default=0.01: Minimum minor allele frequency
‘--max_missing’ float, default=5: Ignore SNPs with greater percent missing calls than max_missing (default 5)
‘--max_error’ float, default=0.01: Maximum per-SNP genotyping error probability
‘--ibdmatrix’: Output a matrix of SNP IBD states (in addition to segments file)
‘--ld_out’: Output LD scores of SNPs (used internally for weighting).
‘--chrom’ int: The chromosome of the input .bgen file. Helpful if inputting a single .bgen file without chromosome information.
‘--batches’ int, default=1: Number of batches to split the data (by sibpair) into for IBD inference. Useful for large datasets.
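
To illustrate how the @ placeholder and chr_range fit together, here is a minimal Python sketch. It is not snipar's own code: the helper names parse_chr_range and expand_paths are hypothetical, and the sketch assumes that a range written x-y includes both endpoints.

```python
def parse_chr_range(tokens):
    """Expand tokens such as ["1-3", "7"] into [1, 2, 3, 7].

    Assumption: a range "x-y" includes both x and y.
    """
    chromosomes = []
    for token in tokens:
        if "-" in token:
            start, end = token.split("-")
            chromosomes.extend(range(int(start), int(end) + 1))
        else:
            chromosomes.append(int(token))
    return chromosomes


def expand_paths(template, chromosomes):
    """Replace '@' in the template with each chromosome number.

    A template without '@' is treated as a single file and returned unchanged.
    """
    if "@" not in template:
        return [template]
    return [template.replace("@", str(c)) for c in chromosomes]


if __name__ == "__main__":
    chroms = parse_chr_range(["1-3", "7"])
    print(expand_paths("genotypes/chr_@", chroms))
```

Under these assumptions, a chr_range of 1-3 7 combined with the address genotypes/chr_@ would resolve to one path per chromosome for chromosomes 1, 2, 3 and 7.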

Results:

IBD segments: For each chromosome, a gzipped text file containing the IBD segments for the siblings is output.

impute.py

This script performs imputation of missing parental genotypes from observed genotypes in a family. It can impute missing parents from families with no genotyped parents but at least two genotyped siblings, or one genotyped parent and one or more genotyped offspring. To specify the siblings, one can either provide a pedigree file (--pedigree option) or the relatedness inference output from KING with the --related --degree 1 options along with age and sex information.

The pedigree file is a plain text file with a header and columns: FID (family ID), IID (individual ID), FATHER_ID (ID of father), MOTHER_ID (ID of mother). Note that individuals are assumed to have unique individual IDs (IID). Siblings are identified as individuals that have the same FID and the same FATHER_ID and MOTHER_ID.
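
The sibling rule above can be sketched in Python using only the standard library. This is an illustration, not snipar's implementation: the function name sibships is hypothetical, the file is assumed to be single-space separated, and individuals with a missing parent ID (assumed here to be coded as 0) are skipped.

```python
import csv
import io
from collections import defaultdict


def sibships(pedigree_text, nan="0"):
    """Group IIDs that share FID, FATHER_ID and MOTHER_ID.

    Assumptions: single-space separated columns with a header row, and
    `nan` marks a missing parent ID (such rows are skipped).
    """
    reader = csv.DictReader(io.StringIO(pedigree_text), delimiter=" ")
    groups = defaultdict(list)
    for row in reader:
        if row["FATHER_ID"] == nan or row["MOTHER_ID"] == nan:
            continue
        key = (row["FID"], row["FATHER_ID"], row["MOTHER_ID"])
        groups[key].append(row["IID"])
    # Only groups with at least two members are sibships.
    return {key: iids for key, iids in groups.items() if len(iids) > 1}


example = """FID IID FATHER_ID MOTHER_ID
1 1_1 1_P 1_M
1 1_2 1_P 1_M
1 1_P 0 0
1 1_M 0 0
2 2_1 2_P 2_M
"""

if __name__ == "__main__":
    print(sibships(example))
```

In the toy pedigree, only family 1 contains two individuals with the same pair of parents, so it is the only sibship returned.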

Use the --king option to provide the KING relatedness inference output (usually has suffix .kin0) and the --agesex option to provide the age and sex information. The script constructs a pedigree from this information and includes it in the HDF5 output.

Args:

‘-h’, ‘--help’: show this help message and exit
‘-c’: Duplicates offspring of families with more than one offspring and both parents, and adds ‘_’ to the start of their FIDs. These can be used for testing the imputation. The tests.test_imputation.imputation_test uses these.
‘-silent_progress’: Hides the percentage of progress from logging
‘-use_backup’: Use backup imputation where no IBD information is available
‘--ibd’ str: Address of the IBD file without suffix. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).
‘--ibd_is_king’: If not provided, the IBD input is assumed to be in snipar format. Otherwise, it is assumed to be in KING format with an allsegs file.
‘--bgen’ str: Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).
‘--bed’ str: Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).
‘--chr_range’: Chromosomes to be imputed. Should be a series of ranges in x-y format or integers.
‘--bim’ str: Address of a bim file containing positions of SNPs, if the address is different from the bim file of the genotypes
‘--fam’ str: Address of a fam file containing sample information, if the address is different from the fam file of the genotypes
‘--out’ str, default=parent_imputed: Writes the result of imputation for chromosome i to outprefix{i}
‘--start’ int: The script can do the imputation on a slice of each chromosome. This is the start of that slice (inclusive).
‘--end’ int: The script can do the imputation on a slice of each chromosome. This is the end of that slice (inclusive).
‘--pedigree’ str: Address of the pedigree file. The pedigree file is a ‘ ’ separated csv with columns ‘FID’, ‘IID’, ‘FATHER_ID’, ‘MOTHER_ID’. The default NaN value of the pedigree file is ‘0’. If your NaN value is something else, be sure to specify it with the --pedigree_nan option.
‘--king’ str: Address of a kinship file in KING format. The kinship file is a ‘ ’ separated csv with columns “FID1”, “ID1”, “FID2”, “ID2”, “InfType”. Each row represents a relationship between two individuals. The InfType column states the relationship between the two individuals. The only relationships that matter for this script are full sibling and parent-offspring, which are shown by ‘FS’ and ‘PO’ respectively. This file is used in creating a pedigree file and can be generated using KING.
‘--agesex’ str: Address of the agesex file. This is a ‘ ’ separated csv with columns “FID”, “IID”, “FATHER_ID”, “MOTHER_ID”, “sex”, “age”. Each row contains the age and sex of one individual. Male and female sex should be represented with ‘M’ and ‘F’. The age column is used for distinguishing between parent and child in a parent-offspring relationship inferred from the kinship file. ID1 is a parent of ID2 if there is a ‘PO’ relationship between them and ID1 is at least 12 years older than ID2.
‘--pcs’ str: Address of the PCs file with header “FID IID PC1 PC2 …”. If provided, MAFs will be estimated from PCs
‘--pc_num’ int: Number of PCs to consider
‘-find_optimal_pc’: Uses the Akaike information criterion to find the optimal number of PCs to use for MAF estimation.
‘--threads’ int, default=1: Number of threads to be used. This should not exceed the number of available cores. The default number of threads is one.
‘--processes’ int, default=1: Number of processes for imputing chromosomes. Each chromosome is done on one process.
‘--chunks’ int, default=1: Number of chunks in which to load the data (per process).
‘--output_compression’ str: Optional compression algorithm used in writing the output as an hdf5 file. It can be either gzip or lzf
‘--output_compression_opts’ int: Additional settings for the optional compression algorithm. Take a look at the create_dataset function of the h5py library for more information.
‘--pedigree_nan’ str, default=0: The value representing NaN in the pedigree.
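
The age rule described for the agesex file can be illustrated with a short Python sketch. It is not the script's actual pedigree construction: the function name parent_child_pairs and the input layout are hypothetical, and checking both orderings of each pair is an assumption made for the example.

```python
MIN_AGE_GAP = 12  # years, as stated in the description of the agesex file


def parent_child_pairs(king_rows, ages):
    """Return (parent, child) pairs from 'PO' relationships.

    king_rows: iterable of (ID1, ID2, InfType) tuples.
    ages: dict mapping individual ID to age.
    Assumption: either member of a pair may be the parent, so both
    orderings are checked; pairs with a smaller age gap are left out.
    """
    pairs = []
    for id1, id2, inf_type in king_rows:
        if inf_type != "PO":
            continue
        if ages[id1] - ages[id2] >= MIN_AGE_GAP:
            pairs.append((id1, id2))
        elif ages[id2] - ages[id1] >= MIN_AGE_GAP:
            pairs.append((id2, id1))
    return pairs


if __name__ == "__main__":
    rows = [("a", "b", "PO"), ("c", "d", "FS"), ("e", "f", "PO")]
    ages = {"a": 50, "b": 20, "c": 30, "d": 28, "e": 10, "f": 40}
    print(parent_child_pairs(rows, ages))
```

Full-sibling rows are ignored here because they carry no parent-child direction.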

Results:

HDF5 files: For each chromosome i, an HDF5 file is created at outprefix{i}. This file contains the imputed genotypes, the positions of SNPs, the columns of the resulting bim file, the contents of the resulting bim file, the pedigree table, and the family IDs of the imputed parents, under the keys ‘imputed_par_gts’, ‘pos’, ‘bim_columns’, ‘bim_values’, ‘pedigree’, and ‘families’ respectively, along with a ‘parental_status’ key. There are also other details of the imputation in the resulting file. For more explanation, see the documentation of snipar.imputation.impute_from_sibs.impute.

gwas.py

Infers direct effects, non-transmitted coefficients (NTCs), and population effects of genome-wide SNPs on a phenotype.

Minimally: the script requires observed genotypes on phenotyped individuals and their parents, and/or parental genotypes imputed by snipar’s impute.py script, along with a phenotype file.

Args:

‘-h’, ‘--help’

show this help message and exit

(positional argument) str: Location of the phenotype file

‘--bgen’ str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--bed’ str

Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--imp’ str

Address of hdf5 files with imputed parental genotypes (without .hdf5 suffix). If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range (chr_range is an optional parameter for this script).

‘--chr_range’

Chromosomes to be analysed. Should be a series of ranges in x-y format or integers.

‘--out’ str, default=./

The summary statistics will be output to this path, one file for each chromosome. If the path contains ‘@’, the ‘@’ will be replaced with the chromosome number. Otherwise, the summary statistics will be output to the given path with file names chr_1.sumstats.gz, chr_2.sumstats.gz, etc. for the text sumstats, and chr_1.sumstats.hdf5, etc. for the HDF5 sumstats

‘--pedigree’ str

Address of pedigree file. Must be provided if not providing imputed parental genotypes.

‘--parsum’

Regress onto proband and sum of (imputed/observed) maternal and paternal genotypes. Default uses separate paternal and maternal genotypes when available.

‘--fit_sib’

Fit indirect effect from sibling

‘--covar’ str

Path to file with covariates: plain text file with columns FID, IID, covar1, covar2, ...

‘--phen_index’ int, default=1

If the phenotype file contains multiple phenotypes, which phenotype should be analysed (default 1, first)

‘--min_maf’ float, default=0.01

Ignore SNPs with minor allele frequency below min_maf (default 0.01)

‘--threads’ int

Number of threads to use. Uses all available by default.

‘--max_missing’ float, default=5

Ignore SNPs with greater percent missing calls than max_missing (default 5)

‘--batch_size’ int, default=100000

Batch size of SNPs to load at a time (reduce to lower memory requirements)

‘--no_hdf5_out’

Suppress HDF5 output of summary statistics

‘--no_txt_out’

Suppress text output of summary statistics

‘--missing_char’ str, default=NA

Missing value string in phenotype file (default NA)

‘--tau_init’ float, default=1

Initial value for ratio between shared family environmental variance and residual variance
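
The min_maf and max_missing filters amount to a per-SNP check, which the following Python sketch illustrates. It is a simplified stand-in rather than the script's code: the function name snp_passes is hypothetical, genotypes are assumed to be allele counts (0, 1, 2) with None for a missing call, and the allele frequency is assumed to be estimated from the observed calls only.

```python
def snp_passes(genotypes, min_maf=0.01, max_missing=5.0):
    """Apply the missingness and minor allele frequency filters to one SNP.

    genotypes: list of allele counts (0, 1, 2), with None for a missing call.
    max_missing is a percentage, matching the option's documented default of 5.
    """
    if not genotypes:
        return False
    observed = [g for g in genotypes if g is not None]
    missing_pct = 100.0 * (len(genotypes) - len(observed)) / len(genotypes)
    if missing_pct > max_missing or not observed:
        return False
    # Each individual carries two alleles, hence the factor of 2.
    freq = sum(observed) / (2.0 * len(observed))
    maf = min(freq, 1.0 - freq)
    return maf >= min_maf


if __name__ == "__main__":
    print(snp_passes([0, 1, 2, 1] * 10))     # common SNP, no missing calls
    print(snp_passes([0] * 99 + [1]))        # rare SNP
    print(snp_passes([None] * 10 + [1] * 90))  # too many missing calls
```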

Results:

sumstats.gz: For each chromosome, a gzipped text file containing the SNP level summary statistics.

pgs.py

Infers direct effects, non-transmitted coefficients (NTCs), and population effects of a PGS on a phenotype.

Minimally: the script requires observed genotypes on individuals and their parents, and/or parental genotypes imputed by snipar’s impute.py script, along with a SNP weights file.

Args:

‘-h’, ‘--help’

show this help message and exit

(positional argument) str: Prefix for computed PGS file and/or regression results files

‘--bgen’ str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--bed’ str

Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--imp’ str

Address of hdf5 files with imputed parental genotypes (without .hdf5 suffix). If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range (chr_range is an optional parameter for this script).

‘--chr_range’

Chromosomes to be analysed. Should be a series of ranges in x-y format or integers.

‘--pedigree’ str

Address of pedigree file. Must be provided if not providing imputed parental genotypes.

‘--weights’ str

Location of the PGS allele weights

‘--SNP’ str, default=SNP

Name of column in weights file with SNP IDs

‘--beta_col’ str, default=b

Name of column with betas/weights for each SNP

‘--A1’ str, default=A1

Name of column with the allele that the betas/weights are given with respect to

‘--A2’ str, default=A2

Name of column with alternative allele

‘--sep’ str

Column separator in weights file. If not provided, an attempt to determine this will be made.

‘--phenofile’ str

Location of the phenotype file

‘--pgs’ str

Location of the pre-computed PGS file

‘--covar’ str

Path to file with covariates: plain text file with columns FID, IID, covar1, covar2, ...

‘--fit_sib’

Fit indirect effects from siblings

‘--parsum’

Use the sum of maternal and paternal PGS in the regression (useful when imputed from sibling data alone)

‘--grandpar’

Calculate imputed/observed grandparental PGS for individuals with both parents genotyped

‘--gparsum’

Use the sum of maternal grandparents and the sum of paternal grandparents in the regression (useful when no grandparents genotyped)

‘--gen_models’, default=1-2

Which multi-generational models should be fit. Default fits 1 and 2 generation models. Specify a range by, for example, 1-3, where 3 fits a model with parental and grandparental scores

‘--h2f’ str

Provide heritability estimate in form h2f,h2f_SE (e.g. 0.5,0.01) from MZ-DZ comparison, RDR, or sibling realized relatedness. If provided when also fitting 2 generation model, will adjust results for assortative mating assuming equilibrium.

‘--rk’ str

Provide estimate of the correlation between parents’ PGIs in the form rk,rk_SE (e.g. 0.1,0.01). If provided with h2f, will use for adjusting estimates for assortative mating.

‘--bpg’

Restrict sample to those with both parents genotyped

‘--phen_index’ int, default=1

If the phenotype file contains multiple phenotypes, which phenotype should be analysed (default 1, first)

‘--ibdrel_path’ str

Path to KING IBD segment inference output (without .seg suffix).

‘--sparse_thresh’ float, default=0.05

Threshold of GRM/IBD sparsity

‘--scale_phen’

Scale the phenotype to have variance 1

‘--scale_pgs’

Scale the PGS to have variance 1 among the phenotyped individuals

‘--compute_controls’

Compute PGS for control families (default False)

‘--missing_char’ str, default=NA

Missing value string in phenotype file (default NA)

‘--no_am_adj’

Do not adjust imputed parental PGSs for assortative mating

‘--force_am_adj’

Force assortative mating adjustment even when estimated correlation is noisy/not significant

‘--threads’ int, default=1

Number of threads to use

‘--batch_size’ int, default=10000

Batch size for reading in SNPs (default 10000)
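
As a rough illustration of how the weights file columns (SNP, A1, A2 and the beta column) relate to a score, the Python sketch below computes a weighted allele count for one individual. It is not the script's implementation: the function name compute_pgs and the dictionary layout are hypothetical, and the handling of flipped and mismatched alleles is an assumption made for the example.

```python
def compute_pgs(weights, snp_alleles, genotypes):
    """Weighted allele count for one individual.

    weights: dict SNP ID -> (A1, A2, beta), with beta given for allele A1.
    snp_alleles: dict SNP ID -> (counted allele, other allele) in the genotype data.
    genotypes: dict SNP ID -> count (0, 1, 2) of the counted allele.
    Assumptions: SNPs absent from the genotype data are skipped, swapped
    alleles are handled by counting the other allele, and SNPs whose
    alleles do not match are dropped.
    """
    score = 0.0
    for snp, (a1, a2, beta) in weights.items():
        if snp not in genotypes or snp not in snp_alleles:
            continue
        counted, other = snp_alleles[snp]
        count = genotypes[snp]
        if (a1, a2) == (counted, other):
            score += beta * count
        elif (a1, a2) == (other, counted):
            score += beta * (2 - count)  # count copies of A1 instead
    return score


if __name__ == "__main__":
    weights = {"rs1": ("A", "G", 0.5), "rs2": ("C", "T", -0.2)}
    snp_alleles = {"rs1": ("A", "G"), "rs2": ("T", "C")}
    genotypes = {"rs1": 2, "rs2": 1}
    print(compute_pgs(weights, snp_alleles, genotypes))
```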

Results:

PGS file: Output when inputting observed and imputed genotype files and a weights file. A file with PGS values for each individual and their parents, with suffix .pgs.txt. Also includes sibling PGS if --fit_sib is specified, and grandparental PGS if --grandpar is specified.
PGS effect estimates: Output when inputting a phenotype file. A file with suffix effects.txt containing estimates of the PGS effects and their standard errors, and a file with suffix vcov.txt containing the sampling variance-covariance matrix of the effect estimates.

correlate.py

Infers correlations between direct effects and population effects, and between direct effects and average non-transmitted coefficients (NTCs).

Minimally: the script requires summary statistics as output by snipar’s gwas.py script, and either LD-scores (as output by snipar’s ibd.py script or LDSC) or .bed files from which LD-scores can be computed.

Args:

‘-h’, ‘--help’

show this help message and exit

(positional argument) str
Address of sumstats files in SNIPar sumstats.gz text format (without .sumstats.gz suffix). If there is a @ in the address, @ is replaced by the chromosome numbers in chr_range (optional argument)

‘--chr_range’

Chromosomes to be analysed. Should be a series of ranges in x-y format or integers.

(positional argument) str
Prefix for output file(s)

‘--ldscores’ str
Address of ldscores as output by LDSC

‘--bed’ str
Address of observed genotype files in .bed format (without .bed suffix). If there is a # in the address, # is replaced by the chromosome numbers in the range of 1-22.

‘--threads’ int
Number of threads to use. Uses all available by default.

‘--min_maf’ float, default=0.05
Ignore SNPs with minor allele frequency below min_maf (default 0.05)

‘--corr_filter’ float, default=6.0
Filter out SNPs with outlying sampling correlations more than corr_filter SDs from mean (default 6)

‘--n_blocks’ int, default=200
Number of blocks to use for block-jackknife variance estimate (default 200)

‘--save_delete’
Save jackknife delete values

‘--ld_wind’ float, default=1.0
The window, in cM, within which LD scores are computed (default 1cM)

‘--ld_out’ str
Output LD scores in LDSC format to this address
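
For readers unfamiliar with the block-jackknife behind n_blocks and save_delete, the following Python sketch shows the standard delete-one-block estimator on a generic statistic. It is a textbook version offered as an assumption about the general technique, not the estimator the script necessarily uses: the function names and the choice of a Pearson correlation as the statistic are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def block_jackknife_se(x, y, n_blocks=200, statistic=pearson):
    """Standard error of statistic(x, y) from a delete-one-block jackknife.

    x, y: lists ordered along the genome, split into contiguous blocks.
    Returns the standard error and the list of delete values, i.e. the
    statistic recomputed with each block left out in turn.
    """
    n = len(x)
    if n < n_blocks:
        raise ValueError("need at least one observation per block")
    bounds = [round(i * n / n_blocks) for i in range(n_blocks + 1)]
    delete_values = [
        statistic(x[:start] + x[end:], y[:start] + y[end:])
        for start, end in zip(bounds[:-1], bounds[1:])
    ]
    mean = sum(delete_values) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((d - mean) ** 2 for d in delete_values)
    return math.sqrt(variance), delete_values


if __name__ == "__main__":
    xs = [float(i) for i in range(40)]
    ys = [v + (1.0 if i % 2 else -1.0) for i, v in enumerate(xs)]
    se, deletes = block_jackknife_se(xs, ys, n_blocks=10)
    print(se, len(deletes))
```

Deleting contiguous blocks rather than single SNPs is what lets the estimate allow for correlation between neighbouring SNPs.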

Results:

correlations: A text file containing the estimated correlations and their standard errors.

simulate.py

Simulates genotype-phenotype data using forward simulation. Phenotypes can be affected by direct genetic effects, indirect genetic effects (vertical transmission), and assortative mating.

Args:

‘-h’, ‘--help’

show this help message and exit

(positional argument) int: Number of causal loci
(positional argument) float: Heritability due to direct effects in first generation
(positional argument) str: Prefix for simulation output files

‘--bgen’ str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--chr_range’

Chromosomes to be used. Should be a series of ranges in x-y format or integers.

‘--nfam’ int

Number of families to simulate. If inputting bgen and not given, will be one half of samples in bgen

‘--min_maf’ float, default=0.05

Minimum minor allele frequency for simulated genotypes, which will be simulated from a density proportional to 1/x

‘--maf’ float

Minor allele frequency for simulated genotypes (not needed when providing bgen files)

‘--n_random’ int

Number of generations of random mating

‘--n_am’ int

Number of generations of assortative mating

‘--r_par’ float

Phenotypic correlation of parents (for assortative mating)

‘--v_indir’ float

Variance explained by parental indirect genetic effects as a fraction of the heritability, e.g. 0.5

‘--r_dir_indir’ float

Correlation between direct and indirect genetic effects

‘--beta_vert’ float

Vertical transmission coefficient

‘--save_par_gts’

Save the genotypes of the parents of the final generation

‘--impute’

Impute parental genotypes from phased sibling genotypes & IBD

‘--unphased_impute’

Impute parental genotypes from unphased sibling genotypes & IBD
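
The min_maf description refers to drawing allele frequencies from a density proportional to 1/x. One way to do this is inverse-CDF sampling: on an interval [a, b] the CDF is ln(x/a)/ln(b/a), so x = a(b/a)^u for u uniform on [0, 1). The Python sketch below illustrates this. It is not the simulation code itself: the function name sample_maf is hypothetical, and the upper bound of 0.5 is an assumption based on the definition of a minor allele frequency.

```python
import random


def sample_maf(min_maf=0.05, max_maf=0.5, rng=random):
    """Draw one frequency from a density proportional to 1/x on [min_maf, max_maf).

    Uses the inverse CDF: x = a * (b / a) ** u with u uniform on [0, 1).
    Assumption: max_maf = 0.5, the largest possible minor allele frequency.
    """
    u = rng.random()
    return min_maf * (max_maf / min_maf) ** u


if __name__ == "__main__":
    generator = random.Random(1)
    print([round(sample_maf(rng=generator), 3) for _ in range(5)])
```

Under this density, low frequencies are drawn more often than high ones, and half of the draws fall below the geometric mean of the two bounds.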

Results:

genotype data in .bed format; full pedigree including phenotype and genetic components for all generations