Command Line Scripts
ibd.py
Infers identity-by-descent (IBD) segments shared between full siblings.
At minimum, the script requires observed sibling genotypes in either .bed or .bgen format, along with information on the relations present in the dataset. The relations can be provided as a pedigree file, or as the results of KING kinship inference together with age and sex information (from which a pedigree can be constructed).
- Args:
- ‘-h’, ‘--help’
show this help message and exit
- ‘--bgen’ : str
Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--bed’ : str
Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--chr_range’
Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.
- ‘--king’ : str
Address of the KING file
- ‘--agesex’ : str
Address of file with age and sex information
- ‘--pedigree’ : str
Address of pedigree file
- ‘--map’ : str
No description provided.
- ‘--out’ : str, default=ibd
The IBD segments are written to this path, one file for each chromosome. If the path contains ‘#’, the ‘#’ is replaced with the chromosome number. Otherwise, the segments are written to the given path with file names chr_1.ibd.segments.gz, chr_2.ibd.segments.gz, etc.
- ‘--p_error’ : float
Probability of genotyping error. By default, this is estimated from genotyped parent-offspring pairs.
- ‘--min_length’ : float, default=0.01
Smooth segments with length less than min_length (cM)
- ‘--threads’ : int
Number of threads to use for IBD inference. Uses all available by default.
- ‘--min_maf’ : float, default=0.01
Minimum minor allele frequency
- ‘--max_missing’ : float, default=5
Ignore SNPs with a percentage of missing calls greater than max_missing (default 5)
- ‘--max_error’ : float, default=0.01
Maximum per-SNP genotyping error probability
- ‘--ibdmatrix’
Output a matrix of SNP IBD states (in addition to the segments file)
- ‘--ld_out’
Output LD scores of SNPs (used internally for weighting).
- ‘--chrom’ : int
The chromosome of the input .bgen file. Helpful if inputting a single .bgen file without chromosome information.
- ‘--batches’ : int, default=1
Number of batches (by sibling pair) to split the data into for IBD inference. Useful for large datasets.
- Results:
- IBD segments
For each chromosome, a gzipped text file containing the IBD segments for the siblings is output.
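The chr_range and @ conventions described for ibd.py can be illustrated with a small sketch. This is not snipar's own code: the function names parse_chr_range and expand_address are hypothetical, and the sketch only assumes what the argument descriptions state (ranges in x-y format or integers, and @ replaced by each chromosome number).

```python
def parse_chr_range(tokens):
    """Expand tokens such as ["1-3", "7"] into [1, 2, 3, 7].

    Hypothetical helper: each token is either a single integer
    or a range written as x-y (both ends inclusive).
    """
    chroms = []
    for token in tokens:
        if "-" in token:
            start, end = token.split("-")
            chroms.extend(range(int(start), int(end) + 1))
        else:
            chroms.append(int(token))
    return chroms


def expand_address(address, chroms):
    """Return one address per chromosome, replacing '@' with its number.

    If the address has no '@', it is assumed to point at a single file
    and is returned unchanged.
    """
    if "@" not in address:
        return [address]
    return [address.replace("@", str(c)) for c in chroms]


chroms = parse_chr_range(["1-3", "7"])
paths = expand_address("genotypes/chr_@", chroms)
print(paths)
```

The example address genotypes/chr_@ is made up for illustration; file suffix handling (.bed or .bgen) is left out because the descriptions above do not specify it.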
impute.py
gwas.py
Infers direct effects, non-transmitted coefficients (NTCs), and population effects of genome-wide SNPs on a phenotype.
At minimum, the script requires observed genotypes on phenotyped individuals along with a phenotype file. If no imputed parental genotypes are provided, a pedigree file is required, and by default the script analyses samples with siblings and/or both parents genotyped.
- Args:
- ‘-h’, ‘--help’
show this help message and exit
- : str
Location of the phenotype file
- ‘--bgen’ : str
Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--bed’ : str
Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--imp’ : str
Address of HDF5 files with imputed parental genotypes (without .hdf5 suffix). If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--pedigree’ : str
Address of pedigree file. Must be provided if not providing imputed parental genotypes.
- ‘--covar’ : str
Path to file with covariates: plain text file with columns FID, IID, covar1, covar2, ...
- ‘--chr_range’
Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.
- ‘--out’ : str, default=./
The summary statistics are written to this path, one file for each chromosome. If the path contains ‘@’, the ‘@’ is replaced with the chromosome number. Otherwise, the summary statistics are written to the given path with file names chr_1.sumstats.gz, chr_2.sumstats.gz, etc. for the text sumstats, and chr_1.sumstats.hdf5, etc. for the HDF5 sumstats.
- ‘--grm’ : str
Path to GRM file giving pairwise relatedness information. Designed to work with KING IBD segment inference output (.seg file).
- ‘--grmgz’ : str
Path to GRM in GCTA grm.gz format (without .grm.gz suffix). Assumes a .grm.id file with the same root path is also available.
- ‘--sparse_thresh’ : float, default=0.05
Threshold of GRM sparsity: elements below this value are set to zero
- ‘--impute_unrel’
Include unrelated individuals and impute their parental genotypes linearly. See the unified estimator in Guan et al.
- ‘--robust’
Use the robust estimator
- ‘--sib_diff’
Use the sibling difference method
- ‘--parsum’
Regress onto proband and sum of (imputed/observed) maternal and paternal genotypes. Default uses separate paternal and maternal genotypes when available.
- ‘--fit_sib’
Fit indirect effect from sibling
- ‘--phen’ : str
Name of the phenotype to be analysed (case sensitive). Default uses first phenotype in file.
- ‘--phen_index’ : int, default=1
If the phenotype file contains multiple phenotypes, which phenotype should be analysed (default 1, first)
- ‘--missing_char’ : str, default=NA
Missing value string in phenotype file (default NA)
- ‘--min_maf’ : float, default=0.01
Ignore SNPs with minor allele frequency below min_maf (default 0.01)
- ‘--max_missing’ : float, default=5
Ignore SNPs with a percentage of missing calls greater than max_missing (default 5)
- ‘--vc_out’ : str
Prefix of output filename for variance component array (without .npy).
- ‘--vc_list’ : float
Pass in variance components as a list of floats.
- ‘--no_sib_var’
Do not fit sibling variance component. Not recommended for family-GWAS.
- ‘--keep’ : str
Filename of IDs to be kept for analysis (no header).
- ‘--cpus’ : int, default=1
Number of CPUs to distribute batches across
- ‘--threads’ : int, default=1
Number of threads to use per CPU. Uses all available by default.
- ‘--no_hdf5_out’
Suppress HDF5 output of summary statistics
- ‘--batch_size’ : int, default=100000
Batch size of SNPs to load at a time (lower this to reduce memory requirements)
- Results:
- sumstats.gz
For each chromosome, a gzipped text file containing the SNP level summary statistics.
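The effect of the sparse_thresh argument of gwas.py (elements below the threshold are set to zero) can be sketched as follows. This is an illustration only, not snipar's implementation: the function name sparsify_grm and the dictionary representation of the GRM are assumptions made here. Dropping an entry from the dictionary stands for setting it to zero in a sparse matrix.

```python
def sparsify_grm(pairs, sparse_thresh=0.05):
    """Keep only pairwise relatedness values at or above sparse_thresh.

    pairs maps (id_1, id_2) tuples to relatedness values. Entries that
    are removed are treated as zero, as in a sparse matrix.
    """
    return {ids: r for ids, r in pairs.items() if r >= sparse_thresh}


# Hypothetical relatedness values for three pairs of individuals
grm = {
    ("ind1", "ind2"): 0.50,  # e.g. full siblings
    ("ind1", "ind3"): 0.02,  # below threshold: treated as unrelated
    ("ind2", "ind3"): 0.05,  # exactly at the threshold: kept
}
sparse_grm = sparsify_grm(grm, sparse_thresh=0.05)
print(sparse_grm)
```

A higher threshold gives a sparser matrix, at the cost of ignoring more distant relatedness.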
pgs.py
Infers direct effects, non-transmitted coefficients (NTCs), and population effects of a PGS on a phenotype.
At minimum, the script requires observed genotypes on individuals and their parents, and/or parental genotypes imputed by snipar’s impute.py script, along with a SNP weights file.
- Args:
- ‘-h’, ‘--help’
show this help message and exit
- : str
Prefix for computed PGS file and/or regression results files
- ‘--bgen’ : str
Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--bed’ : str
Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--imp’ : str
Address of HDF5 files with imputed parental genotypes (without .hdf5 suffix). If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--chr_range’
Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.
- ‘--pedigree’ : str
Address of pedigree file. Must be provided if not providing imputed parental genotypes.
- ‘--weights’ : str
Location of the PGS allele weights
- ‘--SNP’ : str, default=SNP
Name of column in weights file with SNP IDs
- ‘--beta_col’ : str, default=b
Name of column with betas/weights for each SNP
- ‘--A1’ : str, default=A1
Name of column with the allele that the betas/weights are given with respect to
- ‘--A2’ : str, default=A2
Name of column with alternative allele
- ‘--sep’ : str
Column separator in weights file. If not provided, an attempt to determine this will be made.
- ‘--phenofile’ : str
Location of the phenotype file
- ‘--pgs’ : str
Location of the pre-computed PGS file
- ‘--covar’ : str
Path to file with covariates: plain text file with columns FID, IID, covar1, covar2, ...
- ‘--fit_sib’
Fit indirect effects from siblings
- ‘--parsum’
Use the sum of maternal and paternal PGS in the regression (useful when imputed from sibling data alone)
- ‘--grandpar’
Calculate imputed/observed grandparental PGS for individuals with both parents genotyped
- ‘--gparsum’
Use the sum of maternal grandparents and the sum of paternal grandparents in the regression (useful when no grandparents genotyped)
- ‘--gen_models’, default=1-2
Which multi-generational models should be fit. Default fits 1 and 2 generation models. Specify a range such as 1-3, where 3 fits a model with parental and grandparental scores
- ‘--h2f’ : str
Provide heritability estimate in the form h2f,h2f_SE (e.g. 0.5,0.01) from MZ-DZ comparison, RDR, or sibling realized relatedness. If provided when also fitting 2 generation model, will adjust results for assortative mating assuming equilibrium.
- ‘--rk’ : str
Provide estimate of the correlation between parents’ PGIs in the form rk,rk_SE (e.g. 0.1,0.01). If provided with h2f, will use for adjusting estimates for assortative mating.
- ‘--bpg’
Restrict sample to those with both parents genotyped
- ‘--phen’ : str
Name of the phenotype to be analysed (case sensitive). Default uses first phenotype in file.
- ‘--phen_index’ : int, default=1
If the phenotype file contains multiple phenotypes, which phenotype should be analysed (default 1, first)
- ‘--grm’ : str
Path to GRM file giving pairwise relatedness information. Designed to work with KING IBD segment inference output (.seg file).
- ‘--sparse_thresh’ : float, default=0.05
Threshold of GRM/IBD sparsity
- ‘--scale_phen’
Scale the phenotype to have variance 1
- ‘--scale_pgs’
Scale the PGS to have variance 1 among the phenotyped individuals
- ‘--compute_controls’
Compute PGS for control families (default False)
- ‘--missing_char’ : str, default=NA
Missing value string in phenotype file (default NA)
- ‘--no_am_adj’
Do not adjust imputed parental PGSs for assortative mating
- ‘--force_am_adj’
Force assortative mating adjustment even when estimated correlation is noisy/not significant
- ‘--threads’ : int, default=1
Number of threads to use
- ‘--batch_size’ : int, default=10000
Batch size for reading in SNPs (default 10000)
- Results:
- PGS file
Output when inputting observed and imputed genotype files and a weights file. A file with PGS values for each individual and their parents, with suffix .pgs.txt. Also includes sibling PGS if --fit_sib is specified, and grandparental PGS if --grandpar is specified.
- PGS effect estimates
Output when inputting a phenotype file. A file with suffix effects.txt containing estimates of the PGS effects and their standard errors, and a file with suffix vcov.txt containing the sampling variance-covariance matrix of the effect estimates.
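To make the roles of the weights file columns concrete (SNP, A1, A2 and b, matching the defaults of --SNP, --A1, --A2 and --beta_col), here is a minimal sketch of computing a PGS as a weighted sum of allele counts. It is not snipar's implementation: the function names, the tab separator, the in-memory data structures, and the choice to skip missing or allele-mismatched SNPs are all assumptions made for illustration.

```python
import csv
import io


def read_weights(handle, snp_col="SNP", a1_col="A1", a2_col="A2",
                 beta_col="b", sep="\t"):
    """Read a weights file into {snp_id: (A1, A2, weight)}."""
    reader = csv.DictReader(handle, delimiter=sep)
    return {row[snp_col]: (row[a1_col], row[a2_col], float(row[beta_col]))
            for row in reader}


def compute_pgs(genotypes, alleles, weights):
    """Weighted sum of allele counts for one individual.

    genotypes: {snp_id: count (0, 1 or 2) of the counted allele, or None}
    alleles:   {snp_id: (counted_allele, other_allele)} in the genotype data
    """
    score = 0.0
    for snp, (a1, a2, beta) in weights.items():
        count = genotypes.get(snp)
        if count is None or snp not in alleles:
            continue  # assumption: skip missing genotypes
        if alleles[snp] == (a1, a2):
            score += beta * count
        elif alleles[snp] == (a2, a1):
            # Weight refers to the other allele, so flip the count
            score += beta * (2 - count)
        # assumption: SNPs whose alleles match neither order are skipped
    return score


# Hypothetical weights file contents and genotype data
weights_text = "SNP\tA1\tA2\tb\nrs1\tA\tG\t0.5\nrs2\tC\tT\t-0.2\n"
weights = read_weights(io.StringIO(weights_text))
genotypes = {"rs1": 2, "rs2": 1}
alleles = {"rs1": ("A", "G"), "rs2": ("T", "C")}
pgs = compute_pgs(genotypes, alleles, weights)
print(pgs)
```

The same weighted sum applied to imputed or observed parental genotypes would give the parental PGS columns described under the PGS file output.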
correlate.py
Infers correlations between direct effects and population effects, and between direct effects and average non-transmitted coefficients (NTCs).
At minimum, the script requires summary statistics as output by snipar’s gwas.py script, and either LD scores (as output by snipar’s ibd.py script or LDSC) or .bed files from which LD scores can be computed.
- Args:
- ‘-h’, ‘--help’
show this help message and exit
- : str
Address of sumstats files in snipar sumstats.gz text format (without .sumstats.gz suffix). If there is a @ in the address, @ is replaced by each chromosome number in chr_range (optional argument)
- ‘--chr_range’
Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.
- : str
Prefix for output file(s)
- ‘--ldscores’ : str
Address of LD scores as output by LDSC
- ‘--bed’ : str
Address of observed genotype files in .bed format (without .bed suffix). If there is a # in the address, # is replaced by the chromosome numbers in the range of 1-22.
- ‘--threads’ : int
Number of threads to use for IBD inference. Uses all available by default.
- ‘--min_maf’ : float, default=0.05
Ignore SNPs with minor allele frequency below min_maf (default 0.05)
- ‘--corr_filter’ : float, default=6.0
Filter out SNPs with outlying sampling correlations more than corr_filter SDs from the mean (default 6)
- ‘--n_blocks’ : int, default=200
Number of blocks to use for block-jackknife variance estimate (default 200)
- ‘--save_delete’
Save jackknife delete values
- ‘--ld_wind’ : float, default=1.0
The window, in cM, within which LD scores are computed (default 1 cM)
- ‘--ld_out’ : str
Output LD scores in LDSC format to this address
- Results:
- correlations
A text file containing the estimated correlations and their standard errors.
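The n_blocks and save_delete arguments refer to a block-jackknife variance estimate. The sketch below shows the standard delete-one-block jackknife on a generic estimator, to illustrate what the delete values are. It is not taken from snipar: the function name, the contiguous near-equal blocks, and the use of a simple mean as the estimator are assumptions made here.

```python
import math


def block_jackknife(values, n_blocks, estimator):
    """Delete-one-block jackknife.

    Returns (full-sample estimate, standard error, delete values), where
    each delete value is the estimate with one contiguous block removed.
    """
    n = len(values)
    # Block boundaries for contiguous blocks of near-equal size
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    estimate = estimator(values)
    delete_values = []
    for b in range(n_blocks):
        kept = values[:bounds[b]] + values[bounds[b + 1]:]
        delete_values.append(estimator(kept))
    mean_delete = sum(delete_values) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (d - mean_delete) ** 2 for d in delete_values)
    return estimate, math.sqrt(variance), delete_values


def mean(xs):
    return sum(xs) / len(xs)


# Toy data: 12 per-SNP values split into 4 blocks of 3
data = [float(x) for x in range(1, 13)]
estimate, se, delete_values = block_jackknife(data, n_blocks=4, estimator=mean)
print(estimate, se, delete_values)
```

Removing whole blocks of neighbouring SNPs, instead of single SNPs, is what lets the jackknife account for correlation between nearby SNPs.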
simulate.py
Simulates genotype-phenotype data using forward simulation. Phenotypes can be affected by direct genetic effects, indirect genetic effects (vertical transmission), and assortative mating.
- Args:
- ‘-h’, ‘--help’
show this help message and exit
- : int
Number of causal loci
- : float
Heritability due to direct effects in first generation
- : str
Prefix for simulation output files
- ‘--bgen’ : str
Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by each chromosome number in chr_range (chr_range is an optional parameter for this script).
- ‘--chr_range’
Chromosomes to use. Should be a series of ranges in x-y format (e.g. 1-22) or integers.
- ‘--nfam’ : int
Number of families to simulate. If inputting bgen and not given, will be one half of samples in bgen
- ‘--min_maf’ : float, default=0.05
Minimum minor allele frequency for simulated genotypes, which will be simulated from a density proportional to 1/x
- ‘--maf’ : float
Minor allele frequency for simulated genotypes (not needed when providing bgen files)
- ‘--n_random’ : int
Number of generations of random mating
- ‘--n_am’ : int
Number of generations of assortative mating
- ‘--r_par’ : float
Phenotypic correlation of parents (for assortative mating)
- ‘--v_indir’ : float
Variance explained by parental indirect genetic effects as a fraction of the heritability, e.g. 0.5
- ‘--r_dir_indir’ : float
Correlation between direct and indirect genetic effects
- ‘--beta_vert’ : float
Vertical transmission coefficient
- ‘--save_par_gts’
Save the genotypes of the parents of the final generation
- ‘--impute’
Impute parental genotypes from phased sibling genotypes & IBD
- ‘--unphased_impute’
Impute parental genotypes from unphased sibling genotypes & IBD
- Results:
Genotype data in .bed format, and a full pedigree including phenotype and genetic components for all generations.
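The min_maf description says simulated allele frequencies come from a density proportional to 1/x. One way to draw from such a density is inverse-CDF sampling, sketched below. This is not simulate.py's code: the function name sample_maf and the upper bound of 0.5 (the largest possible minor allele frequency) are assumptions made here.

```python
import random


def sample_maf(min_maf=0.05, max_maf=0.5, rng=random):
    """Draw one frequency from a density proportional to 1/x.

    On [min_maf, max_maf] the CDF is ln(x / min_maf) / ln(max_maf / min_maf),
    so inverting it gives x = min_maf * (max_maf / min_maf) ** u
    for u uniform on [0, 1).
    """
    u = rng.random()
    return min_maf * (max_maf / min_maf) ** u


rng = random.Random(1)  # fixed seed so the draws are reproducible
mafs = [sample_maf(0.05, 0.5, rng) for _ in range(1000)]
print(min(mafs), max(mafs))
```

Under this density, low-frequency variants are drawn more often than common ones, and the logarithm of the frequency is uniformly distributed.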