Command Line Scripts

ibd.py

Infers identity-by-descent (IBD) segments shared between full siblings.

Minimally: the script requires observed sibling genotypes in either .bed or .bgen format, along with information on the relations present in the dataset, which can be provided using a pedigree file or the results of KING kinship inference along with age and sex information (from which a pedigree can be constructed).

Args:
‘-h’, ‘--help’

show this help message and exit

‘--bgen’ : str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--bed’ : str

Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--chr_range’

Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.

‘--king’ : str

Address of the KING file

‘--agesex’ : str

Address of file with age and sex information

‘--pedigree’ : str

Address of pedigree file

‘--map’ : str

None

‘--out’ : str, default=ibd

The IBD segments are written to this path, one file for each chromosome. If the path contains ‘#’, the ‘#’ is replaced with the chromosome number. Otherwise, the segments are written to the given path with file names chr_1.ibd.segments.gz, chr_2.ibd.segments.gz, etc.

‘--p_error’ : float

Probability of genotyping error. By default, this is estimated from genotyped parent-offspring pairs.

‘--min_length’ : float, default=0.01

Smooth segments with length less than min_length (cM)

‘--threads’ : int

Number of threads to use for IBD inference. Uses all available by default.

‘--min_maf’ : float, default=0.01

Minimum minor allele frequency

‘--max_missing’ : float, default=5

Ignore SNPs with greater percent missing calls than max_missing (default 5)

‘--max_error’ : float, default=0.01

Maximum per-SNP genotyping error probability

‘--ibdmatrix’

Output a matrix of SNP IBD states (in addition to segments file)

‘--ld_out’

Output LD scores of SNPs (used internally for weighting).

‘--chrom’ : int

The chromosome of the input .bgen file. Helpful if inputting a single .bgen file without chromosome information.

‘--batches’ : int, default=1

Number of batches to split the data (by sibpair) into for IBD inference. Useful for large datasets.
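
The --min_maf and --max_missing filters described above can be pictured with a short sketch. This is not snipar's implementation: the function name passes_filters and the encoding of genotypes as allele counts (0, 1 or 2, with None for a missing call) are assumptions made purely for illustration.

```python
from typing import Optional, Sequence


def passes_filters(genotypes: Sequence[Optional[int]],
                   min_maf: float = 0.01,
                   max_missing: float = 5.0) -> bool:
    """Return True if a SNP passes the MAF and missingness thresholds.

    genotypes: allele counts (0, 1 or 2) per individual, None if missing.
    max_missing is a percentage, matching the --max_missing option.
    """
    n_total = len(genotypes)
    observed = [g for g in genotypes if g is not None]
    if n_total == 0 or not observed:
        return False
    percent_missing = 100.0 * (n_total - len(observed)) / n_total
    if percent_missing > max_missing:
        return False
    # Frequency of the counted allele among observed calls.
    freq = sum(observed) / (2 * len(observed))
    maf = min(freq, 1.0 - freq)
    return maf >= min_maf
```

With the defaults, a SNP with 10% missing calls is dropped regardless of its allele frequency, and a SNP with a minor allele frequency of 0.005 is dropped even if it has no missing calls.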

Results:
IBD segments

For each chromosome, a gzipped text file containing the IBD segments for the siblings is output.
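
As a rough illustration of how a chr_range value and an address containing @ combine, the sketch below expands a list of range tokens and substitutes each chromosome number into a path template. The helper names parse_chr_range and expand_paths are hypothetical, and treating x-y as an inclusive range is an assumption; snipar's own parsing may differ.

```python
from typing import Iterable, List


def parse_chr_range(tokens: Iterable[str]) -> List[int]:
    """Expand tokens such as ['1-3', '7'] into [1, 2, 3, 7].

    Assumption: 'x-y' denotes an inclusive range.
    """
    chromosomes: List[int] = []
    for token in tokens:
        if '-' in token:
            start, end = token.split('-')
            chromosomes.extend(range(int(start), int(end) + 1))
        else:
            chromosomes.append(int(token))
    return chromosomes


def expand_paths(template: str, chromosomes: Iterable[int]) -> List[str]:
    """Replace '@' in template with each chromosome number in turn."""
    if '@' not in template:
        return [template]
    return [template.replace('@', str(c)) for c in chromosomes]


paths = expand_paths('genotypes/chr_@', parse_chr_range(['1-3', '7']))
```

Here paths holds one address per chromosome; a template without @ is returned unchanged as a single address.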

impute.py

gwas.py

Infers direct effects, non-transmitted coefficients (NTCs), and population effects of genome-wide SNPs on a phenotype.

Minimally: the script requires observed genotypes on phenotyped individuals along with a phenotype file. If no imputed parental genotypes are provided, a pedigree file is required, and the script will analyze samples with siblings and/or both parents genotyped by default.

Args:
‘-h’, ‘--help’

show this help message and exit

: str

Location of the phenotype file

‘--bgen’ : str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--bed’ : str

Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--imp’ : str

Address of hdf5 files with imputed parental genotypes (without .hdf5 suffix). If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range (chr_range is an optional parameter for this script).

‘--pedigree’ : str

Address of pedigree file. Must be provided if not providing imputed parental genotypes.

‘--covar’ : str

Path to file with covariates: plain text file with columns FID, IID, covar1, covar2, ...

‘--chr_range’

Chromosomes to analyse. Should be a series of ranges with x-y format (e.g. 1-22) or integers.

‘--out’ : str, default=./

The summary statistics are written to this path, one file for each chromosome. If the path contains ‘@’, the ‘@’ is replaced with the chromosome number. Otherwise, the summary statistics are written to the given path with file names chr_1.sumstats.gz, chr_2.sumstats.gz, etc. for the text sumstats, and chr_1.sumstats.hdf5, etc. for the HDF5 sumstats.

‘--grm’ : str

Path to GRM file giving pairwise relatedness information. Designed to work with KING IBD segment inference output (.seg file).

‘--grmgz’ : str

Path to GRM in GCTA grm.gz format (without .grm.gz suffix). Assumes .grm.id file with same root path also available.

‘--sparse_thresh’ : float, default=0.05

Threshold of GRM sparsity — elements below this value are set to zero

‘--impute_unrel’

Whether to include unrelated individuals and impute their parental genotypes linearly. See the unified estimator in Guan et al.

‘--robust’

Use the robust estimator

‘--sib_diff’

Use the sibling difference method

‘--parsum’

Regress onto proband and sum of (imputed/observed) maternal and paternal genotypes. Default uses separate paternal and maternal genotypes when available.

‘--fit_sib’

Fit indirect effect from sibling

‘--phen’ : str

Name of the phenotype to be analysed — case sensitive. Default uses first phenotype in file.

‘--phen_index’ : int, default=1

If the phenotype file contains multiple phenotypes, which phenotype should be analysed (default 1, first)

‘--missing_char’ : str, default=NA

Missing value string in phenotype file (default NA)

‘--min_maf’ : float, default=0.01

Ignore SNPs with minor allele frequency below min_maf (default 0.01)

‘--max_missing’ : float, default=5

Ignore SNPs with greater percent missing calls than max_missing (default 5)

‘--vc_out’ : str

Prefix of output filename for variance component array (without .npy).

‘--vc_list’ : float

Pass in variance components as a list of floats.

‘--no_sib_var’

Do not fit sibling variance component. Not recommended for family-GWAS.

‘--keep’ : str

Filename of IDs to be kept for analysis (No header).

‘--cpus’ : int, default=1

Number of cpus to distribute batches across

‘--threads’ : int, default=1

Number of threads to use per CPU. Uses all available by default.

‘--no_hdf5_out’

Suppress HDF5 output of summary statistics

‘--batch_size’ : int, default=100000

Batch size of SNPs to load at a time (reduce to reduce memory requirements)
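
The effect of --sparse_thresh on a relatedness matrix can be sketched as follows. This is a minimal illustration rather than snipar's code: the function name sparsify and the list-of-lists matrix are assumptions, and how the script itself treats diagonal or negative entries is not stated here.

```python
from typing import List


def sparsify(grm: List[List[float]],
             sparse_thresh: float = 0.05) -> List[List[float]]:
    """Return a copy of grm with entries below sparse_thresh set to zero."""
    return [[value if value >= sparse_thresh else 0.0 for value in row]
            for row in grm]


grm = [
    [1.00, 0.50, 0.01],
    [0.50, 1.00, 0.04],
    [0.01, 0.04, 1.00],
]
sparse_grm = sparsify(grm)
```

In this toy example the sibling-level entry (0.50) is kept, while the two small entries (0.01 and 0.04) are set to zero.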

Results:
sumstats.gz

For each chromosome, a gzipped text file containing the SNP level summary statistics.
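
The naming rule given for --out can be sketched as a small function. The name sumstats_paths is hypothetical, and two details are assumptions: that the .sumstats.gz and .sumstats.hdf5 suffixes are appended after ‘@’ is substituted, and that a path without ‘@’ is treated as a directory.

```python
import os
from typing import Dict


def sumstats_paths(out: str, chromosome: int) -> Dict[str, str]:
    """Illustrative mapping from an --out value to per-chromosome file names."""
    if '@' in out:
        # Assumption: suffixes are appended after '@' is substituted.
        prefix = out.replace('@', str(chromosome))
    else:
        # Assumption: a path without '@' is treated as a directory.
        prefix = os.path.join(out, f'chr_{chromosome}')
    return {
        'text': prefix + '.sumstats.gz',
        'hdf5': prefix + '.sumstats.hdf5',
    }


example = sumstats_paths('results/height_chr@', 3)
```

Under these assumptions, example['text'] is results/height_chr3.sumstats.gz, and the default --out of ./ yields chr_3.sumstats.gz in the current directory.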

pgs.py

Infers direct effects, non-transmitted coefficients (NTCs), and population effects of a PGS on a phenotype.

Minimally: the script requires observed genotypes on individuals and their parents, and/or parental genotypes imputed by snipar’s impute.py script, along with a SNP weights file.

Args:
‘-h’, ‘--help’

show this help message and exit

: str

Prefix for computed PGS file and/or regression results files

‘--bgen’ : str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--bed’ : str

Address of the unphased genotypes in .bed format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--imp’ : str

Address of hdf5 files with imputed parental genotypes (without .hdf5 suffix). If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range (chr_range is an optional parameter for this script).

‘--chr_range’

Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.

‘--pedigree’ : str

Address of pedigree file. Must be provided if not providing imputed parental genotypes.

‘--weights’ : str

Location of the PGS allele weights

‘--SNP’ : str, default=SNP

Name of column in weights file with SNP IDs

‘--beta_col’ : str, default=b

Name of column with betas/weights for each SNP

‘--A1’ : str, default=A1

Name of column with the allele that betas/weights are given with respect to

‘--A2’ : str, default=A2

Name of column with alternative allele

‘--sep’ : str

Column separator in weights file. If not provided, an attempt to determine this will be made.

‘--phenofile’ : str

Location of the phenotype file

‘--pgs’ : str

Location of the pre-computed PGS file

‘--covar’ : str

Path to file with covariates: plain text file with columns FID, IID, covar1, covar2, ...

‘--fit_sib’

Fit indirect effects from siblings

‘--parsum’

Use the sum of maternal and paternal PGS in the regression (useful when imputed from sibling data alone)

‘--grandpar’

Calculate imputed/observed grandparental PGS for individuals with both parents genotyped

‘--gparsum’

Use the sum of maternal grandparents and the sum of paternal grandparents in the regression (useful when no grandparents genotyped)

‘--gen_models’, default=1-2

Which multi-generational models should be fit. Default fits 1 and 2 generation models. Specify a range by, for example, 1-3, where 3 fits a model with parental and grandparental scores

‘--h2f’ : str

Provide heritability estimate in form h2f,h2f_SE (e.g. 0.5,0.01) from MZ-DZ comparison, RDR, or sibling realized relatedness. If provided when also fitting 2 generation model, will adjust results for assortative mating assuming equilibrium.

‘--rk’ : str

Provide estimate of the correlation between parents’ PGIs in the form rk,rk_SE (e.g. 0.1,0.01). If provided with h2f, will use for adjusting estimates for assortative mating.

‘--bpg’

Restrict sample to those with both parents genotyped

‘--phen’ : str

Name of the phenotype to be analysed — case sensitive. Default uses first phenotype in file.

‘--phen_index’ : int, default=1

If the phenotype file contains multiple phenotypes, which phenotype should be analysed (default 1, first)

‘--grm’ : str

Path to GRM file giving pairwise relatedness information. Designed to work with KING IBD segment inference output (.seg file).

‘--sparse_thresh’ : float, default=0.05

Threshold of GRM/IBD sparsity

‘--scale_phen’

Scale the phenotype to have variance 1

‘--scale_pgs’

Scale the PGS to have variance 1 among the phenotyped individuals

‘--compute_controls’

Compute PGS for control families (default False)

‘--missing_char’ : str, default=NA

Missing value string in phenotype file (default NA)

‘--no_am_adj’

Do not adjust imputed parental PGSs for assortative mating

‘--force_am_adj’

Force assortative mating adjustment even when estimated correlation is noisy/not significant

‘--threads’ : int, default=1

Number of threads to use

‘--batch_size’ : int, default=10000

Batch size for reading in SNPs (default 10000)
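
The --h2f and --rk options both take an estimate and its standard error as a single comma-separated string. A minimal sketch of splitting such a value is shown below; the helper name parse_estimate and its validation rules are assumptions for illustration, not snipar's own parsing code.

```python
from typing import Tuple


def parse_estimate(arg: str) -> Tuple[float, float]:
    """Split 'estimate,SE' (e.g. '0.5,0.01') into two floats."""
    parts = arg.split(',')
    if len(parts) != 2:
        raise ValueError(f"expected 'estimate,SE', got {arg!r}")
    estimate, se = float(parts[0]), float(parts[1])
    if se < 0:
        raise ValueError('standard error must be non-negative')
    return estimate, se


h2f, h2f_se = parse_estimate('0.5,0.01')
rk, rk_se = parse_estimate('0.1,0.01')
```

A value without a comma, such as '0.5', raises a ValueError in this sketch rather than silently assuming a standard error.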

Results:
PGS file

Output when inputting observed and imputed genotype files and a weights file. A file with PGS values for each individual and their parents, with suffix .pgs.txt. Also includes sibling PGS if --fit_sib is specified, and grandparental PGS if --grandpar is specified.

PGS effect estimates

Output when inputting a phenotype file. A file with suffix effects.txt containing estimates of the PGS effects and their standard errors, and a file with suffix vcov.txt containing the sampling variance-covariance matrix of the effect estimates
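
The standard errors and the sampling variance-covariance matrix are linked by a standard identity: each standard error is the square root of the corresponding diagonal entry, and the off-diagonal entries are needed for the standard error of a contrast between two estimates. The sketch below works on an in-memory matrix with made-up numbers and makes no assumption about the layout of the vcov.txt file; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def standard_errors(vcov: Sequence[Sequence[float]]) -> List[float]:
    """Standard errors are the square roots of the diagonal of vcov."""
    return [math.sqrt(vcov[i][i]) for i in range(len(vcov))]


def contrast_se(vcov: Sequence[Sequence[float]], i: int, j: int) -> float:
    """Standard error of (estimate_i - estimate_j).

    Uses Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b).
    """
    variance = vcov[i][i] + vcov[j][j] - 2.0 * vcov[i][j]
    return math.sqrt(variance)


# Made-up 2x2 example, not output from pgs.py.
vcov = [[0.04, 0.01],
        [0.01, 0.09]]
ses = standard_errors(vcov)
diff_se = contrast_se(vcov, 0, 1)
```

Ignoring the covariance term would give a different (here larger) standard error for the difference, which is why the full matrix is reported alongside the estimates.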

correlate.py

Infers correlations between direct effects and population effects, and between direct effects and average non-transmitted coefficients (NTCs).

Minimally: the script requires summary statistics as output by snipar’s gwas.py script, and either LD-scores (as output by snipar’s ibd.py script or LDSC) or .bed files from which LD-scores can be computed.

Args:
‘-h’, ‘--help’

show this help message and exit

: str

Address of sumstats files in SNIPar sumstats.gz text format (without .sumstats.gz suffix). If there is a @ in the address, @ is replaced by the chromosome numbers in chr_range (optional argument)

‘--chr_range’

Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.

: str

Prefix for output file(s)

‘--ldscores’ : str

Address of ldscores as output by LDSC

‘--bed’ : str

Address of observed genotype files in .bed format (without .bed suffix). If there is a # in the address, # is replaced by the chromosome numbers in the range of 1-22.

‘--threads’ : int

Number of threads to use. Uses all available by default.

‘--min_maf’ : float, default=0.05

Ignore SNPs with minor allele frequency below min_maf (default 0.05)

‘--corr_filter’ : float, default=6.0

Filter out SNPs with outlying sampling correlations more than corr_filter SDs from mean (default 6)

‘--n_blocks’ : int, default=200

Number of blocks to use for block-jackknife variance estimate (default 200)

‘--save_delete’

Save jackknife delete values

‘--ld_wind’ : float, default=1.0

The window, in cM, within which LD scores are computed (default 1cM)

‘--ld_out’ : str

Output LD scores in LDSC format to this address
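
The idea behind --corr_filter, dropping SNPs whose sampling correlation lies more than a given number of standard deviations from the mean, can be sketched as below. The function name within_sd_filter is hypothetical, and the use of the population standard deviation is an assumption; the script may compute the spread differently.

```python
import statistics
from typing import List, Sequence


def within_sd_filter(values: Sequence[float],
                     corr_filter: float = 6.0) -> List[bool]:
    """Flag values lying within corr_filter SDs of the mean.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # No spread, so nothing can be an outlier.
        return [True] * len(values)
    return [abs(v - mean) <= corr_filter * sd for v in values]


# Toy data: 99 identical values and one extreme outlier.
keep = within_sd_filter([0.0] * 99 + [100.0])
```

In this toy example only the final value is flagged for removal, since it lies roughly ten standard deviations from the mean.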

Results:
correlations

A text file containing the estimated correlations and their standard errors.
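
The standard errors come from a block-jackknife, controlled by --n_blocks, with the per-block delete values optionally saved by --save_delete. The sketch below shows a generic delete-one-block jackknife over contiguous blocks. It is an illustration of the general technique under stated assumptions (equal weighting of blocks, a user-supplied estimator), not the estimator used by correlate.py; the function name block_jackknife_se is hypothetical.

```python
import math
from typing import Callable, List, Sequence


def block_jackknife_se(data: Sequence[float],
                       estimator: Callable[[Sequence[float]], float],
                       n_blocks: int = 200) -> float:
    """Delete-one-block jackknife standard error of estimator(data)."""
    n = len(data)
    if not 2 <= n_blocks <= n:
        raise ValueError('need 2 <= n_blocks <= len(data)')
    # Contiguous block boundaries, as even as integer division allows.
    bounds = [b * n // n_blocks for b in range(n_blocks + 1)]
    delete_values: List[float] = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        kept = list(data[:start]) + list(data[end:])
        delete_values.append(estimator(kept))
    mean_delete = sum(delete_values) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (d - mean_delete) ** 2 for d in delete_values)
    return math.sqrt(variance)


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


se = block_jackknife_se(list(range(1, 11)), mean, n_blocks=10)
```

With one observation per block and the mean as the estimator, the jackknife standard error equals the usual sample standard deviation divided by the square root of the sample size, which is a convenient check on the formula.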

simulate.py

Simulates genotype-phenotype data using forward simulation. Phenotypes can be affected by direct genetic effects, indirect genetic effects (vertical transmission), and assortative mating.

Args:
‘-h’, ‘--help’

show this help message and exit

: int

Number of causal loci

: float

Heritability due to direct effects in first generation

: str

Prefix for simulation output files

‘--bgen’ : str

Address of the phased genotypes in .bgen format. If there is a @ in the address, @ is replaced by the chromosome numbers in the range of chr_range for each chromosome (chr_range is an optional parameter for this script).

‘--chr_range’

Chromosomes to analyse. Should be a series of ranges in x-y format (e.g. 1-22) or integers.

‘--nfam’ : int

Number of families to simulate. If inputting bgen and not given, will be one half of samples in bgen

‘--min_maf’ : float, default=0.05

Minimum minor allele frequency for simulated genotypes, which will be simulated from a density proportional to 1/x

‘--maf’ : float

Minor allele frequency for simulated genotypes (not needed when providing bgen files)

‘--n_random’ : int

Number of generations of random mating

‘--n_am’ : int

Number of generations of assortative mating

‘--r_par’ : float

Phenotypic correlation of parents (for assortative mating)

‘--v_indir’ : float

Variance explained by parental indirect genetic effects as a fraction of the heritability, e.g. 0.5

‘--r_dir_indir’ : float

Correlation between direct and indirect genetic effects

‘--beta_vert’ : float

Vertical transmission coefficient

‘--save_par_gts’

Save the genotypes of the parents of the final generation

‘--impute’

Impute parental genotypes from phased sibling genotypes & IBD

‘--unphased_impute’

Impute parental genotypes from unphased sibling genotypes & IBD
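
Drawing allele frequencies from a density proportional to 1/x, as described for --min_maf, can be done by inverse-transform sampling. The sketch below is one way to do this and is not taken from simulate.py: the function name sample_mafs, the upper bound of 0.5 and the seed argument are assumptions for illustration.

```python
import random
from typing import List, Optional


def sample_mafs(n: int,
                min_maf: float = 0.05,
                max_maf: float = 0.5,
                seed: Optional[int] = None) -> List[float]:
    """Draw n frequencies from a density proportional to 1/x.

    On [min_maf, max_maf] the CDF is log(x / min_maf) / log(max_maf / min_maf),
    so inverting it gives x = min_maf * (max_maf / min_maf) ** u, u ~ U[0, 1).
    """
    if not 0 < min_maf < max_maf <= 0.5:
        raise ValueError('need 0 < min_maf < max_maf <= 0.5')
    rng = random.Random(seed)
    ratio = max_maf / min_maf
    return [min_maf * ratio ** rng.random() for _ in range(n)]


mafs = sample_mafs(1000, seed=1)
```

Because the density falls off as 1/x, rarer alleles are drawn more often than common ones, and every draw lies between min_maf and the upper bound.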

Results:

Genotype data in .bed format; full pedigree including phenotype and genetic components for all generations.