Aldy
Aldy is a tool for allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes.
More simply, it is a tool which can detect the copy number of a target gene, and the structure and genotype of each gene copy present in the sample.
Aldy has been published in Nature Communications (doi:10.1038/s41467-018-03273-1). Preprint is available here. Full experimental pipeline is available here.
Installation
Aldy is written in Python, and supports both Python 2.7 and 3. It is intended to be run on POSIX-based systems (only Linux and macOS have been tested).
The easiest way to install Aldy is to use pip:
pip install git+https://github.com/inumanag/aldy.git
Add --user to install it locally if you cannot write to the system-wide Python directory.
Prerequisite: ILP solver
Although Aldy can be installed without an ILP solver, it cannot produce meaningful results without a supported one. Unfortunately, none of the currently supported solvers can be installed automatically.
Solvers which are currently supported include:
- Gurobi (highly recommended): a commercial solver which is free for academic purposes. After installing it, don’t forget to install the gurobipy package by going to Gurobi’s installation directory (e.g. /opt/gurobi/linux64 on Linux or /Library/gurobi751/mac64/ on macOS) and issuing python setup.py install.
- SCIP: another solver that is also free for academic purposes and is a bit easier to obtain than Gurobi (no registration required). However, it is slower than Gurobi. The results from the paper were generated with Gurobi, although SCIP should produce the same results. Once you have installed SCIP, please install the PySCIPOpt module, which provides the Python bindings for SCIP.
Sanity check
After installing Aldy and a supported ILP solver, please test the installation by issuing the following command (this should take around a minute):
aldy --test
If everything is set up properly, you should see something like this:
Aldy Sanity-Check Test
Expected result is: *1/*4+*4
*** Aldy v1.2 (Python 3.6) ***
(c) 2017 SFU, MIT & IUB. All rights reserved.
Arguments:
Gene: CYP2D6
Profile: illumina
...[redacted]...
Result:
*1/*4+*4 (1, 4N, 4AW)
Running
Aldy needs an aligned SAM, BAM, CRAM or DeeZ file for the analysis. Below we will use BAM as an example.
It is assumed that reads are mapped to hg19 or GRCh37. hg38 is not yet supported.
An index is needed for BAM files. One can be created with, for example, samtools:
samtools index file.bam
Aldy is invoked as:
aldy -p [profile] -g [gene] file.bam
The [gene] parameter indicates the name of the gene to be genotyped. Currently, Aldy supports CYP2D6, CYP2A6, CYP2C19, CYP2C8, CYP2C9, CYP3A4, CYP3A5, CYP4F2, TPMT and DPYD.
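As a minimal sketch, the invocation above can be assembled from Python before handing it to subprocess. The helper names build_aldy_command and run_aldy are hypothetical and not part of Aldy; the gene list is copied from the paragraph above, and the sketch assumes that the aldy executable is on the PATH.

```python
import subprocess

# Genes listed as supported in this README.
SUPPORTED_GENES = {
    "CYP2D6", "CYP2A6", "CYP2C19", "CYP2C8", "CYP2C9",
    "CYP3A4", "CYP3A5", "CYP4F2", "TPMT", "DPYD",
}


def build_aldy_command(bam_path, gene="CYP2D6", profile="illumina"):
    """Return the argv list for `aldy -p [profile] -g [gene] file.bam`.

    Hypothetical helper: it only checks the gene name against the list
    in this README and does not touch the input file.
    """
    gene = gene.upper()
    if gene not in SUPPORTED_GENES:
        raise ValueError("unsupported gene: " + gene)
    return ["aldy", "-p", profile, "-g", gene, bam_path]


def run_aldy(bam_path, gene="CYP2D6", profile="illumina"):
    """Run Aldy and return its exit code (requires Aldy on the PATH)."""
    return subprocess.call(build_aldy_command(bam_path, gene, profile))
```

Building the argument list separately from running it makes it easy to log or inspect the exact command before a long genotyping run starts.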
Sequencing profile selection
The [profile] argument refers to the sequencing profile. The following profiles are available:
- illumina for Illumina WGS (or any uniform-coverage technology) samples. It is highly recommended to use samples with at least 40x coverage. Also, anything lower than 20x will result in tears and agony.
- pgrnseq-v1 for PGRNseq v.1 capture protocol data
- pgrnseq-v2 for PGRNseq v.2 capture protocol data
If you are using a different technology (e.g. a home-brewed capture kit), you can still use Aldy provided that the following requirements are met:
- all samples have a similar coverage distribution (i.e. two sequenced samples with the same copy number configuration MUST have similar coverage profiles; please consult us if you are not sure about this)
- your panel includes a copy-number neutral region (currently, Aldy uses CYP2D8 as a copy-number neutral region)
Having said that, you can generate the profile for your panel/technology by running:
# Get the profile
aldy --generate-profile file.bam > my-cool-tech.profile
# Run Aldy
aldy -p my-cool-tech.profile -g [gene] file.bam
Output
Aldy will generate the following files: file.[gene].aldy (default location can be changed via the -o parameter) and file.[gene].aldylog (default location can be changed via the -l parameter). A summary of the results is shown at the end of the output:
$ aldy -p pgrnseq-v2 -g cyp2d6 NA19788_x.bam
*** Aldy v1.0 ***
(c) 2017 SFU, MIT & IUB. All rights reserved.
Arguments:
Gene: CYP2D6
Profile: pgrnseq-v2
Threshold: 50%
Input: NA19788_x.bam
Output: NA19788_x.CYP2D6.aldy
Log: NA19788_x.CYP2D6.aldylog
Result:
*2/*78+*2 (2MW, 2MW, 78/2|2M)
In this example, the CYP2D6 genotype is *2/*78+*2, expressed in terms of major star-alleles. Minor star-alleles are given in parentheses (in this case, two copies of *2MW and one copy of the *78 fusion on the *2M background).
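The structure of the result line can be illustrated with a short Python sketch. The function parse_result_line is hypothetical and not part of Aldy. It assumes that "/" separates the two haplotypes, that "+" separates gene copies within a haplotype, and that the parenthesised list is comma-separated, as in the examples shown in this README.

```python
import re


def parse_result_line(line):
    """Split an Aldy summary line into major and minor star-alleles.

    Returns (haplotypes, minors): `haplotypes` has one entry per side
    of the "/", each a list of major star-alleles; `minors` is the
    list of minor star-alleles taken from the parentheses.
    """
    match = re.match(r"^\s*(\S+)\s*(?:\((.*)\))?\s*$", line)
    if match is None:
        raise ValueError("unrecognised result line: %r" % line)
    diplotype, minor_part = match.groups()
    haplotypes = [side.split("+") for side in diplotype.split("/")]
    minors = [m.strip() for m in (minor_part or "").split(",") if m.strip()]
    return haplotypes, minors


haplotypes, minors = parse_result_line("*2/*78+*2 (2MW, 2MW, 78/2|2M)")
print(haplotypes)  # [['*2'], ['*78', '*2']]
print(minors)      # ['2MW', '2MW', '78/2|2M']
# The total number of gene copies is the number of major alleles.
print(sum(len(h) for h in haplotypes))  # 3
```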
The explicit decomposition is given in file.[gene].aldy (in the example above, NA19788_x.CYP2D6.aldy). An example of such a file:
# Aldy v1.0
# Gene: CYP2D6
# Number of solutions: 1
# Solution 0
# Predicted diplotype: *2/*78+*2
# Composition: 2MW,2MW,78/2|2M
Copy Allele Location Type Coverage Effect dbSNP Code Status
0 78/2 42522311 SNP.CT 1760 NEUTRAL rs12169962 4481:G>A NORMAL
0 78/2 42522612 SNP.CG 1287 DISRUPTING rs1135840 4180:G>C NORMAL
...[redacted]...
1 2MW 42522311 SNP.CT 1760 NEUTRAL rs12169962 4481:G>A NORMAL
1 2MW 42527541 DEL.TC 0 NEUTRAL rs536645539 -750:delGA MISSING
...[redacted]...
Each solution is introduced by a “Solution” line. The first column (Copy) shows the ordinal number of the allelic copy (e.g. 0 for 78/2 and 1 for 2MW in the example above). The following columns indicate:
- star-allele,
- mutation locus,
- mutation type (SNP or indel),
- mutation coverage,
- mutation functionality: DISRUPTING for a gene-disrupting mutation, NEUTRAL for a neutral mutation,
- dbSNP ID (if available),
- traditional Karolinska-style mutation code from the CYP allele database, and
- mutation status, which indicates the status of the mutation in the decomposition:
  - NORMAL: the mutation is associated with the star-allele in the database, and is found in the sample
  - NOVEL: a gene-disrupting mutation is NOT associated with the star-allele in the database, but is found in the sample (this indicates that Aldy found a novel major star-allele)
  - EXTRA: a neutral mutation is NOT associated with the star-allele in the database, but is found in the sample (this indicates that Aldy found a novel minor star-allele)
  - MISSING: a neutral mutation is associated with the star-allele in the database, but is NOT found in the sample (this also indicates that Aldy found a novel minor star-allele)
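The column layout described above lends itself to a small reader. The following is a sketch only: parse_aldy and unexpected are hypothetical helpers, not part of Aldy, and the code assumes that data rows are whitespace-separated with all nine fields present, as in the example file. How a missing dbSNP ID is written is not specified here, so rows with a different field count are skipped.

```python
from collections import namedtuple

# Field names follow the header line of the .aldy example above.
Mutation = namedtuple(
    "Mutation",
    "copy allele location kind coverage effect dbsnp code status",
)


def parse_aldy(lines):
    """Parse the mutation rows of a .aldy file.

    Returns a dict mapping solution number to a list of Mutation rows.
    """
    solutions = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("# Solution"):
            current = int(line.split()[-1])
            solutions[current] = []
            continue
        fields = line.split()
        # Skip comments, the column header and anything that is not a data row.
        if current is None or len(fields) != 9 or not fields[0].isdigit():
            continue
        copy, allele, location, kind, coverage, effect, dbsnp, code, status = fields
        solutions[current].append(Mutation(
            int(copy), allele, int(location), kind, float(coverage),
            effect, dbsnp, code, status,
        ))
    return solutions


def unexpected(mutations):
    """Rows whose status is not NORMAL (NOVEL, EXTRA or MISSING)."""
    return [m for m in mutations if m.status != "NORMAL"]


example = """\
# Solution 0
# Predicted diplotype: *2/*78+*2
Copy Allele Location Type Coverage Effect dbSNP Code Status
0 78/2 42522311 SNP.CT 1760 NEUTRAL rs12169962 4481:G>A NORMAL
1 2MW 42527541 DEL.TC 0 NEUTRAL rs536645539 -750:delGA MISSING
"""
solutions = parse_aldy(example.splitlines())
for m in unexpected(solutions[0]):
    print(m.copy, m.allele, m.code, m.status)  # 1 2MW -750:delGA MISSING
```

Filtering on the status column is a quick way to see whether a sample deviates from the database definition of its star-alleles.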
Logging
A detailed execution log is written to file.[gene].aldylog. It is used mainly for debugging purposes.
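When scripting around Aldy it can help to predict where the output and log files will land. The sketch below is an assumption inferred from the example run in this README (NA19788_x.bam with gene cyp2d6 produced NA19788_x.CYP2D6.aldy and NA19788_x.CYP2D6.aldylog): the input extension is dropped and the gene name is upper-cased. The helper default_outputs is hypothetical and not part of Aldy.

```python
import os


def default_outputs(input_path, gene="CYP2D6"):
    """Guess Aldy's default output and log paths for an input file.

    Assumption: the input's extension is dropped and the gene name is
    upper-cased, as in NA19788_x.bam -> NA19788_x.CYP2D6.aldy.
    """
    stem, _ext = os.path.splitext(input_path)
    base = "{}.{}".format(stem, gene.upper())
    return base + ".aldy", base + ".aldylog"


print(default_outputs("NA19788_x.bam", "cyp2d6"))
# ('NA19788_x.CYP2D6.aldy', 'NA19788_x.CYP2D6.aldylog')
```

If the guessed names do not match what Aldy writes on your system, pass explicit locations with -o and -l instead.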
Sample datasets
Sample datasets are also available for download. They include:
- HG00463 (PGRNseq v.2), containing CYP2D6 configuration with multiple copies
- NA19790 (PGRNseq v.2), containing a fusion between CYP2D6 and CYP2D7 deletion (*78 allele)
- NA24027 (PGRNseq v.1), containing novel DPYD allele and multiple copies of CYP2D6
- NA10856 (PGRNseq v.1), containing CYP2D6 deletion (*5 allele)
- NA10860 (Illumina WGS), containing 3 copies of CYP2D6. This sample contains only CYP2D6 region.
Expected results are:
| Gene (-g) | HG00463 | NA19790 | NA24027 | NA10856 | NA10860 |
|---|---|---|---|---|---|
| CYP2D6 | *36+*10/*36+*10 | *1/*78+*2 | *6/*2+*2 | *1/*5 | *1/*4+*4 |
| CYP2A6 | *1/*1 | *1/*1 | *1/*35 | *1/*1 | |
| CYP2C19 | *1/*3 | *1/*1 | *1/*2 | *1/*2 | |
| CYP2C8 | *1/*1 | *1/*3 | *1/*3 | *1/*1 | |
| CYP2C9 | *1/*1 | *1/*2 | *1/*2 | *1/*2 | |
| CYP3A4 | *1/*1 | *1/*1 | *1/*1 | *1/*1 | |
| CYP3A5 | *3/*3 | *3/*3 | *1/*3 | *1/*3 | |
| CYP4F2 | *1/*1 | *3/*4 | *1/*1 | *1/*1 | |
| TPMT | *1/*1 | *1/*1 | *1/*1 | *1/*1 | |
| DPYD | *1/*1 | *1/*1 | *4/*5 | *5/*6 | |
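The expected CYP2D6 calls can be turned into a small regression check. This is a sketch under stated assumptions: the dictionary is copied from the expected results in this section, the helper names normalise and matches_expected are hypothetical, and the comparison treats the order of the two haplotypes, and of the copies within a haplotype, as irrelevant.

```python
# Expected CYP2D6 calls for the sample datasets, copied from this README.
EXPECTED_CYP2D6 = {
    "HG00463": "*36+*10/*36+*10",
    "NA19790": "*1/*78+*2",
    "NA24027": "*6/*2+*2",
    "NA10856": "*1/*5",
    "NA10860": "*1/*4+*4",
}


def normalise(diplotype):
    """Order-independent form of a diplotype such as '*2+*2/*6'."""
    sides = [tuple(sorted(side.split("+"))) for side in diplotype.split("/")]
    return tuple(sorted(sides))


def matches_expected(sample, called):
    """True if `called` equals the expected CYP2D6 call up to allele order."""
    return normalise(called) == normalise(EXPECTED_CYP2D6[sample])


print(matches_expected("NA24027", "*2+*2/*6"))  # True
print(matches_expected("NA10856", "*1/*1"))     # False
```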
License
© 2017 Simon Fraser University, Indiana University Bloomington & Massachusetts Institute of Technology. All rights reserved.
Aldy is NOT free software. The complete legal license is available in the LICENSE file.
For non-legal folks, here is the TL;DR version:
- Aldy can be freely used in academic and non-commercial environments
- Please contact us if you intend to use Aldy for any commercial purpose
Parameter documentation
NAME
aldy — perform allelic decomposition and exact genotyping on next-generation sequencing data
SYNOPSIS
aldy -h
aldy --test
aldy --license
aldy --generate-profile
aldy --show-cn --gene
aldy [--threshold THRESHOLD] [--profile PROFILE]
[--gene GENE] [--verbosity VERBOSITY] [--log LOG]
[--output OUTPUT] [--solver SOLVER] [--phase] [--cn CN]
[file]
OPTIONS
Positional arguments:
file
SAM, BAM, CRAM or DeeZ input file
Optional arguments:
-h, --help
Show the help message and exit
-T, --threshold THRESHOLD
Cut-off rate for variations (percent per copy)
default: 50
-p, --profile PROFILE
Sequencing profile. Currently, only “illumina”, “pgrnseq-v1” and “pgrnseq-v2” are supported. Please check --generate-profile for more information on how to use your own profile
-g, --gene GENE
Gene profile
default: CYP2D6
-v, --verbosity VERBOSITY
Logging verbosity. Acceptable values are T (trace), D (debug), I (info) and W (warn)
default: I
-l, --log LOG
Location of the output log file
default: [input].[gene].aldylog
-o, --output OUTPUT
Location of the output file
default: [input].[gene].aldy
-s, --solver SOLVER
ILP solver. Currently supported solvers are Gurobi and SCIP
default: any
-P, --phase
Phase reads (slows down the pipeline)
default: no
--license
Print Aldy license
--test
Sanity-check on NA10860 sample
--show-cn
Show all copy number configurations supported by a gene (requires -g)
-c, --cn CN
Manually set copy number (input: a comma-separated list CN1,CN2,…). For a list of supported configurations, please run aldy --show-cn
--generate-profile
Generate the copy-number profile for the custom sequencing panel and print it on the standard output
Ibrahim Numanagić
If you have an urgent question, I suggest using e-mail: GitHub issues are not handled as quickly as e-mail requests.