Aldy

Aldy is a tool for allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. More simply, it is a tool which can detect the copy number of a target gene, and the structure and genotype of each gene copy present in the sample.

Aldy has been published in Nature Communications (doi:10.1038/s41467-018-03273-1). A preprint and the full experimental pipeline are also available.

Installation

Aldy is written in Python, and supports both Python 2.7 and 3. It is intended to be run on POSIX-based systems (only Linux and macOS have been tested).

The easiest way to install Aldy is to use pip:

pip install git+https://github.com/inumanag/aldy.git

Add --user to install it locally if you cannot write to the system-wide Python directory.

Prerequisite: ILP solver

Although Aldy can be installed without an ILP solver, it cannot produce meaningful results without a supported one. Unfortunately, none of the currently supported solvers can be installed automatically.

Solvers which are currently supported include:

  • Gurobi (highly recommended): a commercial solver which is free for academic purposes. After installing it, don’t forget to install the gurobipy package by going to Gurobi’s installation directory (e.g. /opt/gurobi/linux64 on Linux or /Library/gurobi751/mac64/ on macOS) and issuing python setup.py install.
  • SCIP: another solver that is also free for academic purposes and is a bit easier to obtain than Gurobi (no registration required). However, it is slower than Gurobi. Please bear in mind that the results in the paper were generated with Gurobi, although SCIP should provide the same results. Once you have installed SCIP, please install the PySCIPOpt module, which provides the Python bindings for SCIP.
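
Before running Aldy, it can help to confirm that at least one of these Python bindings is visible to the interpreter. The snippet below is a minimal sketch, not part of Aldy: it assumes Python 3 and the usual import names gurobipy and pyscipopt, and the helper name available_solvers is made up for illustration. It only checks that the module can be located, not that the solver itself is licensed or working.

```python
import importlib.util

# Import names of the Python bindings for the two solvers named above
# (assumed to be the standard ones: gurobipy and pyscipopt).
SOLVER_MODULES = {"gurobi": "gurobipy", "scip": "pyscipopt"}


def available_solvers(modules=SOLVER_MODULES):
    """Return the solver names whose Python bindings can be located."""
    return [
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is not None
    ]


if __name__ == "__main__":
    found = available_solvers()
    print("Usable solver bindings:", ", ".join(found) if found else "none")
```

If the list is empty, aldy --test is unlikely to succeed until one of the bindings is installed.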

Sanity check

After installing Aldy and a supported ILP solver, test the installation by running the following command (it should take around a minute):

aldy --test

If everything is set up properly, you should see something like this:

Aldy Sanity-Check Test
Expected result is: *1/*4+*4

*** Aldy v1.2 (Python 3.6) ***
(c) 2017 SFU, MIT & IUB. All rights reserved.
Arguments:
  Gene:      CYP2D6
  Profile:   illumina
...[redacted]...
Result:
  *1/*4+*4                       (1, 4N, 4AW)

Running

Aldy needs an aligned SAM, BAM, CRAM or DeeZ file for the analysis. Below we will use BAM as an example.

It is assumed that reads are mapped to hg19 or GRCh37. hg38 is not yet supported.

An index is needed for BAM files. Get one by running:

samtools index file.bam

Aldy is invoked as:

aldy -p [profile] -g [gene] file.bam

The [gene] parameter indicates the name of the gene to be genotyped. Currently, Aldy supports CYP2D6, CYP2A6, CYP2C19, CYP2C8, CYP2C9, CYP3A4, CYP3A5, CYP4F2, TPMT and DPYD.
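
When Aldy is driven from a script, the invocation above can be assembled programmatically. The following is a minimal sketch under the assumption that aldy is on the PATH; the helper names build_aldy_command and run_aldy are hypothetical, and the gene list is simply copied from the paragraph above.

```python
import subprocess

# Gene names listed in this README.
SUPPORTED_GENES = (
    "CYP2D6", "CYP2A6", "CYP2C19", "CYP2C8", "CYP2C9",
    "CYP3A4", "CYP3A5", "CYP4F2", "TPMT", "DPYD",
)


def build_aldy_command(profile, gene, alignment_path):
    """Return the argv list for `aldy -p [profile] -g [gene] file.bam`."""
    if gene.upper() not in SUPPORTED_GENES:
        raise ValueError("unsupported gene: %s" % gene)
    return ["aldy", "-p", profile, "-g", gene, alignment_path]


def run_aldy(profile, gene, alignment_path):
    """Run Aldy; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(build_aldy_command(profile, gene, alignment_path))
```

For example, build_aldy_command("illumina", "CYP2D6", "file.bam") yields the same arguments as the command line shown above.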

Sequencing profile selection

The [profile] argument refers to the sequencing profile. The following profiles are available:

  • illumina for Illumina WGS (or any uniform-coverage technology) samples.

    It is highly recommended to use samples with at least 40x coverage. Also, anything lower than 20x will result in tears and agony.

  • pgrnseq-v1 for PGRNseq v.1 capture protocol data
  • pgrnseq-v2 for PGRNseq v.2 capture protocol data

If you are using a different technology (e.g. some home-brewed capture kit), you can use it provided that the following requirements are met:

  • all samples have a similar coverage distribution (i.e. two sequenced samples with the same copy number configuration MUST have similar coverage profiles; please consult us if you are not sure about this)
  • your panel includes a copy-number neutral region (currently, Aldy uses CYP2D8 as a copy-number neutral region)

Having said that, you can generate the profile for your panel/technology by running:

# Get the profile
aldy --generate-profile file.bam > my-cool-tech.profile
# Run Aldy
aldy -p my-cool-tech.profile -g [gene] file.bam

Output

Aldy generates two files: [input].[gene].aldy (the location can be changed with the -o parameter) and [input].[gene].aldylog (the location can be changed with the -l parameter). A summary of the results is shown at the end of the output:

$ aldy -p pgrnseq-v2 -g cyp2d6 NA19788_x.bam
*** Aldy v1.0 ***
(c) 2017 SFU, MIT & IUB. All rights reserved.
Arguments:
  Gene:      CYP2D6
  Profile:   pgrnseq-v2
  Threshold: 50%
  Input:     NA19788_x.bam
  Output:    NA19788_x.CYP2D6.aldy
  Log:       NA19788_x.CYP2D6.aldylog
Result:
  *2/*78+*2                      (2MW, 2MW, 78/2|2M)

In this example, the CYP2D6 genotype is *2/*78+*2, expressed in terms of major star-alleles. Minor star-alleles are given in parentheses (in this case, two copies of *2MW, and one copy of the *78 fusion on the *2M background).
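
The major star-allele string can be split mechanically: "/" separates the two haplotypes and "+" separates the gene copies on one haplotype. The sketch below illustrates this reading of the notation; the function names are hypothetical and it is meant only for the major star-allele string, since the minor star-allele list uses "/" differently (e.g. 78/2|2M).

```python
def parse_diplotype(result):
    """Split a result such as '*2/*78+*2' into per-haplotype allele lists."""
    return [
        [allele.strip().lstrip("*") for allele in haplotype.split("+")]
        for haplotype in result.split("/")
    ]


def allele_count(result):
    """Count the star-alleles named in a result string."""
    return sum(len(haplotype) for haplotype in parse_diplotype(result))


if __name__ == "__main__":
    print(parse_diplotype("*2/*78+*2"))
    print(allele_count("*2/*78+*2"))
```

Note that this counts allele names rather than physical gene copies: an allele that denotes a whole-gene deletion is still counted as one entry.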

The explicit decomposition is given in [input].[gene].aldy (in the example above, NA19788_x.CYP2D6.aldy). An example of such a file is:

# Aldy v1.0
# Gene: CYP2D6
# Number of solutions: 1

# Solution 0
# Predicted diplotype: *2/*78+*2
# Composition: 2MW,2MW,78/2|2M
Copy   Allele   Location   Type     Coverage  Effect      dbSNP       Code        Status
0      78/2     42522311   SNP.CT   1760      NEUTRAL     rs12169962  4481:G>A    NORMAL
0      78/2     42522612   SNP.CG   1287      DISRUPTING  rs1135840   4180:G>C    NORMAL
...[redacted]...
1      2MW      42522311   SNP.CT   1760      NEUTRAL     rs12169962  4481:G>A    NORMAL
1      2MW      42527541   DEL.TC   0         NEUTRAL     rs536645539 -750:delGA  MISSING
...[redacted]...

Each solution is introduced by a “Solution” line. The first column (Copy) shows the ordinal number of the allelic copy (e.g. 0, 1 and 2 for 78/2|2M, 2MW and 2MW, respectively, as in the rows shown above). The following columns indicate:

  • star-allele,
  • mutation locus,
  • mutation type (SNP or indel),
  • mutation coverage,
  • mutation functionality:
    • DISRUPTING for a gene-disrupting mutation,
    • NEUTRAL for a neutral mutation,
  • dbSNP ID (if available),
  • traditional Karolinska-style mutation code from CYP allele database, and
  • mutation status, which indicates the status of the mutation in the decomposition:
    • NORMAL: mutation is associated with the star-allele in the database, and is found in the sample
    • NOVEL: gene-disrupting mutation is NOT associated with the star-allele in the database, but is found in the sample (this indicates that Aldy found a novel major star-allele)
    • EXTRA: neutral mutation is NOT associated with the star-allele in the database, but is found in the sample (this indicates that Aldy found a novel minor star-allele)
    • MISSING: neutral mutation is associated with the star-allele in the database, but is NOT found in the sample (this also indicates that Aldy found a novel minor star-allele)
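
The column layout described above lends itself to simple parsing. Below is a minimal sketch of a reader for this file; it is not Aldy's own code. It assumes that every data row has exactly nine whitespace-separated fields, as in the example (how a missing dbSNP ID is written is not shown here, so rows with a different field count are skipped), and the names Mutation and parse_aldy are hypothetical.

```python
from collections import namedtuple

COLUMNS = ["Copy", "Allele", "Location", "Type", "Coverage",
           "Effect", "dbSNP", "Code", "Status"]
Mutation = namedtuple("Mutation", [c.lower() for c in COLUMNS])


def parse_aldy(lines):
    """Return {solution index: [Mutation, ...]} from the lines of a .aldy file."""
    solutions = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("# Solution"):
            current = int(line.split()[-1])
            solutions[current] = []
            continue
        if not line or line.startswith("#") or line.startswith("Copy"):
            continue  # blank line, comment, or column header
        fields = line.split()
        if current is None or len(fields) != len(COLUMNS):
            continue  # layout differs from the assumed nine columns
        copy, allele, location, kind, coverage, effect, dbsnp, code, status = fields
        solutions[current].append(
            Mutation(int(copy), allele, int(location), kind,
                     float(coverage), effect, dbsnp, code, status))
    return solutions


def unexpected(mutations):
    """Keep the rows whose status is NOVEL, EXTRA or MISSING."""
    return [m for m in mutations if m.status != "NORMAL"]
```

With a file opened as f, parse_aldy(f) returns the rows grouped by solution, and unexpected(...) picks out the rows that point to a novel major or minor star-allele.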

Logging

A detailed execution log is written to [input].[gene].aldylog. It is used mainly for debugging purposes.

Sample datasets

Sample datasets are also available for download. They include:

  • HG00463 (PGRNseq v.2), containing CYP2D6 configuration with multiple copies
  • NA19790 (PGRNseq v.2), containing a fusion between CYP2D6 and CYP2D7 (*78 allele)
  • NA24027 (PGRNseq v.1), containing novel DPYD allele and multiple copies of CYP2D6
  • NA10856 (PGRNseq v.1), containing CYP2D6 deletion (*5 allele)
  • NA10860 (Illumina WGS), containing 3 copies of CYP2D6. This sample contains only CYP2D6 region.

Expected results are:

Gene (-g)  HG00463          NA19790    NA24027   NA10856  NA10860
CYP2D6     *36+*10/*36+*10  *1/*78+*2  *6/*2+*2  *1/*5    *1/*4+*4
CYP2A6     *1/*1            *1/*1      *1/*35    *1/*1
CYP2C19    *1/*3            *1/*1      *1/*2     *1/*2
CYP2C8     *1/*1            *1/*3      *1/*3     *1/*1
CYP2C9     *1/*1            *1/*2      *1/*2     *1/*2
CYP3A4     *1/*1            *1/*1      *1/*1     *1/*1
CYP3A5     *3/*3            *3/*3      *1/*3     *1/*3
CYP4F2     *1/*1            *3/*4      *1/*1     *1/*1
TPMT       *1/*1            *1/*1      *1/*1     *1/*1
DPYD       *1/*1            *1/*1      *4/*5     *5/*6
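
To check your own runs against these values, a small comparison helper can be useful. The sketch below covers only the CYP2D6 row and uses hypothetical function names. It assumes that the order of the two haplotypes around "/" is not significant, so that *1/*5 and *5/*1 are treated as the same call; the order of copies within a haplotype is left untouched.

```python
# Expected CYP2D6 calls for the sample datasets, copied from the table above.
EXPECTED_CYP2D6 = {
    "HG00463": "*36+*10/*36+*10",
    "NA19790": "*1/*78+*2",
    "NA24027": "*6/*2+*2",
    "NA10856": "*1/*5",
    "NA10860": "*1/*4+*4",
}


def normalise(diplotype):
    """Sort the haplotypes so that '*1/*5' and '*5/*1' compare equal."""
    return sorted(diplotype.strip().split("/"))


def matches_expected(sample, called):
    """Check a called CYP2D6 diplotype against the expected one."""
    return normalise(called) == normalise(EXPECTED_CYP2D6[sample])
```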

License

© 2017 Simon Fraser University, Indiana University Bloomington & Massachusetts Institute of Technology. All rights reserved.

Aldy is NOT free software. The complete legal license is available in the LICENSE file.

For non-legal folks, here is the TL;DR version:

  • Aldy can be freely used in academic and non-commercial environments
  • Please contact us if you intend to use Aldy for any commercial purpose

Parameter documentation

NAME

aldy — perform allelic decomposition and exact genotyping on next-generation sequencing data

SYNOPSIS

aldy -h
aldy --test
aldy --license
aldy --generate-profile
aldy --show-cn --gene
aldy [--threshold THRESHOLD] [--profile PROFILE]
     [--gene GENE] [--verbosity VERBOSITY] [--log LOG]
     [--output OUTPUT] [--solver SOLVER] [--phase] [--cn CN]
     [file]

OPTIONS

Positional arguments:

  • file
    SAM, BAM, CRAM or DeeZ input file

Optional arguments:

  • -h, --help
    Show the help message and exit
  • -T, --threshold THRESHOLD
    Cut-off rate for variations (percent per copy)
    default: 50
  • -p, --profile PROFILE
    Sequencing profile. Currently, only “illumina”, “pgrnseq-v1” and “pgrnseq-v2” are supported. Please check --generate-profile for more information on how to use your own profile
  • -g, --gene GENE
    Gene profile
    default: CYP2D6
  • -v, --verbosity VERBOSITY
    Logging verbosity. Acceptable values are T (trace), D (debug), I (info) and W (warn)
    default: I
  • -l, --log LOG
    Location of the output log file
    default: [input].[gene].aldylog
  • -o, --output OUTPUT
    Location of the output file
    default: [input].[gene].aldy
  • -s, --solver SOLVER
    IP Solver. Currently supported solvers are Gurobi and SCIP
    default: any
  • -P, --phase
    Phase reads
    default: no (slows down the pipeline)
  • --license
    Print Aldy license
  • --test
    Sanity-check on NA10860 sample
  • --show-cn
    Show all copy number configurations supported by a gene (requires -g)
  • -c, --cn CN
    Manually set copy number (input: a comma-separated list CN1,CN2,…). For a list of supported configurations, please run aldy --show-cn
  • --generate-profile
    Generate the copy-number profile for the custom sequencing panel and print it on the standard output

Contact & Bug Reports

Ibrahim Numanagić

If you have an urgent question, I suggest using e-mail, as GitHub issues are not handled as quickly as e-mail requests.