hail-vep-pipeline

AWS RODA Documentation

VEP and LOFTEE Plugin

Background

Hail is an open-source, general-purpose, Python-based data analysis library with additional data types and methods for working with genomic data. Hail is built to scale horizontally as workloads grow, and has strong support for multi-dimensional, structured data such as the genomic data in a genome-wide association study (GWAS). Maintained by the Broad Institute, Hail has been widely adopted in academia and industry.

Hail can annotate variants with the vep() method, which runs the Variant Effect Predictor (VEP) from Ensembl. VEP can in turn use a plugin called LOFTEE (Loss-Of-Function Transcript Effect Estimator). Both packages (VEP and LOFTEE) are required for certain deployments of Hail on Amazon EMR, and are hosted on Amazon Web Services (AWS) in Amazon S3.
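As a rough illustration of the vep() call described above, the following is a minimal sketch rather than the project's own code. The function name annotate_with_vep and both path arguments are hypothetical; it assumes a Hail cluster where VEP is already installed and a VEP configuration file in the JSON format Hail expects is available.

```python
def annotate_with_vep(vcf_path, vep_config_path, reference_genome="GRCh38"):
    """Import a VCF and annotate its variants with VEP through Hail.

    Hypothetical helper for illustration: `vcf_path` and `vep_config_path`
    are placeholders for locations on your own cluster or in S3.
    """
    # Imported lazily so this sketch can be loaded without Hail installed.
    import hail as hl

    # Read the VCF into a MatrixTable keyed by locus and alleles.
    mt = hl.import_vcf(vcf_path, reference_genome=reference_genome)

    # hl.vep runs VEP on each variant and adds a `vep` row field
    # containing the parsed annotations.
    return hl.vep(mt, config=vep_config_path)
```

Calling annotate_with_vep only works where Hail can reach a working VEP installation; whether LOFTEE output appears in the annotations depends on the plugin being enabled in the configuration file.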

Variant Effect Predictor (VEP) Cache

The Variant Effect Predictor (VEP) from Ensembl “determines the effects of your variants (SNPs, insertions, deletions, CNVs, or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.” Using a cache is the most efficient way to run VEP.

The vep folder in this dataset contains caches for several recent versions of VEP.
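To show how a downloaded cache is typically handed to VEP, here is a small sketch that assembles a command line using VEP's cache-related options (--cache, --offline, --dir_cache, --cache_version, --species, --assembly). The helper name build_vep_command, the file paths, and the version number 104 are assumptions made for illustration; the cache version must match the VEP release actually installed.

```python
import shlex


def build_vep_command(input_vcf, output_path, cache_dir, cache_version,
                      assembly="GRCh38", species="homo_sapiens"):
    """Return a VEP argument list that reads annotations from a local cache."""
    return [
        "vep",
        "--input_file", input_vcf,
        "--output_file", output_path,
        "--cache",                      # use the cache instead of the database
        "--offline",                    # never open a database connection
        "--dir_cache", cache_dir,       # directory holding the unpacked cache
        "--cache_version", str(cache_version),
        "--species", species,
        "--assembly", assembly,
    ]


# Hypothetical paths and version, for illustration only.
cmd = build_vep_command("sample.vcf", "sample.vep.txt", "/opt/vep/cache", 104)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting issues.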

Loss-Of-Function Transcript Effect Estimator (LOFTEE)

The loftee-data folder in this dataset contains optional data from the LOFTEE project for use by the Hail on Amazon EMR project. Further instructions on usage can be found in the project repository.
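For orientation, VEP plugins are enabled with a --plugin argument of the form Name,key:value,..., and LOFTEE is registered under the name LoF. The sketch below builds such an argument. The helper name loftee_plugin_args and the paths are hypothetical, and the option keys that apply (for example human_ancestor_fa or conservation_file) depend on the LOFTEE version and genome build, so treat this as an assumption to check against the LOFTEE documentation.

```python
def loftee_plugin_args(loftee_path, options):
    """Build the VEP --plugin argument that enables LOFTEE.

    `options` maps LOFTEE option names to file paths; which keys are
    valid depends on the LOFTEE release in use.
    """
    parts = ["LoF", f"loftee_path:{loftee_path}"]
    parts += [f"{key}:{value}" for key, value in options.items()]
    return ["--plugin", ",".join(parts)]


# Hypothetical local paths, for illustration only.
args = loftee_plugin_args(
    "/opt/vep/loftee",
    {"human_ancestor_fa": "/opt/vep/loftee-data/human_ancestor.fa.gz"},
)
print(args)
```

The resulting two-element list can be appended to an existing VEP argument list.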