DNA Processing Pipeline

Another genomics effort underway at the AMP Lab is the development of a variant calling pipeline.  Variant calling is the process of translating the output of DNA sequencing machines (short reads) into a summary of the unique characteristics of the individual being sequenced (variants).  Variants are reported as differences between the individual and a reference genome.

SNAP, another AMP Lab project, is the first step of this pipeline.  SNAP performs sequence alignment, whereby each short read is assigned the location in the reference genome that it most closely matches.  The rest of the variant calling pipeline combines the information scattered through the aligned reads into a complete picture of the individual’s unique genome.
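The two stages described above, alignment followed by combining the aligned reads, can be illustrated with a toy sketch. This is not how SNAP or the rest of the pipeline is implemented: it assumes a brute-force, ungapped search that minimizes mismatches, and a simple majority vote over the bases covering each position. The function names, the `max_mismatches` parameter, and the sample sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def align_read(read, reference):
    """Return (position, mismatches) for the best ungapped placement of read.

    Brute force for illustration only; real aligners use an index.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def call_variants(reads, reference, max_mismatches=1):
    """Pile up aligned reads and report positions whose most common base
    differs from the reference, as (position, reference_base, called_base)."""
    pileup = defaultdict(Counter)
    for read in reads:
        pos, mismatches = align_read(read, reference)
        if mismatches > max_mismatches:
            continue  # discard reads that do not match anywhere closely
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        base, _count = pileup[pos].most_common(1)[0]
        if base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


# Hypothetical data: the individual differs from the reference at position 7.
reference = "ACGTACGTTAGC"
reads = ["ACGTAC", "TACGAT", "CGATAG", "GATAGC"]
print(call_variants(reads, reference))
```

In this sketch three of the four reads cover position 7 and all carry an A where the reference has a T, so the single variant is reported as a difference from the reference, which is the form of output described above.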

A second component is ADAM, a new format for storing genomic information. ADAM is a cluster-friendly storage format that embraces modern systems technology to accelerate other steps of the genomic processing software pipeline. For example, ADAM executes two of the most expensive steps 110 times faster using an 82-node cluster.

Another expensive step in a genomics pipeline is variant calling itself: identifying the differences between the standard human reference and each person. Alas, it is slow, taking hundreds of hours per genome. Papers proposing new variant callers typically use unique data sets and metrics, as genetics benchmarks do not exist. Hence, an important question for any pipeline is its accuracy: how correct are the variants that it calls? This is difficult to determine when a pipeline is applied to real data, since we don’t know the ground truth. Thus, we are developing SMaSH, a benchmark suite for evaluating variant calling pipelines with appropriate evaluation metrics. As there is no real ground truth for genetics (the technology cannot yet specify 3B base pairs perfectly), benchmarking is trickier than in CS. Just as CS fields accelerate when benchmarks are embraced, we hope that SMaSH will accelerate variant calling.
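To make the accuracy question concrete, here is a minimal sketch of how called variants could be scored when a trusted truth set is available. Precision and recall are assumed here as common choices; the text above does not say which metrics SMaSH uses. The function name `evaluate_calls` and the sample variants are hypothetical.

```python
def evaluate_calls(called, truth):
    """Compare called variants against a truth set.

    Each variant is a (position, reference_base, alternate_base) tuple.
    Returns counts plus precision and recall.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # variants correctly called
    false_pos = len(called - truth)   # called, but not in the truth set
    false_neg = len(truth - called)   # in the truth set, but missed
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical truth set and pipeline output.
truth = [(7, "T", "A"), (20, "G", "C"), (31, "A", "G"), (58, "C", "T")]
called = [(7, "T", "A"), (20, "G", "C"), (45, "C", "T")]
scores = evaluate_calls(called, truth)
print(scores)
```

A score like this is only as reliable as the truth set behind it, which is why the lack of real ground truth makes benchmarking variant callers difficult.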

This pipeline is an important part of our effort regarding cancer genomics.  To analyze a tumor and identify important mutations that are relevant to choosing a treatment, the raw output of a DNA sequencing machine must be processed via this pipeline.