The importance of provenance for cancer genomics

(Co-written with Michael Armbrust)

To motivate our research agenda, the AMP Lab has partnered with several other groups that have “big data” problems, including traffic prediction, urban simulation, and cancer genomics. For example, we are collaborating with Taylor Sittler, a computer-savvy pathology resident at UCSF, on problems such as storing sequencing data and providing efficient access to it, as well as extracting insights from a variety of clinical datasets. While this collaboration is still in its early phases, students in the lab have already started to port segments of the sequencing pipeline to take advantage of the computation speed provided by Spark, and we are planning to offer a graduate seminar on the topic this fall.

Genomics data comes from a vast number of sources (for example, the TCGA and the 1000 Genomes project), varies substantially in schema and semantics, and is being produced at an ever-increasing rate. Additionally, questions about which treatment a patient should receive must be answered very quickly if patients are to be given effective care. These factors make working with genomic data a great fit for the goals of the lab.

We recently came across an article in the New York Times about a disappointment in research on cancer genomics. Some researchers at Duke University created tests to determine which treatments a particular patient should receive based on the tumor’s genetics. Statisticians asked to check into the data analysis found glaring errors. However, no one paid much attention to their findings until the community learned that the research’s lead author had falsified his resume. The research was later discredited. Tragically, a patient treated according to the tests died; her family ended up suing Duke.

For the AMP Lab, we have several takeaways from this article. We can’t expect our application partners to blindly trust the results of any big data analysis. The article quotes a senior scientist who states that because the workflows used in genomic analysis are complex and involve different types of expertise (biology and systems), no single person can be expected to understand the whole thing. Another scientist, one of the statisticians involved in discrediting the Duke work, points out that intuition often fails when looking at such large and diverse datasets.

Therefore, simply providing an answer (e.g., treat patient X with drug Y) is insufficient; we also need to support our conclusions. In this case, the Duke scientists were dealing with a rather small amount of data; in fact, it could all fit in a spreadsheet, which made it fairly easy for the statisticians checking the work to find errors. Spreadsheets are actually a success story in this case, as formulae make it easy to see how answers are derived from base data.

However, with big, heterogeneous data and complex workflows, it won’t be so easy to find errors. Thus, one of the goals of the AMP Lab is to develop tools that help our users debug dataflows and record result provenance, while remaining as straightforward to use as if the data were small (i.e., as if it could fit in a spreadsheet).
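
To make "recording result provenance" a bit more concrete, here is a minimal sketch in Python of the spreadsheet idea applied to a small dataflow: every derived value remembers the step and the inputs that produced it, much as a cell remembers its formula. This is only an illustration, not a tool from the lab. The names `Traced`, `source`, and `apply` are hypothetical helpers we made up for this example, and the numbers are invented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Traced:
    """A value bundled with a record of how it was derived (hypothetical helper)."""
    value: Any
    op: str = "source"                 # name of the step that produced the value
    inputs: Tuple["Traced", ...] = ()  # upstream values the step consumed

    def lineage(self, depth: int = 0) -> str:
        """Render the derivation as an indented tree, final result first."""
        lines = ["  " * depth + f"{self.op} -> {self.value!r}"]
        for parent in self.inputs:
            lines.append(parent.lineage(depth + 1))
        return "\n".join(lines)


def source(name: str, value: Any) -> Traced:
    """Wrap a piece of base data, tagging it with the name of its origin."""
    return Traced(value, op=f"source:{name}")


def apply(op: str, fn: Callable[..., Any], *inputs: Traced) -> Traced:
    """Run one step on the raw input values and record which inputs it used."""
    return Traced(fn(*(i.value for i in inputs)), op=op, inputs=inputs)


# Invented numbers purely for illustration; not real clinical data.
expression = source("expression_levels", [2.0, 4.0, 6.0])
threshold = source("threshold", 3.5)
mean_expr = apply("mean", lambda xs: sum(xs) / len(xs), expression)
flag = apply("exceeds_threshold", lambda m, t: m > t, mean_expr, threshold)

# Print the full derivation of the final answer, one step per line.
print(flag.lineage())
```

Calling `flag.lineage()` walks back from the final answer to the base data, so a reviewer can check each intermediate step rather than trusting the result blindly. A real system would also need to handle large, distributed datasets and far more complex workflows, which this sketch ignores entirely.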