The importance of provenance for cancer genomics

(Co-written with Michael Armbrust)

In order to motivate our research agenda, the AMP Lab has partnered with several other groups that have “big data” problems, including traffic prediction, urban simulation, and cancer genomics.  For example, we are collaborating with Taylor Sittler, a computer-savvy pathology resident at UCSF, on problems such as efficient storage of and access to sequencing data, as well as extracting insights from a variety of clinical datasets.  While this collaboration is still in its early phases, students in the lab have already started porting segments of the sequencing pipeline to take advantage of the computation speed provided by Spark, and we plan to offer a graduate seminar on the topic this fall.

Genomics data comes from a vast number of sources, varies substantially in schema and semantics, and is being produced at an ever-increasing rate (for example, by TCGA and the 1000 Genomes Project).  Additionally, the time to answer questions about what treatment a patient should receive must be very short if patients are to be given effective care.  These factors make working with genomic data a great fit for the goals of the lab.

We recently came across an article in the New York Times about a disappointment in research on cancer genomics.  Researchers at Duke University created tests to determine which treatments a particular patient should receive based on the tumor’s genetics.  Statisticians asked to check the data analysis found glaring errors.  However, no one paid much attention to their findings until the community learned that the research’s lead author had falsified his resume.  The research was later discredited.  Tragically, a patient treated according to the tests died; her family ended up suing Duke.

For the AMP Lab, we have several takeaways from this article.  We can’t expect our application partners to blindly trust the results of any big data analysis.  The article quotes a senior scientist who states that because the workflows used in genomic analysis are complex and involve different types of expertise (biology and systems), no single person can be expected to understand the whole thing.  Another scientist, one of the statisticians involved in discrediting the Duke work, points out that intuition often fails when looking at such large and diverse datasets.

Therefore, simply providing an answer (e.g., treat patient X with drug Y) is insufficient; we also need to support our conclusions.  In this case, the Duke scientists were dealing with a rather small amount of data; in fact, it could all fit in a spreadsheet, which enabled the statisticians asked to check the work to find errors fairly easily.  Spreadsheets are actually a success story here, as formulae make it easy to see how answers are derived from base data.

However, with big, heterogeneous data and complex workflows, it won’t be so easy to find errors.  Thus, one of the goals of the AMP Lab is to develop tools that help our users debug dataflows and record result provenance, while remaining as straightforward to use as if the data were small (i.e., if it could fit in a spreadsheet).
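To make the spreadsheet analogy concrete, here is a minimal sketch (not an AMP Lab tool; the `Tracked` class and `tracked_op` helper are invented for illustration) of what record-level provenance looks like: each derived value remembers its inputs and the operation that produced it, so any result can be traced back to base data, much as a spreadsheet cell can be traced through its formula.

```python
# A minimal sketch of result provenance: every derived value records the
# operation that produced it and the values it was derived from, so a
# final answer can be traced back to its base data.

class Tracked:
    def __init__(self, value, op="input", sources=()):
        self.value = value        # the actual data value
        self.op = op              # operation that produced this value
        self.sources = sources    # Tracked inputs it was derived from

    def lineage(self, depth=0):
        """Return a human-readable trace of how this value was computed."""
        pad = "  " * depth
        lines = [f"{pad}{self.op} -> {self.value}"]
        for s in self.sources:
            lines.extend(s.lineage(depth + 1))
        return lines

def tracked_op(name, fn, *inputs):
    """Apply fn to the raw values and record the inputs as provenance."""
    return Tracked(fn(*(t.value for t in inputs)), op=name, sources=inputs)

# Example with made-up numbers: a score derived from two base measurements.
a = Tracked(4.0)
b = Tracked(2.0)
ratio = tracked_op("divide", lambda x, y: x / y, a, b)
print("\n".join(ratio.lineage()))
```

In a real dataflow system the same bookkeeping would be attached to distributed operators rather than in-memory values, but the principle is identical: the derivation, not just the answer, is part of the output.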

AMP – A holistic approach to making sense of Big Data


Big Data is all the rage right now – articles in newspapers and the popular press, a steady stream of books about “Super-crunchers” and “Numerati”, conferences on Big Data, new and newly-acquired companies, and the appearance of a class of workers called Data Scientists are just some of the most visible manifestations of this trend.

Why all the excitement?   First of all, the amount of data available to be analyzed continues to grow as more and more human activity moves on-line.   Second, continued cost improvements in processing power and storage, advances in scalable software architectures, and the emergence of cloud computing are enabling organizations of all types to consider trying to work with this data.  This combination of data and analytics capacity holds the promise of enabling data-driven decision making – a holy grail of information technology from its earliest days.

Much of the industry and research activity around Big Data is focused on scalability to address the increasing volume of data.   But data size is a problem for which there are known solutions.   The truly hard problems in data analytics come from other causes.   Some of the hardest stem from data heterogeneity: as more and more information is collected from more and more places, making sense of that information in a unified way becomes increasingly difficult.

Another source of problems comes not from the data but from the queries or questions asked over that data.  Queries are often exploratory in nature and may be under- or over-specified.  Furthermore, as one looks for more and more needles in bigger and bigger haystacks, it becomes possible to find evidence to support all sorts of hypotheses, even if that evidence is the result of randomness rather than actual causality.   Finally, as we cast a wider net for both data and queries, we must deal with increasing ambiguity and must support decision making and predictions over incomplete, uncertain data.
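The needles-in-haystacks point is easy to demonstrate.  The following toy sketch (the sizes and the 0.3 correlation threshold are arbitrary choices for illustration) generates purely random “features” for a set of purely random outcomes; because so many hypotheses are tested, a handful of features look strongly associated with the outcome by chance alone.

```python
# Toy illustration of spurious findings: with enough purely random
# features, some will correlate with a random outcome by chance alone.
import random

rng = random.Random(0)
n_patients, n_features = 50, 2000
outcome = [rng.random() for _ in range(n_patients)]

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Count random features that clear a threshold that would usually be
# taken as evidence of a real association.
strong = sum(
    1
    for _ in range(n_features)
    if abs(corr([rng.random() for _ in range(n_patients)], outcome)) > 0.3
)
print(f"{strong} of {n_features} random features pass the threshold")
```

Dozens of the 2000 noise features typically clear the bar, which is exactly why large-scale analyses need principled error control rather than intuition.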

Addressing these problems requires more than simply scaling up existing systems and data analysis algorithms.  Instead, what is needed is a rethinking of how analytics platforms are built.   Stepping back a bit, the key resources we have for making sense of data are Algorithms, in the form of statistical machine learning and other data analytics methods; Machines, in the form of cluster and cloud computing; and People, who not only provide data and pose questions, but who can also help with understanding the data and can address the “ML-Hard” questions that are not currently solvable by computers alone.

[Figure: AMP Dimensions]

Most existing data-centric systems emphasize one of the three key resources.   The premise behind the AMPLab is that a more holistic approach is needed.   In other words, we want to attack the Big Data analytics problem by developing systems that more closely blend the resources of Algorithms, Machines, and People.

One way to think of this is as a “co-design” approach.   Rather than inventing new algorithms or designing new computing platforms in isolation, we consider all of these resources together.   For example, the CrowdDB system invokes crowdsourcing to find missing database data, perform entity resolution, and make subjective comparisons.   CrowdDB is a combination of M (a traditional relational database system) and P (human input, obtained through Amazon Mechanical Turk and our own mobile platform) with a bit of A thrown in to do data cleaning and quality assessment.   As another example, recent progress by our machine learning group on parallelizing the Bootstrap method for confidence estimation is an instance of designing a new Algorithmic approach to cope with the emerging Machine environment of cluster computing.

The challenge of this type of integrated work is to get people whose expertise typically lies along one of the A, M, and P dimensions to be able to effectively work together to solve big problems.   We believe that the AMPLab is uniquely positioned to enable this collaboration, and as such, we look forward to contributing to the progress in making sense of Big Data.

Welcome to AMP Blab


Welcome to the AMPLab blog.  We’ll be using this space to describe our ongoing research and results; to give reports on events in our lab, department, and elsewhere at Berkeley; and to discuss trends and news items related to Big Data in general.   We’ll have contributions from all the researchers in the lab, and hopefully from some of our application and industrial partners as well.   We look forward to being a part of the growing, global conversation about data analysis in research and in practice.