AMP BLAB: The AMPLab Blog

SNAP Helps Save A Life

Patterson

When we got started in genomics 3 years ago, I wrote an essay in the New York Times arguing that computer scientists have a lot to bring to the fight against deadly diseases like cancer. (This hypothesis was not universally embraced in all fields.)

The good news is that we have already had a success: using SNAP for infectious disease diagnosis has helped save a life.

There are a number of patients in hospitals with undiagnosed inflammatory reactions and infectious symptoms whose source cannot be identified by existing tests. A generic test to identify all organisms using RNA sequencing has been developed and is being piloted by Dr. Charles Chiu at UCSF. SNAP is critical to the implementation of this test: it rapidly filters all human sequence from the large resulting datasets (up to 100 million sequences), enabling the identification of pathogenic sequence quickly enough to treat patients effectively. In the US, 20,000 people, mostly children, are hospitalized each year with brain-swelling encephalitis, and 60% of the time doctors never find the cause, which can lead to serious consequences.

Joshua Osborn, now 15, outside his home in Cottage Grove, Wis.

This tool was recently used to successfully diagnose and treat Joshua Osborn, a teenager with severe combined immunodeficiency who lives in Wisconsin. He went to the hospital repeatedly and was eventually hospitalized for five weeks without a successful diagnosis. He developed brain seizures and was placed in a medically induced coma. In desperation, his doctors asked his parents for permission to try one more experimental test. His parents agreed, so doctors sampled his spinal fluid and sent it to UCSF for sequencing and analysis.

The speed and accuracy of SNAP helped UCSF quickly filter out the human reads. In just two days they identified a rare infectious bacterium, whose sequences made up only 0.02% of the original 3 million reads. The boy was then treated with antibiotics for 10 days; he awoke and was discharged from the hospital four weeks later. Although our software is only part of this process, without SNAP it would not be practical to perform a general infectious disease test like this, one that does not require first guessing the causative agent. That is why such tests are not yet more broadly available.

Quoting from the UCSF press release, referring indirectly in part to the speed and accuracy of SNAP:

“This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete,” Chiu said.

The abstract and last paragraph of the NEJM article tell the story with more medical accuracy and brevity:

A 14-year-old boy with severe combined immunodeficiency presented three times to a medical facility over a period of 4 months with fever and headache that progressed to hydrocephalus and status epilepticus necessitating a medically induced coma. Diagnostic workup including brain biopsy was unrevealing. Unbiased next-generation sequencing of the cerebrospinal fluid identified 475 of 3,063,784 sequence reads (0.016%) corresponding to leptospira infection. Clinical assays for leptospirosis were negative. Targeted antimicrobial agents were administered, and the patient was discharged home 32 days later with a status close to his premorbid condition. Polymerase-chain-reaction (PCR) and serologic testing at the Centers for Disease Control and Prevention (CDC) subsequently confirmed evidence of Leptospira santarosai infection.

… In summary, unbiased next-generation sequencing coupled with a rapid bioinformatics pipeline provided a clinically actionable diagnosis of a specific infectious disease from an uncommon pathogen that eluded conventional testing for months after the initial presentation. This approach thus facilitated the use of targeted and efficacious antimicrobial therapy.

In a separate communication, Chiu said that in the United States there are about 15,000 cases a year of brain-swelling encephalitis, with 2,000 deaths, and that more than 70% of the deaths go undiagnosed. Assuming doctors are able to obtain actionable diagnoses from the information, SNAP plus the software developed at UCSF to identify the non-human reads (SURPI) could potentially save the lives of hundreds of encephalitis patients annually in the US alone. Worldwide, encephalitis is a huge problem; there are probably about 70,000 diagnosed cases a year with 25,000 deaths. Even 25,000 is certainly a gross underestimate, because most cases worldwide are in rural areas and do not receive hospital care.

Here are links to

  • the New York Times article,
  • the press release from UCSF,
  • the New England Journal of Medicine article that describes the boy’s treatment, and
  • the technical paper in Genome Research that describes the UCSF work that discovered the disease, which discusses the use of SNAP and SURPI and cites our paper.

New job opportunity for a Solutions Architect

massie

The AMPLab participates in the Defense Advanced Research Projects Agency (DARPA) XDATA program, which supports the development of open-source systems for processing, analyzing and visualizing large, imperfect and incomplete data. Many of the XDATA teams rely on the Berkeley Data Analytics Stack (BDAS) as the foundation of their XDATA applications.

The AMPLab is looking for a Solutions Architect to work side-by-side with DARPA teams in application design and deployment on DARPA infrastructure. This is a great opportunity to work with a diverse group of researchers and software engineers in government, academia and industry to solve important real-world problems using open-source software.

The AMPLab has a culture of “casual intensity” which will be familiar to anyone from a tech startup. We work hard and play hard: flexible work hours, work-from-home days, team lunches and retreats twice a year in Tahoe and Santa Cruz. Retreats are attended by our sponsors, who give us feedback on the tough data problems they’d like us to solve. This close connection to government and industry has allowed us to create systems, like Spark, Tachyon, Shark, GraphX and MLbase, with outstanding performance and features. All BDAS components are released under Apache, BSD or similar open-source licenses.

The AMPLab also collaborates with other data science teams across the Berkeley campus, comprising experts in a broad range of specialties including computer science, statistics, data visualization and the social sciences. The Berkeley Institute for Data Science (BIDS) is a new initiative focused on developing innovative partnerships to advance technologies that support advanced data management and data analytic techniques.

This position requires occasional travel to Washington, D.C. for the DARPA XDATA summer workshops. These workshops bring together teams from across the U.S. to build end-to-end solutions to specific data challenges. The AMPLab will, of course, cover all travel expenses. These onsite visits are an important part of creating systems that integrate well. DARPA also brings in guest speakers from other agencies, such as the Pentagon, the DEA and others, to talk about the data challenges they face.

To apply, visit the UC Berkeley Jobs website and do a keyword search for 17573, the Job ID for this position.

Benefiting from Science is a Human Right!

Patterson

I was just at the Global Alliance for Genomics and Health (GA4GH) meeting last week in London, and a person from the regulatory and ethics group quoted the United Nations on the right of people to benefit from scientific progress:

Article 27.1 of The Universal Declaration of Human Rights states:
“Everyone has the right freely to participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits.”

As almost all nations have signed this treaty, it has the power of law.

This declaration is a big deal in genomics because, under the argument of protecting their privacy, some experts would prevent people from benefiting from scientific progress by preventing them from sharing their genomes with scientists. These experts argue that people don’t understand what they are doing, and hence must be protected from themselves. Article 27.1 is a counterargument: participating in and benefiting from scientific progress is a human right, and one must be careful not to trample human rights.

PLANET: Making Progress with Commits in Unpredictable Environments

gpang

Recent trends in databases, such as multi-tenancy and cloud databases, can increase the delays and the variation in transaction response times. For geo-replicated databases, the issue is even worse, since the network delays between data centers can be very unpredictable. Developers have two options when transactions are delayed unpredictably: they may choose to wait indefinitely for the transaction to complete, or they may time out early and be uncertain of the transaction outcome. Neither outcome is desirable, so we propose a new transaction programming model to help with this situation.

We propose PLANET (Predictive Latency-Aware NEtworked Transactions), a new transaction programming model that helps developers write flexible transactions. Using PLANET, developers can write transactions that better deal with the variation in latencies possible in modern database environments. PLANET gives the developer more information about the progress and stages of a transaction, so the developer can define how the transaction should behave in different situations. By exposing these details, PLANET enables developers to express user-defined commits for their transactions. PLANET also exposes a commit-likelihood calculation for transactions, which can be used for user-defined commits as well as admission control. Using these features, systems can provide responsive transactions, as well as better system utilization for cloud databases.
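
To make user-defined commits concrete, here is a minimal hypothetical sketch in Scala. Every name in it (Txn, TxnStatus, submit) is invented for illustration and is not PLANET's actual interface; it only shows the shape of a commit handler that reacts to a transaction's stage and predicted commit likelihood:

    // Hypothetical sketch of a PLANET-style user-defined commit.
    // Every name below is invented for illustration, not PLANET's real API.
    case class TxnStatus(accepted: Boolean, commitLikelihood: Double)

    trait Txn {
      def debit(account: String, amount: Int): Unit
      def credit(account: String, amount: Int): Unit
      // Submit the transaction; once the timeout expires, hand the current
      // stage and predicted commit likelihood to the user-defined handler.
      def submit(timeoutMs: Int)(handler: TxnStatus => Unit): Unit
    }

    def transfer(txn: Txn, from: String, to: String, amount: Int): Unit = {
      txn.debit(from, amount)
      txn.credit(to, amount)
      txn.submit(timeoutMs = 500) {
        case TxnStatus(true, p) if p > 0.95 =>
          println("Very likely to commit: acknowledge the user immediately")
        case TxnStatus(true, _) =>
          println("Accepted but uncertain: keep waiting for the final outcome")
        case TxnStatus(false, _) =>
          println("Not yet accepted: report an unknown outcome instead of hanging")
      }
    }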

More information about PLANET can be found at our PLANET website. There, we have a simple interactive visualization to demonstrate how some of the features of PLANET work. We also have examples of several use cases of the transaction programming model.

[Figure: the stages of a PLANET transaction]

New BDAS Features

jegonzal

In this post we briefly describe four of the newest features in the Berkeley Data Analytics Stack (BDAS):

1. GraphX: large-scale interactive graph analytics
2. BlinkDB: real-time analytics through bounded approximations
3. MLbase: scalable machine learning library
4. Tachyon: reliable data sharing at memory speed across cluster frameworks

GraphX

GraphX is a distributed interactive graph computation system integrated with Spark. GraphX exposes a new API that treats tables and graphs as composable objects, enabling users to easily construct graphs from tables, join graphs and tables, and apply iterative graph algorithms (e.g., PageRank and community detection) using Pregel-like operators. The GraphX system combines recent advances in graph processing systems with distributed join optimizations and lineage to efficiently execute distributed graph computation in the context of a fully fault-tolerant data-parallel platform. On top of GraphX we have built a collection of useful tools for graph analytics.
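
For a flavor of the API, here is a minimal sketch of running PageRank with GraphX from Scala; the input path is a placeholder, and exact package and method names may differ across GraphX versions:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader

    val sc = new SparkContext("local[4]", "graphx-pagerank")
    // Build a graph from an edge-list file ("srcId dstId" per line; path assumed)
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    // Iterate PageRank until the per-vertex change drops below the tolerance
    val ranks = graph.pageRank(0.001).vertices
    // Show the five highest-ranked vertices
    ranks.top(5)(Ordering.by(_._2)).foreach(println)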

BlinkDB

BlinkDB is a large-scale data warehouse system built on Shark and Spark that aims to achieve real-time (i.e., sub-second) query response times for a variety of SQL-based aggregation queries (augmented by a time and/or error bound) on massive amounts of data. This is enabled by not looking at all the data, but rather operating on statistical samples of the underlying datasets. More precisely, BlinkDB gives the user the ability to trade the accuracy of the results against the time it takes to compute queries. The challenge is to ensure that query results are still meaningful, even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that tell us the quality of the sampled results.
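
As an illustration, the sketch below issues two bounded queries in roughly the syntax shown in the BlinkDB paper; the table and column names are made up, and using a Shark-style context with sql2rdd as the entry point is an assumption:

    // BlinkDB-style bounded queries (table/column names are made up, and
    // sharkContext/sql2rdd is an assumed Shark-style entry point).
    // Bound the relative error at a given confidence level:
    val approxAvg = sharkContext.sql2rdd(
      "SELECT avg(sessionTime) FROM sessions " +
      "WHERE city = 'San Francisco' " +
      "ERROR 0.1 CONFIDENCE 95.0%")
    // ...or bound the response time and accept whatever accuracy that buys:
    val fastCount = sharkContext.sql2rdd(
      "SELECT count(*) FROM sessions WITHIN 2 SECONDS")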

MLbase

The MLbase project aims to provide fast, scalable, and easy-to-use machine learning on top of Spark, and is composed of three core components. MLlib, the first production-ready component of the MLbase project, is the standard library for machine learning on Spark. Released as part of Spark 0.8.0, MLlib includes fast, scalable algorithms for classification, regression, clustering, collaborative filtering, and convex optimization. The second component, MLI, is an API for distributed machine learning; by offering an API familiar to machine learning developers, MLI provides a DSL for developing new machine learning algorithms on top of Spark. The final component, the MLbase Optimizer, attempts to automate the process of model selection and ML pipeline construction, and is an area of active research within the AMPLab.
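
As a quick taste of MLlib, here is a minimal sketch of training a classifier against the Spark 0.8-era API, where LabeledPoint takes a plain Array[Double] of features; the input path and file format are assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext("local[4]", "mllib-example")
    // Parse "label f1 f2 ..." lines into labeled points (path/format assumed)
    val points = sc.textFile("hdfs:///data/points.txt").map { line =>
      val nums = line.split(' ').map(_.toDouble)
      LabeledPoint(nums.head, nums.tail)
    }.cache()
    // Train a logistic regression model with 100 iterations of gradient descent
    val model = LogisticRegressionWithSGD.train(points, 100)
    // Score a new point
    val prediction = model.predict(Array(1.0, 0.5, -0.2))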

Tachyon

Tachyon is a fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. Tachyon achieves memory speed and fault tolerance by using memory aggressively and leveraging lineage information. It caches working-set files in memory, enabling different jobs, queries and frameworks to access cached files at memory speed. Tachyon thus avoids going to disk to load datasets that are frequently read.
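
Since Tachyon appears to frameworks as a filesystem URI, using it from Spark amounts to pointing a job at a tachyon:// path. A minimal sketch, assuming the Tachyon client is on the classpath and a master at the host and port shown:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "tachyon-example")
    // Read input through Tachyon rather than straight from disk; repeated
    // reads are then served from cluster memory (host/port/paths assumed)
    val lines = sc.textFile("tachyon://master:19998/logs/input.txt")
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    // Write results back through Tachyon so other frameworks can share them
    counts.saveAsTextFile("tachyon://master:19998/logs/wordcounts")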

Learn more at Strata

At Strata 2014 we will be hosting AMP Camp 4, with hands-on exercises (available online as well) to help people get started using BDAS and the exciting new features we have been developing.

You can register here for AMP Camp 4 and the 2014 Strata Conference. Use the code AMP20 when registering to get 20% off your ticket price. The conference offers one day of tutorials (Feb 11) and two days of presentations (Feb 12-13). Please make sure to select the tutorials day if you wish to join us at the AMP Camp.

Large scale data analysis made easier with SparkR

shivaram

R is a widely used statistical programming language that supports a variety of data analysis tasks through extension packages. In fact, a recent survey of data scientists showed that R is the most frequently used tool other than SQL databases. However, data analysis in R is limited: the runtime is single-threaded and can only process data sets that fit on a single machine.

In an effort to enable large scale data analysis from R, we have recently released SparkR. SparkR is an R package that provides a lightweight frontend for using Spark from R. SparkR allows users to create and transform RDDs in R and to interactively run jobs from the R shell on a Spark cluster. You can try out SparkR today by installing it from our GitHub repo.

Some of the key features of SparkR include:

RDDs as Distributed Lists: SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD. In addition to lapply, SparkR also allows closures to be applied to every partition using lapplyPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.

Serializing closures: SparkR automatically serializes the variables needed to execute a function on the cluster. For example, if you use some global variables in a function passed to lapply, SparkR will automatically capture those variables and copy them to the cluster.

Using existing R packages: SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster.

Putting these features together in R can be very powerful. For example, the code to compute logistic regression using gradient descent is listed below. In this example, we read a file from HDFS in parallel using Spark and run a user-defined gradient function in parallel using lapplyPartition. Finally, the weights from the different machines are accumulated using reduce.

    pointsRDD <- readMatrix(sc, "hdfs://myfile")
    # Initialize the weights randomly; D is the number of features
    weights <- runif(n = D, min = -1, max = 1)
    # Logistic gradient over one partition: column 1 is the label Y,
    # the remaining columns are the features X
    gradient <- function(partition) {
        Y <- partition[, 1]; X <- partition[, -1]
        t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
    }
    for (i in 1:10) {
        # Compute per-partition gradients in parallel, then sum them with "+"
        weights <- weights - reduce(lapplyPartition(pointsRDD, gradient), "+")
    }

Right now, SparkR works well for parallelizable algorithms like gradient descent, but it requires users to decide which parts of the algorithm can be run in parallel. In the future, we hope to provide direct access to large scale machine learning algorithms by integrating with Spark’s MLlib. More examples and details about SparkR can be found at http://amplab-extras.github.io/SparkR-pkg.

Top 5 most influential works in data science?

Patterson

As part of the Data-Driven Discovery Investigator Competition, the Gordon and Betty Moore Foundation asks applicants for

five references to the most influential work in data science in the applicant’s view. This is distinct from the bio-sketch references and will not be a factor in the Foundation’s decision-making. This information will help the Foundation better understand the influential ideas related to data-driven discovery and data science.

After talking to others in the lab, below is my list, sorted by number of citations according to Google Scholar. I’d love to hear comments on these and/or suggestions of others I missed.

  1. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., … & Grafham, D. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921. (16,000 citations)

The Human Genome Project turned the secret of life into digital information. On January 14, 2014, Illumina announced a new sequencing machine that can do the wet-lab processing of a genome for $1,000. This price is widely believed to be a tipping point, and soon millions will have their genomes sequenced. At 25 to 250 gigabytes per genome, genetics is now Big Data.

  2. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. (9,200 citations)

A simple, easy-to-use programming model to process Big Data. It led to the NoSQL movement, Hadoop, many startup companies, and awards for its authors.

  3. Blei, D., Ng, A., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022. (7,300 citations)

LDA allows sets of observations to be explained by unobserved groups. It spawned an entire industry of data-driven discovery for text and image corpora.

  4. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., & Zaharia, M. (2010). A view of cloud computing. Communications of the ACM, 53(4), 50-58. (5,800 citations)

At a time when there was confusion about what cloud computing was, this paper defined cloud computing, explained why it emerged when it did, and listed its challenges and opportunities.

  5. Stoughton, C., Lupton, R. H., Bernardi, M., Blanton, M. R., Burles, S., Castander, F. J., … & Carey, L. (2002). Sloan digital sky survey: Early data release. The Astronomical Journal, 123(1), 485. (2,100 citations)

Aided by computer scientist Jim Gray, astronomers made raw astronomical data available to a much wider community. It led to the crowd-sourcing of astronomy through projects like Galaxy Zoo, so that now anyone can help with astronomy research.