Collaboration + Open Source = Research Impact


The AMPLab was launched in 2011 and has roots going back to 2009 in the earlier RAD Lab project at Berkeley.   Throughout that time, we’ve had a steady stream of research results and have had a large presence in the top publishing venues in Database Systems, Computing Systems and Machine Learning.  However, in the past few months we’ve seen some real indicators that our work is having impact beyond the traditional expectations of a university-based research project.

Clearly, the Spark system, which was developed in the lab, is having a huge impact in the growing Big Data analytics ecosystem.   This week the 2nd Spark Summit is being held in San Francisco.   Tickets to the summit sold out early, with over 1000 attendees for the two-day event, and over 300 people signed up for an in-depth training session (based on our successful AMPCamp series) on the 3rd day.   Spark is now included in all the major Hadoop distributions, and is leading the way in many technical areas, including support for database queries (SQL-on-Hadoop), distributed machine learning, and large-scale graph processing.

Spark is one part of the larger Berkeley Data Analytics Stack (BDAS), which serves as a unifying framework for much of the research being done in the AMPLab. Students and researchers in the lab continue to expand, improve, and extend the capabilities of BDAS. Recent additions include the Tachyon in-memory file system, the BlinkDB approximate query processing platform, and even the SparkR interface that allows programs written in the popular statistics language R to run distributed across a Spark cluster. BDAS provides a research context for the varied projects going on in the lab, and gives students the opportunity to address a ready audience of potential users and collaborators. For example, the SparkR project started off as a class project, but took on a life of its own when some BDAS users found the code online and started using it.

A recent post on this blog by Dave Patterson describes another example of real-world impact that comes from the unique combination of collaboration across research domains and development of working systems used in the lab. When we started the lab several years ago, we identified genomics, and particularly cancer genomics, as an important application use case for BDAS, one that could have an impact on a complex and persistent societal problem. Dave’s motivation was the conviction that if genomics research was becoming increasingly data-intensive, then computer scientists focused on data analytics should be able to contribute. As you can read in the blog post, Spark-based code developed in this project has already had real impact, being used to help diagnose a rare life-threatening infectious disease in a young patient, much faster than had been done previously.

Of course, beyond the outsized impacts listed above, we continue to do what any good university research lab does, producing some of the top students graduating across all the fields we work in, and pushing the envelope on the research agendas of key areas such as Big Data analytics, cloud computing, and all things data. The research model developed at Berkeley over the past couple of decades emphasizes collaboration across domains and continuous development of working systems that embody the research ideas. In my experience, this combination makes for a vibrant and productive research and learning environment and also happens to make research a lot more fun.

SNAP Helps Save A Life


When we got started in genomics 3 years ago, I wrote an essay in the New York Times arguing that computer scientists have a lot to bring to the fight against deadly diseases like cancer. (This hypothesis was not universally heralded.)

The good news is that we have already had a success: SNAP, applied to infectious disease, helped save a life.

There are a number of patients in the hospital with undiagnosed inflammatory reactions and infectious symptoms for which existing tests find no identifiable source. A generic test to identify all organisms using RNA sequencing has been developed and is being piloted by Dr. Charles Chiu at UCSF. SNAP is critical to the implementation of this test, since it rapidly filters all human sequence from the large resulting datasets (up to 100 million sequences), enabling the identification of pathogenic sequence quickly enough to treat patients effectively. In the US 20,000 people, mostly children, are hospitalized each year with brain-swelling encephalitis, and 60% of the time doctors never find the cause, which can lead to serious consequences.
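The filtering step can be pictured with a toy example. The sketch below is not SNAP's algorithm or interface; it is a deliberately simplified illustration that assumes a read is "human" if it shares at least one exact k-mer with a host reference string. The function names, the parameters `k` and `min_shared`, and the sequences are all made up for illustration. A real aligner also handles mismatches, reverse complements, and genome-scale indexes.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_host_reads(reads, host_reference, k=8, min_shared=1):
    """Keep only reads sharing fewer than min_shared k-mers with the host.

    Hypothetical helper: exact k-mer matching stands in for real alignment.
    """
    host_index = kmers(host_reference, k)  # built once, reused for every read
    kept = []
    for read in reads:
        shared = len(kmers(read, k) & host_index)
        if shared < min_shared:
            kept.append(read)
    return kept


# Made-up sequences, far shorter than real reads or genomes.
host = "ACGTACGTTAGCCGATAGGCTTAACGGATCCGATTACA"
reads = [
    "TAGCCGATAGGC",      # taken from the host string, so it is dropped
    "GGGGCCCCAAAATTTT",  # shares no 8-mer with the host, so it is kept
    "CTTAACGGATCC",      # taken from the host string, so it is dropped
]

candidates = filter_host_reads(reads, host)
print(candidates)
```

Only the reads that survive this step need the slower search against pathogen databases, which is why the speed of the host filter matters so much for turnaround time.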

Joshua Osborn, now 15, outside his home in Cottage Grove, Wis


This tool was recently used to successfully diagnose and treat Joshua Osborn, a teenager with severe combined immunodeficiency who lives in Wisconsin. He went to the hospital repeatedly, and was eventually hospitalized for 5 weeks without a successful diagnosis. He developed brain seizures, so he was placed in a medically induced coma. In desperation, his doctors asked his parents to try one more experimental test. His parents agreed, so the doctors sampled his spinal fluid and sent it to UCSF for sequencing and analysis.

The speed and accuracy of SNAP helped UCSF to quickly filter out the human reads. In just 2 days they identified a rare infectious bacterium, which was only 0.02% of the original 3M reads. The boy was then treated with antibiotics for 10 days; he awoke and was discharged from the hospital 4 weeks later. Although our software is only part of this process, without SNAP it would not be possible to perform a general infectious disease test like this without first guessing the causative agent. That’s why tests such as this are not yet more broadly available.

Quoting from the UCSF press release, referring indirectly in part to the speed and accuracy of SNAP:

“This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete,” Chiu said.

The abstract and last paragraph from the NEJM article tell the story with more medical accuracy and brevity:

A 14-year-old boy with severe combined immunodeficiency presented three times to a medical facility over a period of 4 months with fever and headache that progressed to hydrocephalus and status epilepticus necessitating a medically induced coma. Diagnostic workup including brain biopsy was unrevealing. Unbiased next-generation sequencing of the cerebrospinal fluid identified 475 of 3,063,784 sequence reads (0.016%) corresponding to leptospira infection. Clinical assays for leptospirosis were negative. Targeted antimicrobial agents were administered, and the patient was discharged home 32 days later with a status close to his premorbid condition. Polymerase-chain-reaction (PCR) and serologic testing at the Centers for Disease Control and Prevention (CDC) subsequently confirmed evidence of Leptospira santarosai infection.

… In summary, unbiased next-generation sequencing coupled with a rapid bioinformatics pipeline provided a clinically actionable diagnosis of a specific infectious disease from an uncommon pathogen that eluded conventional testing for months after the initial presentation. This approach thus facilitated the use of targeted and efficacious antimicrobial therapy.
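The figures quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the numbers from the press release and the abstract; the variable names are ours, and the speedup is a lower bound because the press release says the older software would have taken "24 hours or more".

```python
# Fraction of reads attributed to the pathogen (numbers from the NEJM abstract).
pathogen_reads = 475
total_reads = 3_063_784
fraction_percent = 100 * pathogen_reads / total_reads
print(f"{fraction_percent:.3f}% of reads matched the pathogen")

# Speedup implied by the UCSF press release: 96 minutes versus 24 hours or more.
new_minutes = 96
previous_minutes = 24 * 60
speedup = previous_minutes / new_minutes
print(f"at least {speedup:.0f}x faster than the previous software")
```

The first figure rounds to the 0.016% reported in the abstract, and the second works out to a speedup of at least 15x.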

In a separate communication, Chiu said that in the United States there are about 15,000 cases a year of brain-swelling encephalitis with 2,000 deaths, and more than 70% of the deaths underdiagnosed. Assuming doctors are able to obtain actionable diagnoses from the information, SNAP plus the software developed at UCSF to identify the non-human reads (SURPI) could potentially save the lives of hundreds of encephalitis patients annually in the US alone. Worldwide, encephalitis is a huge problem; there are probably about 70,000 diagnosed cases a year with 25,000 deaths. Even 25,000 is certainly a gross underestimate because most cases worldwide are in rural areas and do not receive hospital care.

Here are links to

  • the New York Times article,
  • the press release from UCSF,
  • the New England Journal of Medicine article that describes the boy’s treatment, and
  • the technical paper in Genome Research that describes the UCSF work that discovered the disease, which talks about the use of SNAP and SURPI and cites our paper.