Collaboration + Open Source = Research Impact

Michael Franklin

The AMPLab was launched in 2011 and has roots going back to 2009 in the earlier RAD Lab project at Berkeley.   Throughout that time, we’ve had a steady stream of research results and have had a large presence in the top publishing venues in Database Systems, Computing Systems and Machine Learning.  However, in the past few months we’ve seen some real indicators that our work is having impact beyond the traditional expectations of a university-based research project.

Clearly, the Spark system, which was developed in the lab, is having a huge impact in the growing Big Data analytics ecosystem.   This week the 2nd Spark Summit is being held in San Francisco.   Tickets to the summit sold out early, with over 1000 attendees for the two-day event, and over 300 people signed up for an in-depth training session (based on our successful AMPCamp series) on the 3rd day.   Spark is now included in all the major Hadoop distributions, and is leading the way in many technical areas, including support for database queries (SQL-on-Hadoop), distributed machine learning, and large-scale graph processing.

Spark is one part of the larger Berkeley Data Analytics Stack (BDAS), which serves as a unifying framework for much of the research being done in the AMPLab. Students and researchers in the lab continue to expand, improve, and extend the capabilities of BDAS. Recent additions include the Tachyon in-memory file system, the BlinkDB approximate query processing platform, and even the SparkR interface that allows programs written in the popular statistics language R to run distributed across a Spark cluster. BDAS provides a research context for the varied projects going on in the lab, and gives students the opportunity to address a ready audience of potential users and collaborators. For example, the SparkR project started off as a class project, but took on a life of its own when some BDAS users found the code online and started using it.

A recent post on this blog by Dave Patterson describes another example of real-world impact that comes from the unique combination of collaboration across research domains and development of working systems used in the lab. When we started the lab several years ago, we identified genomics, and particularly cancer genomics, as an important application use case for BDAS, one that could have an impact on a complex and persistent societal problem. Dave’s motivation was the conviction that if genomics research was becoming increasingly data-intensive, then computer scientists focused on data analytics should be able to contribute. As you can read in the blog post, Spark-based code developed in this project has already had real impact, being used to help diagnose a rare life-threatening infectious disease in a young patient, much faster than had been done previously.

Of course, beyond the outsized impacts listed above, we continue to do what any good university research lab does, producing some of the top students graduating across all the fields we work in, and pushing the envelope on the research agendas of key areas such as Big Data analytics, cloud computing, and all things data. The research model developed at Berkeley over the past couple of decades emphasizes collaboration across domains and continuous development of working systems that embody the research ideas. In my experience, this combination makes for a vibrant and productive research and learning environment and also happens to make research a lot more fun.