A Snapshot of Database Research in the AMPLab

Michael Franklin

*** SPOILER ALERT:  If you are an ACM SIGMOD reviewer and are concerned about anonymous “double blind” reviewing, stop reading here***

As the dust settles after a hectic ACM SIGMOD paper submission deadline, I found myself reflecting on the variety, scope and overall coolness of the Database Systems work that has been coming out of the lab at an increasing pace in recent months.  Topics include:

  • scalable query processing and machine learning in (what used to be called) NoSQL systems,
  • crowdsourced data management,
  • consistency and performance guarantees in multi-tenant and multi-data-center environments, and
  • low latency processing of streaming and high-velocity data.

The activity level during the lead-up to the SIGMOD deadline (which, I’m sure, will correlate with a spike in our Cloud Computing bills) was fantastic.  It reminded me of my own grad school days, when a deadline meant a week of all-nighters and a lot of drama for a big group of us on the top floor of the CS building in Madison.  In the AMPLab, we’ve routinely had flurries of activity around deadlines for major Systems and Machine Learning conferences.  This was the biggest push we’ve had yet in the Database area, and the commitment of the students involved was truly remarkable.

If you don’t follow such things, SIGMOD is one of the top international conferences in the Database Systems research field – it requires that papers be anonymized.  Ultimately, fewer than 1 in 5 papers will be accepted and the results aren’t announced until February and April, with the conference being held in June.  It’s a bit frustrating and (I believe) counter-productive to have such a long “quiet period” in a fast-moving field like ours, but those are the rules the community has chosen to play by – at least for the time being.

In any case, I wanted to take the opportunity to give an overview of some of our current database systems work, as it is happening.  Below, I list topics and say a bit about some of the specific projects.  I should mention that while the AMPLab was the hub of this activity, much of this work was done in collaboration with friends and colleagues in industry and academia from the East Coast,  Europe, and Asia.

New Architectures for High Performance Databases

Much has been written in the past couple of years about the relative advantages of traditional SQL-based parallel database systems and map-reduce (MR)-based platforms (that may or may not run SQL).  The poor performance of Hadoop MR in many important scenarios has led many observers to conclude, incorrectly, that such limitations are somehow fundamental to the MR paradigm.  Recent work on the Shark system has shown how to get MPP database-level performance on an HDFS and MR (using Spark, not Hadoop MR) platform, thereby retaining the fault-tolerance and flexibility benefits of those platforms.

Another project in the lab is exploring policies and implementation techniques for aging cold data from in-memory database systems to disk or other secondary storage.  The idea is to exploit access patterns and application-level semantics in order to maximize the benefit of an in-memory approach.

Performance and Consistency Guarantees with Multitenancy and Big Data

Cloud computing and hosted solutions raise a number of challenges for data system design and several projects in the lab are addressing these challenges.   The PIQL project defined the notion of “Scale Independence” as a core concept for enabling database designs to scale over many orders of magnitude in terms of data size, user population and overall application popularity.

Another source of performance and predictability concerns in modern hosted data environments is multitenancy, where computing and storage resources are shared across a heterogeneous and constantly changing set of separate data-intensive applications.  Among the many challenges of multi-tenant environments is that of efficient placement of data and processing, particularly when fault tolerance is required (and when isn’t it?).  The Mesos system addresses such issues for processing and storage.  In other work, we are now moving up the stack, addressing multitenancy for long-lived data-intensive applications by focusing on tenant placement algorithms in large clusters.

The Durability of Consistency

After years of hearing about how consistency was old-fashioned, we are now seeing renewed interest in understanding consistency issues and finding ways to get consistency guarantees when they are needed, whether data is spread across racks of hodge-podge machines or across continents.  Recent work in the MDCC project has shown how to “make the common case fast” by carefully designing a new consistency protocol.  MDCC has also developed a new programming interface to transactional systems that gives application developers much more information and control, allowing them to build more dynamic and adaptive applications.

The PBS project has also made quite a splash, by proposing a probabilistic view of consistency and showing how various design decisions impact the likelihood that an eventually consistent system will demonstrate anomalous behavior.
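
To give a flavor of what a probabilistic view of consistency can look like, here is a minimal sketch of my own, not the PBS project's model.  It assumes a deliberately simplified setting: N replicas, a write acknowledged by W of them, a read that contacts R of them, quorums chosen uniformly at random, and no further propagation of the write before the read arrives.  The function name and parameters are hypothetical, chosen only for illustration.

```python
from math import comb


def stale_read_probability(n: int, r: int, w: int) -> float:
    """Probability that a read quorum misses every replica holding the latest write.

    Simplifying assumptions (illustrative only, not the PBS model itself):
    the write has reached exactly w of n replicas, the read contacts r
    replicas chosen uniformly at random, and nothing propagates in between.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("quorum sizes must be between 1 and n")
    # Count the read quorums drawn entirely from the n - w replicas that
    # have not seen the write, out of all possible read quorums.
    # comb() returns 0 when r > n - w, which is the overlapping-quorum case.
    return comb(n - w, r) / comb(n, r)


if __name__ == "__main__":
    for r, w in [(1, 1), (1, 2), (2, 2)]:
        p = stale_read_probability(3, r, w)
        print(f"N=3 R={r} W={w}: stale read probability = {p:.3f}")
```

Even this toy version shows how a design decision such as the choice of R and W changes the likelihood of anomalous behavior: when R + W exceeds N the quorums must overlap and the probability drops to zero, while smaller quorums trade that guarantee for lower latency.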

Crowdsourced Query and Data Wrangling

With all the advances in scalable data management and machine learning happening in the lab, it’s easy to lose sight of the People part of the AMP agenda.  Fortunately, a number of ongoing projects involving crowds and human computation have started to bear fruit.  Current projects include advanced work on building hybrid human/machine entity resolution systems, the integration of crowdsourcing and active learning techniques, and a new architecture for having people help discover the underlying structure of queries and data.

Analytics and Learning over Streaming Data

We’ve also got a bunch of really interesting stuff going on around streaming data (another favorite topic of mine), but this will have to wait for another post.

The intent of this post was not to be comprehensive, but rather to give an idea of the range of data management work going on in the lab.  Hopefully others will post updates on other projects and areas of interest.  And of course, we will be discussing all of this work and more at our upcoming Winter Research Retreat at Lake Tahoe in January.