MLbase: A User-friendly System for Distributed Machine Learning


Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming—many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives.

To address these issues we are building MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.

MLbase will ultimately provide functionality to end users for a wide variety of common machine learning tasks: classification, regression, collaborative filtering, and more general exploratory data analysis techniques such as dimensionality reduction, feature selection and data visualization. Moreover, MLbase provides a natural platform for ML researchers to develop novel methods for these tasks. Our vision paper describing MLbase has been accepted to the Conference on Innovative Data Systems Research (CIDR), and we will present MLbase at the NIPS workshop on Big Learning in December and at CIDR in January.  We also have several demos planned in the upcoming months. Please visit our project website for more details and/or to contact us.
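To make the declarative idea concrete, here is a hypothetical sketch of what an MLbase-style request might look like (all names and APIs below are invented for illustration; the actual interface is described in our vision paper). The user states *what* they want, e.g. "classify this data", and a tiny stand-in "optimizer" picks among candidate learners by holdout accuracy, hiding the algorithm choice from the user:

```python
def majority_model(X, y):
    """Baseline: always predict the most common training label."""
    label = max(set(y), key=y.count)
    return lambda x: label

def nearest_centroid_model(X, y):
    """Predict the label whose class centroid is closest to the point."""
    centroids = {}
    for label in set(y):
        pts = [x for x, yy in zip(X, y) if yy == label]
        centroids[label] = [sum(dim) / len(pts) for dim in zip(*pts)]
    def predict(x):
        return min(centroids, key=lambda l: sum((a - b) ** 2
                                                for a, b in zip(x, centroids[l])))
    return predict

def do_classify(X, y):
    """Declarative entry point: the 'optimizer' trains each candidate
    learner and keeps whichever scores best on a holdout split."""
    split = int(0.7 * len(X))
    train_X, train_y = X[:split], y[:split]
    test_X, test_y = X[split:], y[split:]
    def accuracy(model):
        return sum(model(x) == yy for x, yy in zip(test_X, test_y)) / len(test_y)
    candidates = [majority_model(train_X, train_y),
                  nearest_centroid_model(train_X, train_y)]
    return max(candidates, key=accuracy)
```

A caller would simply write `model = do_classify(X, y)` and never see which learner was chosen; the real system additionally adapts this choice dynamically and executes it at scale.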

The AMPLab is Hiring!


Open-source has been a part of Berkeley culture since the 1970s, when Bill Joy assembled the original Berkeley Software Distribution (BSD).  As a reader of this blog, you probably know first-hand the time and effort it takes to create quality open-source software.

Over the last year, the lab has seen exciting growth in the number of users and contributors. In order to keep code quality high, I’ve been hired to build a team of full-time engineers. We have two software engineering positions open immediately. Both positions require strong Linux skills and familiarity with EC2 and git. One position requires experience with one or more of Scala, Java, C++, Hadoop, Hive and NoSQL databases, while the other will focus on automation, where knowledge of scripting, Maven, Jenkins, and rpm/deb packaging is important.

Berkeley has been good to me and I’m certain Berkeley will be good to you.

I come to Berkeley from Cloudera where I worked as a member of the Engineering team during the company’s first four formative years. I met the Cloudera founders through my contacts at Berkeley. In the past, I worked as a Software Engineer at Berkeley doing Grid and cluster research and founded the Ganglia project: a popular monitoring system installed on millions of computers world-wide. I credit the success of Ganglia to the Berkeley focus on long-term design and open-source collaboration.

The AMPLab is an open and collaborative space that has a startup feel. The faculty in the lab have shunned private offices to be more accessible and engaged. You’ll work side-by-side with graduate students who are such prolific designers and engineers that it’s easy to forget they’re working on a Ph.D. and have a full course load. The lab has been an incubator for an impressive array of software projects, including Spark, Shark, Apache Mesos and SNAP, to name a few.

As a member of the team, you’ll get an inside look at the new Big Data innovations our sponsors are working on. Additionally, Silicon Valley startups and companies regularly come to Berkeley for tech talks. You’ll not only be informed and intellectually stimulated; you’ll also have a closet full of the latest tech T-shirts.

The lab draws support from a five-year, $10M NSF “Expeditions in Computing” program grant, announced by the White House as part of its “Big Data” research initiative; a 4.5-year, $5M DARPA XData contract; and over $7M (to date) from industry sources. These investments show the faith that both the private and public sectors have in the AMPLab to build a comprehensive software stack to meet the new Big Data challenges.

How to Apply

If you apply soon, you’ll be able to join us for our winter retreat at Squaw Valley Ski Resort in January. We’re staying slope-side so bring your snowboard or skis!

Visit jobs.berkeley.edu and search for “AMPLab” in the Basic Job Search form to find the positions. Please feel free to contact me at massie@cs.berkeley.edu, if you have any questions or issues applying.

A Snapshot of Database Research in the AMPLab


*** SPOILER ALERT:  If you are an ACM SIGMOD reviewer and are concerned about anonymous “double blind” reviewing, stop reading here***

As the dust settles after a hectic ACM SIGMOD paper submission deadline, I found myself reflecting on the variety, scope and overall coolness of the Database Systems work that has been coming out of the lab at an increasing pace in recent months.  Topics include:

  • scalable query processing and machine learning in (what used to be called) NoSQL systems,
  • crowdsourced data management,
  • consistency and performance guarantees in multi-tenant and multi-data-center environments, and
  • low latency processing of streaming and high-velocity data.

The activity level during the lead-up to the SIGMOD deadline (which I’m sure will correlate with a spike in our cloud computing bills) was fantastic – it reminded me of my own grad school days, when the deadline was the cause for a week of all-nighters and a lot of drama for a big group of us on the top floor of the CS building in Madison.  In the AMPLab, we’ve routinely had flurries of activity around deadlines for major Systems and Machine Learning conferences.  This was the biggest push we’ve had yet in the Database area, and the commitment of the students involved was truly remarkable.

If you don’t follow such things, SIGMOD is one of the top international conferences in the database systems research field, and it requires that submissions be anonymized (hence the spoiler alert above).  Ultimately, fewer than one in five papers will be accepted; the results aren’t announced until February and April, with the conference itself held in June.  It’s a bit frustrating and (I believe) counter-productive to have such a long “quiet period” in a fast-moving field like ours, but those are the rules the community has chosen to play by, at least for the time being.

In any case, I wanted to take the opportunity to give an overview of some of our current database systems work, as it is happening.  Below, I list topics and say a bit about some of the specific projects.  I should mention that while the AMPLab was the hub of this activity, much of this work was done in collaboration with friends and colleagues in industry and academia from the East Coast,  Europe, and Asia.

New Architectures for High Performance Databases

Much has been written in the past couple of years about the relative advantages of traditional SQL-based parallel database systems and map-reduce (MR)-based platforms (which may or may not run SQL).  The poor performance of Hadoop MR in many important scenarios has caused many observers to conclude, incorrectly, that such limitations are somehow fundamental to the MR paradigm.  Recent work on the Shark system has shown how to get MPP-database-level performance on a platform built on HDFS and MR (using Spark rather than Hadoop MR), thereby retaining the fault-tolerance and flexibility benefits of those platforms.

Another project in the lab is exploring policies and implementation techniques for aging cold data from in-memory database systems to disk or other secondary storage.  The idea is to exploit access patterns and application-level semantics in order to maximize the benefit of an in-memory approach.

Performance and Consistency Guarantees with Multitenancy and Big Data

Cloud computing and hosted solutions raise a number of challenges for data system design and several projects in the lab are addressing these challenges.   The PIQL project defined the notion of “Scale Independence” as a core concept for enabling database designs to scale over many orders of magnitude in terms of data size, user population and overall application popularity.

Another source of performance and predictability concerns in modern hosted data environments is multitenancy, where computing and storage resources are shared across a heterogeneous and constantly changing set of separate data-intensive applications.   Among the many challenges of multi-tenant environments is that of efficient placement of data and processing, particularly when fault tolerance is required (and when isn’t it?).  The Mesos system addresses such issues for processing and storage.   In other work we are now moving up the stack, addressing multi-tenancy for long-lived data-intensive applications by focusing on tenant placement algorithms in large clusters.
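To give a flavor of the placement problem (this is a generic textbook-style heuristic, not the algorithm any of these projects actually use): a simple greedy scheme assigns each tenant, heaviest first, to the least-loaded machines, while keeping a tenant’s replicas on distinct machines for fault tolerance.

```python
def place_tenants(tenants, num_machines, replicas=2):
    """tenants: dict mapping tenant name -> load estimate.
    Returns a dict mapping machine index -> list of tenants placed there."""
    load = [0.0] * num_machines
    placement = {m: [] for m in range(num_machines)}
    # Place heavy tenants first (classic greedy bin-packing heuristic).
    for name, demand in sorted(tenants.items(), key=lambda t: -t[1]):
        # Pick the `replicas` least-loaded machines; since they are distinct
        # indices, no two replicas of one tenant share a machine.
        targets = sorted(range(num_machines), key=lambda m: load[m])[:replicas]
        for m in targets:
            load[m] += demand
            placement[m].append(name)
    return placement

print(place_tenants({"a": 3.0, "b": 1.0, "c": 2.0}, num_machines=4))
```

The research questions, of course, begin where this sketch ends: real placement must account for workload shifts over time, correlated failures, and the cost of moving a live tenant.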

The Durability of Consistency

After years of hearing about how consistency was old fashioned, we are now seeing renewed interest in understanding consistency issues and finding ways to get consistency guarantees when they are needed, whether data is spread across racks of hodge-podge machines or across continents.   Recent work in the MDCC project has shown how to “make the common case fast” by carefully designing a new consistency protocol.   MDCC has also developed a new programming interface to transactional systems that gives application developers much more information and control, allowing them to build more dynamic and adaptive applications.

The PBS project has also made quite a splash, by proposing a probabilistic view of consistency and showing how various design decisions impact the likelihood that an eventually consistent system will demonstrate anomalous behavior.
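A toy back-of-the-envelope model conveys the flavor of this probabilistic view (the assumptions here are mine and are far simpler than the PBS model, which accounts for replication latency over time): with N replicas, a write acknowledged by W of them, and a read that contacts a uniformly random set of R replicas, the read can miss the write entirely only when R + W ≤ N, and the chance of that is a simple counting exercise.

```python
from math import comb

def p_stale(n, r, w):
    """Probability that a read of R random replicas misses all W replicas
    that acknowledged the latest write (replicas chosen uniformly)."""
    if r + w > n:          # strict quorum: read and write sets must intersect
        return 0.0
    return comb(n - w, r) / comb(n, r)

print(p_stale(3, 1, 1))  # N=3, R=W=1: a 2/3 chance the read misses the write
```

Even this crude model shows why partial quorums are tempting: the anomaly probability can be small enough in practice to justify the latency savings, which is precisely the trade-off PBS quantifies rigorously.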

Crowdsourced Query and Data Wrangling

With all the advances in scalable data management and machine learning happening in the lab, it’s easy to lose sight of the People part of the AMP agenda.   Fortunately, a number of ongoing projects involving crowds and human computation have started to bear fruit.   Current projects include advanced work on building hybrid human/machine entity resolution systems, the integration of crowdsourcing and active learning techniques, and a new architecture for having people help discover the underlying structure of queries and data.

Analytics and Learning over Streaming Data

We’ve also got a bunch of really interesting stuff going on around streaming data (another favorite topic of mine), but this will have to wait for another post.

The intent of this post was not to be comprehensive, but rather to give an idea of the range of data management work going on in the lab.   Hopefully others will post updates on other projects and areas of interest.  And of course, we will be discussing all of this work and more at our upcoming Winter Research Retreat at Lake Tahoe in January.