MLbase: A User-friendly System for Distributed Machine Learning


Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming—many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives.

To address these issues we are building MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
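As a rough illustration of points (1) and (2), here is a toy Python sketch of how a declarative task specification might be mapped to an algorithm choice. All names and heuristics here are entirely hypothetical; MLbase's actual declarative language and optimizer are far richer.

```python
# Toy sketch of a declarative ML interface in the spirit of MLbase.
# The task says only WHAT is wanted; a (trivial) optimizer decides HOW.
# Names and heuristics are illustrative, not MLbase's actual API.

def optimize(task, n_features):
    """Pick a learning algorithm for a declaratively specified task."""
    if task == "classify":
        # e.g. prefer a linear model when the feature space is wide
        return "logistic_regression" if n_features > 100 else "decision_tree"
    if task == "collaborative_filter":
        return "matrix_factorization"
    raise ValueError("unknown task: " + task)

plan = optimize("classify", n_features=10000)
print(plan)  # logistic_regression
```

The point of the real optimizer is that this choice (and its parameters) can be adapted dynamically as the system learns more about the data.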

MLbase will ultimately provide functionality to end users for a wide variety of common machine learning tasks: classification, regression, collaborative filtering, and more general exploratory data analysis techniques such as dimensionality reduction, feature selection and data visualization. Moreover, MLbase provides a natural platform for ML researchers to develop novel methods for these tasks. Our vision paper describing MLbase has been accepted to the Conference on Innovative Data Systems Research (CIDR), and we will present MLbase at the NIPS workshop on Big Learning in December and at CIDR in January.  We also have several demos planned in the upcoming months. Please visit our project website for more details and/or to contact us.

The AMPLab is Hiring!


Open-source has been a part of Berkeley culture since the 1970’s when Bill Joy assembled the original Berkeley Software Distribution (BSD).  As a reader of this blog, you probably know first-hand the time and effort it takes to create quality open-source software.

Over the last year, the lab has seen exciting growth in the number of users and contributors. In order to keep code quality high, I’ve been hired to build a team of full-time engineers. We have two software engineering positions open immediately. Both positions require strong Linux skills and familiarity with EC2 and git. One position requires experience with one or more of Scala, Java, C++, Hadoop, Hive, and NoSQL databases, while the other position will focus on automation, where knowledge of scripting, Maven, Jenkins, and rpm/deb packaging is important.

Berkeley has been good to me and I’m certain Berkeley will be good to you.

I come to Berkeley from Cloudera where I worked as a member of the Engineering team during the company’s first four formative years. I met the Cloudera founders through my contacts at Berkeley. In the past, I worked as a Software Engineer at Berkeley doing Grid and cluster research and founded the Ganglia project: a popular monitoring system installed on millions of computers world-wide. I credit the success of Ganglia to the Berkeley focus on long-term design and open-source collaboration.

The AMPLab is an open and collaborative space that has a startup feel. The faculty in the lab have shunned private offices to be more accessible and engaged. You’ll work side-by-side with graduate students who are such prolific designers and engineers that it’s easy to forget they’re working on a Ph.D. and have a full course load. The lab has been an incubator for an impressive array of software projects like Spark, Shark, Apache Mesos and SNAP, just to name a few.

As a member of the team, you’ll get an inside look at the new Big Data innovations our sponsors are working on. Additionally, Silicon Valley startups and companies regularly come to Berkeley for tech talks. You’ll not only be informed and intellectually stimulated; you’ll also have a closet full of the latest tech T-shirts.

The lab draws support from a five-year, $10M NSF “Expeditions in Computing” grant, announced by the White House as part of its “Big Data” research initiative; a 4.5-year, $5M DARPA XData contract; and over $7M (to date) from industry sources. These investments show the faith that both the private and public sectors have in the AMPLab to build a comprehensive software stack to meet the new Big Data challenges.

How to Apply

If you apply soon, you’ll be able to join us for our winter retreat at Squaw Valley Ski Resort in January. We’re staying slope-side so bring your snowboard or skis!

Visit jobs.berkeley.edu and search for “AMPLab” in the Basic Job Search form to find the positions. Please feel free to contact me at massie@cs.berkeley.edu, if you have any questions or issues applying.

A Snapshot of Database Research in the AMPLab


*** SPOILER ALERT:  If you are an ACM SIGMOD reviewer and are concerned about anonymous “double blind” reviewing, stop reading here***

As the dust settles after a hectic ACM SIGMOD paper submission deadline, I found myself reflecting on the variety, scope and overall coolness of the Database Systems work that has been coming out of the lab at an increasing pace in recent months.  Topics include:

  • scalable query processing and machine learning in (what used to be called) NoSQL systems,
  • crowdsourced data management,
  • consistency and performance guarantees in multi-tenant and multi-data-center environments, and
  • low latency processing of streaming and high-velocity data.

The activity level during the lead-up to the SIGMOD deadline (which will correlate with a spike in our Cloud Computing bills – I’m sure)  was fantastic – it reminded me of my own grad school days – when the deadline was the cause for a week of all-nighters and a lot of drama for a big group of us on the top floor of the CS building in Madison.  In the AMPLab, we’ve routinely had flurries of activity around deadlines for major Systems and Machine Learning conferences.   This was the biggest push we’ve had yet in the Database area and the commitment of the students involved was truly remarkable.

If you don’t follow such things, SIGMOD is one of the top international conferences in the Database Systems research field – it requires that papers be anonymized.  Ultimately, fewer than 1 in 5 papers will be accepted and the results aren’t announced until February and April, with the conference being held in June.  It’s a bit frustrating and (I believe) counter-productive to have such a long “quiet period” in a fast-moving field like ours, but those are the rules the community has chosen to play by – at least for the time being.

In any case, I wanted to take the opportunity to give an overview of some of our current database systems work, as it is happening.  Below, I list topics and say a bit about some of the specific projects.  I should mention that while the AMPLab was the hub of this activity, much of this work was done in collaboration with friends and colleagues in industry and academia from the East Coast,  Europe, and Asia.

New Architectures for High Performance Databases

Much has been written in the past couple years about the relative advantages of traditional SQL-based parallel database systems and map-reduce (MR)-based platforms (that may or may not run SQL).  The poor performance of Hadoop MR in many important scenarios has caused many observers to incorrectly conclude that such limitations were somehow fundamental to the MR paradigm.   Recent work on the Shark system has shown how to get MPP database-level performance on an HDFS and MR (using Spark, not Hadoop MR) platform, thereby retaining the fault-tolerance and flexibility benefits of those platforms.

Another project in the lab is exploring policies and implementation techniques for aging cold data from in-memory database systems to disk or other secondary storage.  The idea is to exploit access patterns and application-level semantics in order to maximize the benefit of an in-memory approach.

Performance and Consistency Guarantees with Multitenancy and Big Data

Cloud computing and hosted solutions raise a number of challenges for data system design and several projects in the lab are addressing these challenges.   The PIQL project defined the notion of “Scale Independence” as a core concept for enabling database designs to scale over many orders of magnitude in terms of data size, user population and overall application popularity.

Another source of performance and predictability concerns in modern hosted data environments is multitenancy, where computing and storage resources are shared across a heterogeneous and constantly changing set of separate data-intensive applications.   Among the many challenges of multi-tenant environments is that of efficient placement of data and processing, particularly when fault tolerance is required (and when isn’t it?).  The Mesos system addresses such issues for processing and storage.   In other work we are now moving up the stack, addressing multi-tenancy for long-lived data-intensive applications by focusing on tenant placement algorithms in large clusters.

The Durability of Consistency

After years of hearing about how consistency was old fashioned, we are now seeing renewed interest in understanding consistency issues and finding ways to get consistency guarantees when they are needed, whether data is spread across racks of hodge-podge machines or across continents.   Recent work in the MDCC project has shown how to “make the common case fast” by carefully designing a new consistency protocol.   MDCC has also developed a new programming interface to transactional systems that gives application developers much more information and control, allowing them to build more dynamic and adaptive applications.

The PBS project has also made quite a splash, by proposing a probabilistic view of consistency and showing how various design decisions impact the likelihood that an eventually consistent system will demonstrate anomalous behavior.
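The flavor of that probabilistic view can be sketched with a simplified partial-quorum model. This is only an illustration of the idea, not PBS itself, which also models write propagation and staleness over time: with N replicas and read/write quorums chosen at random, a read is guaranteed fresh only when the read quorum intersects the last write quorum.

```python
from math import comb

def p_consistent(n, r, w):
    """Probability that a random read quorum of size r intersects a
    random write quorum of size w among n replicas. Simplified model:
    PBS itself also accounts for replication timing, not just overlap."""
    if r + w > n:
        return 1.0  # strict quorums always intersect
    return 1.0 - comb(n - w, r) / comb(n, r)

# With n=5, r=1, w=1 (a common "eventually consistent" setting),
# a read hits the freshest replica only about 20% of the time:
print(p_consistent(5, 1, 1))  # ~0.2
```

With R + W ≤ N there is a quantifiable chance of a stale read, and characterizing how design choices move that probability is exactly the regime PBS studies.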

Crowdsourced Query and Data Wrangling

With all the advances in scalable data management and machine learning happening in the lab, it’s easy to lose sight of the People part of the AMP agenda.   Fortunately, a number of on-going projects involving crowds and human computation have started to bear fruit.   Current projects include advanced work on building hybrid human/machine entity resolution systems, the integration of crowdsourcing and active learning techniques, and a new architecture for having people help discover the underlying structure of queries and data.

Analytics and Learning over Streaming Data

We’ve also got a bunch of really interesting stuff going on around streaming data (another favorite topic of mine), but this will have to wait for another post.

The intent of this post was not to be comprehensive, but rather to give an idea of the range of data management work going on in the lab.   Hopefully others will post updates on other projects and areas of interest.  And of course, we will be discussing all of this work and more at our upcoming Winter Research Retreat at Lake Tahoe in January.


Low-Latency SQL at Scale: A Performance Analysis of Shark


One of the AMP Lab’s goals is to bring interactive performance, similar to that available in smaller-scale databases, to Big Data systems, while retaining the flexibility and scalability that makes them compelling. To this end, we have developed Shark, a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

Our new tech report, Shark: SQL and Rich Analytics at Scale, describes Shark’s architecture and demonstrates how Shark executes SQL queries from 10 to over 100 times faster than Apache Hive, approaching the performance of parallel RDBMSs, while offering improved fault-tolerance, schema flexibility, and machine-learning integration.

Shark outperforms Hive/Hadoop on both disk- and memory-based datasets. As a sample result, we present Shark’s performance versus Hive/Hadoop on two real SQL queries from an early industrial user and one iteration of a logistic regression classification algorithm (results are for a 100-node EC2 cluster):

[Figure: Shark performance graph – query time in seconds for the two industrial SQL queries and the logistic regression iteration, Shark vs. Hive/Hadoop]

The full report features several other benchmarks, including micro-benchmarks using TPC-H data and real queries from the aforementioned industrial user.
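For context on the logistic regression workload, one “iteration” is a full gradient step over the dataset. A single-machine Python sketch of that step (the benchmark runs the equivalent map-and-sum in parallel across the 100-node cluster):

```python
from math import exp

def lr_iteration(data, w, step=0.1):
    """One gradient-descent step of logistic regression.
    data: list of (features, label) pairs with label in {0, 1}.
    In Shark/Spark the per-point gradient computation is distributed;
    here it runs on a single machine for illustration."""
    grad = [0.0] * len(w)
    for x, y in data:
        margin = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + exp(-margin))          # predicted probability
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj             # accumulate gradient
    return [wi - step * g for wi, g in zip(w, grad)]

data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w = lr_iteration(data, [0.0, 0.0])
print(w)  # first weight moves up, second moves down
```

Because each iteration rescans the same data, keeping that data in memory (as Spark does) is what makes the iterative workload so much faster than Hadoop MR.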

Several innovative optimizations are behind these performance results, including

  • In-Memory Columnar Storage, which stores data in a compressed, memory-efficient format,
  • Partial DAG Execution, a technique that enables several runtime optimizations, including join algorithm selection and skew handling,
  • Fine-grained fault-tolerance, which allows Shark to recover from mid-query failures within seconds using Spark’s Resilient Distributed Dataset (RDD) abstraction, and
  • Efficient Parallel Joins, which are necessary when joining two or more large tables.
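As a toy illustration of the first point, run-length encoding is one simple columnar compression scheme (the full report details the formats Shark actually uses); storing values column-by-column groups identical neighbors, so low-cardinality columns compress dramatically:

```python
def rle_encode(column):
    """Run-length encode a column as [(value, run_length), ...]."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def rle_decode(runs):
    """Expand runs back into the original column."""
    return [v for v, n in runs for _ in range(n)]

col = ["US", "US", "US", "UK", "UK", "US"]
encoded = rle_encode(col)
print(encoded)  # [('US', 3), ('UK', 2), ('US', 1)]
assert rle_decode(encoded) == col
```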

Looking ahead, we are currently implementing additional optimizations, such as byte-code compilation, which we believe will further improve performance.

Shark challenges the assumption that MapReduce-like systems must come at the cost of performance: it achieves performance competitive with traditional RDBMS systems, while retaining the fault tolerance, schema flexibility, and advanced programming models so core to the big data movement.

Shark is already in use today in a number of datacenters. If you want to try Shark, you can find it at http://shark.cs.berkeley.edu.  Also, be sure to join the Shark Users mailing list to join in the conversation and keep up with the latest developments.

Shark 0.2 Released and 0.3 Preview


I am happy to announce that the next major release of Shark, 0.2, is now available. Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 30 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.

We released Shark 0.2 on Oct 15, 2012. The new version is much more stable and also features significant performance improvements. The full release notes are posted on Shark’s github wiki, but here are some highlights:

  • Hive compatibility: Hive version support is bumped up to 0.9, and UDFs/UDAFs are fully supported and can be distributed to all slaves using the ADD FILE command.
  • Simpler deployment: Shark 0.2 works with Spark 0.6’s standalone deployment mode, which means you can run Shark in cluster mode without depending on Apache Mesos. We also worked on simplifying deployment: you can now set up a single-node Shark instance in 5 minutes and launch a cluster on EC2 in 20 minutes.
  • Thrift server mode: Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server, which is compatible with Hive’s Thrift server.
  • Performance improvements: We rewrote Shark’s join and group-by code; some workloads can see a 2X speedup.

In addition to the 0.2 release, we are working on the next major version, 0.3, expected to be released in November. Below is a preview of some of the features:

  • Columnar compression: We are adding fast columnar data compression. You can fit more data into your cluster’s memory without sacrificing query execution speed.
  • Memory Management Dashboard: We are working on a dashboard that shows key characteristics of the cluster, e.g. what tables are in memory versus on disk.
  • Automatic optimizations: Shark will automatically determine the right degree of parallelism and users will not have to worry about setting configuration variables.


Spark 0.6.0 Released


I’m happy to announce that the next major release of Spark, 0.6.0, is now available. Spark is a fast cluster computing engine developed at the AMP Lab that can run 30x faster than Hadoop using in-memory computing. This is the biggest Spark release to date in terms of features, as well as the biggest in terms of contributors, with over a dozen new contributors from Berkeley and outside. Apart from the visible features, such as a standalone deploy mode and Java API, it includes a significant rearchitecting of Spark under the hood that provides up to 2x faster network performance and support for even lower-latency jobs.

The major focus points in this release have been accessibility (making Spark easier to deploy and use) and performance. The full release notes are posted online, but here are some highlights:

  • Simpler deployment: Spark now has a pure-Java standalone deploy mode that lets it run without an external cluster manager, as well as experimental support for running on YARN (Hadoop NextGen).
  • Java API: exposes all of Spark’s features to Java developers in a clean manner.
  • Expanded documentation: a new documentation site, http://spark-project.org/docs/0.6.0/, contains significantly expanded docs, such as a quick start guide, tuning guide, configuration guide, and detailed Scaladoc help.
  • Engine enhancements: a new, custom communication layer and storage manager based on Java NIO provide improved performance for network-heavy operations.
  • Debugging enhancements: Spark’s logs now indicate which line of your code each operation corresponds to.

As mentioned above, this release is also the work of an unprecedentedly large set of developers. Here are some of the people who contributed to Spark 0.6:

  • Tathagata Das contributed the new communication layer, and parts of the storage layer.
  • Haoyuan Li contributed the new storage manager.
  • Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.
  • Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.
  • Josh Rosen contributed the Java API, as well as several bug fixes.
  • Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.
  • Reynold Xin contributed numerous bug and performance fixes.
  • Imran Rashid contributed the new Accumulable class.
  • Harvey Feng contributed improvements to shuffle operations.
  • Shivaram Venkataraman improved Spark’s memory estimation and wrote a memory tuning guide.
  • Ravi Pandya contributed Spark run scripts for Windows.
  • Mosharaf Chowdhury provided several fixes to broadcast.
  • Henry Milner pointed out several bugs in sampling algorithms.
  • Ray Racine provided improvements to the EC2 scripts.
  • Paul Ruan and Bill Zhao helped with testing.

We’re very proud of this release, and hope that you enjoy it. You can grab the code at http://www.spark-project.org/release-0.6.0.html.

AMP Camp Follow-up and Preview of What’s Next


In August, we hosted the first AMP Camp “Big Data bootcamp” and it was a huge success, with a sold-out auditorium and over 3,000 folks tuning in to watch via live streaming!

However, hosting the AMP Camp was only one part of our plan to engage you and your peers in the Big Data community. Many of you heard about BDAS, the Berkeley Data Analytics Stack (pronounced “bad-ass”), for the first time at AMP Camp. Well, we are busy working on new releases of the existing BDAS components (e.g., Spark 0.6.0 is just around the corner) as well as entirely new components that we will announce soon. Additionally, in the spirit of AMP Camp, we will continue releasing free online Big Data training materials, videos, and tutorials. And of course, we have already begun planning the next AMP Camp.

We archived videos of all of the AMP Camp talks and their corresponding slides; find links to both on the AMP Camp Agenda page. We also published the hands-on Big Data exercises we released during AMP Camp, which walk you through setting up the BDAS stack on EC2 and running a parallel machine learning algorithm (K-Means clustering) on a real Wikipedia dataset.
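The clustering in those exercises is Lloyd’s K-Means algorithm. A single-machine Python sketch of one iteration (the tutorial, of course, runs the distributed Spark implementation over the Wikipedia data):

```python
def kmeans_step(points, centers):
    """One iteration of Lloyd's K-Means: assign each point to its
    nearest center, then move each center to the mean of its points."""
    clusters = [[] for _ in centers]
    for p in points:
        # squared Euclidean distance to each center
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        clusters[dists.index(min(dists))].append(p)
    # recompute each center as the mean of its cluster (keep old center
    # if a cluster received no points)
    return [
        [sum(xs) / len(xs) for xs in zip(*cluster)] if cluster else list(c)
        for cluster, c in zip(clusters, centers)
    ]

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans_step(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

In Spark, the assignment step is a parallel map over the cached dataset and the mean computation a reduce, which is why the iterative algorithm benefits so much from in-memory computing.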

During the actual event, we used a Piazza class to answer questions immediately as they were asked. The Piazza class from the first AMP Camp will remain available for archival purposes. If you are not already enrolled in the class (required in order to access the Q&A archive) and would like to be, please let us know by sending an email to ampcamp@cs.berkeley.edu.

Just how successful was the first AMP Camp? Check out some of the event statistics for yourself:

  • 27,594 people visited the AMP Camp site in the 4 day window surrounding the event
  • 4,969 people registered to view AMP Camp live streaming
  • Over 3,000 people tuned into the AMP Camp live stream
  • 1,277 people enrolled in the AMP Camp Piazza class
  • The average AMP Camp teacher/TA response time to questions asked in Piazza was 7 minutes

If you are interested in staying up to date with what we’re working on, join the AMP Mailing list. We look forward to being in touch with you again soon!

Can Social Networks Rapidly Transmit Knowledge?


How many of our friends and neighbors are aware of California’s Proposition 30? Despite the high stakes and its potential impact on everyone from students and parents to business leaders, many Californians are not yet aware of it. Can social networks rapidly transmit knowledge about timely issues like this?

In 2009, the DOD launched an experiment to see if social networks could be mobilized to address time-critical tasks. Within three days, a team from MIT recruited thousands of people to help, and they successfully located the DOD’s 10 red weather balloons within nine hours. One key to their success was their recursive incentive mechanism.  These are well-known in marketing as pyramid schemes, but they are also useful to extract information from social networks. A Nature article published last week shows that social media, in particular messages from close friends, can have a small but significant influence on voting behavior which can make a difference in tight races. Polls show that voters who are aware of Proposition 30 are split at close to 50/50.
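That recursive incentive halved the reward at each step up the referral chain: the finder of a balloon received $2,000, the person who recruited the finder $1,000, their recruiter $500, and so on. A short sketch of the payout rule:

```python
def payouts(chain, finder_reward=2000.0):
    """Recursive incentive from the balloon challenge: chain[0] is the
    finder; each successive recruiter up the chain earns half as much
    as the person they recruited."""
    return {person: finder_reward / 2 ** depth
            for depth, person in enumerate(chain)}

# Alice found a balloon; Bob recruited Alice; Carol recruited Bob.
print(payouts(["Alice", "Bob", "Carol"]))
# {'Alice': 2000.0, 'Bob': 1000.0, 'Carol': 500.0}
```

Because the geometric series sums to less than twice the finder’s reward, the total payout stays bounded no matter how long the recruitment chain grows, while everyone along it still has an incentive to recruit.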

Can an intangible incentive structure be designed to rapidly mobilize citizens to learn about a pressing issue? Working with the AMP Lab and the CITRIS Data and Democracy Initiative, we are exploring this question with a project that studies how awareness spreads across populations.

The website for the Proposition 30 Awareness Project allows visitors to register their awareness, then receive a custom web link to share with their friends and family using email, Facebook or Twitter. Visitors can return at any time to monitor their growing influence graph and track their influence score.   Influence is computed using a variant of the Kleinberg and Raghavan algorithm.

This experiment is the first step toward a general-purpose tool that will allow citizens to initiate their own awareness campaigns on any issue. The first example emphasizes awareness of Proposition 30 and does not advocate a position. It includes links to the California Voters Guide and to campaigns on both sides of the issue. Visitors may also indicate their position for or against the proposition and join an online discussion afterward.

We welcome you to participate in the study by visiting: http://tinyurl.com/prop30p

Getting Hands-On With Big Data


One of my favorite things about working in the AMPLAb is our focus on extending impact beyond the academic community and into the “real world”. This coming week, we will host the first UC Berkeley AMP Camp, a practitioner’s guide to cutting-edge large scale data management. The AMP Camp will feature hands-on tutorials on big data analysis using the AMPLab software stack, including Spark, Shark, and Mesos. These tools work hand-in-hand with technologies like Hadoop to provide high performance, low latency data analysis. AMP Camp will also include high level overviews of warehouse scale computing, presentations on several big data use-cases, and talks on related projects in the lab (see the full agenda).

AMP Camp will be live streamed directly from the UC Berkeley campus on Tuesday, August 21st and Wednesday, August 22nd. Participants will be able to engage in real-time tutorials and interact with instructors directly through Piazza. Attending the live-streamed AMP Camp is completely free and requires only a brief registration. Around 200 attendees will also participate on-site (unfortunately, on-site registration has sold out!).

Modern “big data” infrastructure increasingly consists of open-source projects with steep learning curves. Hands-on instruction is a critical prerequisite for understanding how to use these technologies and tie them together to extract value from massive data sets. Through AMP Camp and our broader community development initiatives, the AMPLab hopes to accelerate adoption of our Berkeley Data Analytics Stack (BDAS) and educate the community at large about data management.

We hope you all register for the AMP Camp live stream – see you on the 21st!

Carat Now Available for iOS and Android


Carat aims to detect energy bugs—app behavior that is consuming energy unnecessarily—using data collected from a community of mobile devices. As featured on TechCrunch, our mobile app generates personalized recommendations for improving battery life and is now available for iOS and Android devices.

Users install and run the Carat app on their mobile phone, tablet, or music player. The app intermittently takes measurements about the device and how it is being used. These measurements are sent to our servers, which perform a statistical analysis. The analysis, a Spark application running on Amazon Web Services, infers how devices are using energy and under what circumstances. After considering a few days’ worth of data, the personalized results are sent back to users in the form of actionable advice about how to improve their battery life.
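At its heart, the inference compares battery drain rates conditioned on whether an app is running. A deliberately simplified Python sketch of the idea (Carat’s real analysis compares full drain-rate distributions with confidence intervals across the whole community, not just means):

```python
def mean(xs):
    return sum(xs) / len(xs)

def energy_hog(drain_with_app, drain_without_app, threshold=1.25):
    """Flag an app if the average battery drain rate (% per hour)
    while the app runs is markedly higher than without it.
    The 1.25x threshold is an arbitrary illustrative choice."""
    return mean(drain_with_app) > threshold * mean(drain_without_app)

with_app = [9.0, 11.0, 10.0]    # %/hour samples while the app runs
without_app = [5.0, 6.0, 7.0]   # %/hour samples otherwise
print(energy_hog(with_app, without_app))  # True
```

Conditioning the same comparison on device features (OS version, WiFi vs. GPS availability) is what lets the analysis move from detection toward diagnosis.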

Using data from our beta deployments, we have already identified thousands of energy bugs manifesting in the wild. Carat not only detects such misbehavior, but assists diagnosis by correlating with device features like the operating system version and what resources are available (e.g., WiFi or GPS). Some of these bugs caused afflicted users to see battery life that was shorter by hours, even compared to other users of the same app (but where the bug did not trigger). Users were told what misbehavior Carat found on their devices and how to fix it, such as by restarting the app or upgrading the OS.

We are excited to finally be live on both Apple’s App Store and Google’s Play Store. Please try out the app, leave positive ratings on the respective stores to help us spread the word, and send us any comments or questions. Thank you!