AMP BLAB: The AMPLab Blog

Comparing Large Scale Query Engines

A major goal of the AMPLab is to provide high performance for data-intensive algorithms and systems at large scale. Currently, AMP has projects focusing on cluster-scale machine learning, stream processing, declarative analytics, and SQL query processing, all with an eye towards improving performance relative to earlier approaches.

Whenever performance is a stated goal, measuring progress becomes a critical component of innovation. After all, how can one hope to improve that which is not measured [1]?

To meet this need, today we’re releasing a hosted benchmark that compares the performance of several large-scale query engines. Our initial benchmark is based on the well-known 2009 Hadoop/RDBMS benchmark by Pavlo et al. It compares the performance of an analytic RDBMS (Redshift) to Hadoop-based SQL query engines (Shark, Hive, and Impala) on a set of relational queries.

The write-up includes performance numbers for a recent version of each framework. An example result, which compares response time of a high-selectivity scan, is given below:

[Figure: benchmark result comparing response times for the high-selectivity scan query]
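
For reference, the scan query in that benchmark takes roughly the following form. The sketch below shows how one might issue it from Shark’s Scala shell, where a SharkContext `sc` is predefined; the table schema follows Pavlo et al.’s “rankings” table, but the pageRank cutoff here is illustrative rather than the benchmark’s exact parameter.

```scala
// Benchmark-style scan from the Shark shell (SharkContext `sc` predefined).
// sql2rdd runs a SQL query and returns the result as a Spark RDD; the
// cutoff value controls the selectivity of the scan.
val scan = sc.sql2rdd(
  "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000")
println(scan.count)  // number of qualifying rows
```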

Because these frameworks are all undergoing rapid development, we hope this will provide a continually updated yardstick with which to track improvements and optimizations.

To make the benchmark easy to run frequently, we have hosted it entirely in the cloud: We’ve made all datasets publicly available in Amazon’s S3 storage service and included turn-key support for loading the data into hosted clusters. This allows individuals to verify the results or even run their own queries. It is entirely open source, and we invite people to fork it, improve it, and contribute new frameworks and updates. Over time, we hope to extend the workload to include advanced analytics like machine learning and graph processing.

One thing to keep in mind: performance isn’t everything. Capabilities, maturity, and stability are also very important; the best performing system may not be the most useful. Unfortunately, these other characteristics are also harder to measure!

We hope you check out the benchmark; it is hosted on the AMP website and will be updated regularly.

Jenkins: Our Dutiful Software Butler

As the pace of development on the Berkeley Data Analytics Stack (BDAS) increases and the number of contributors grows, we recognize the importance of keeping code quality high. To automate testing, we’ve set up Jenkins: a system that monitors and tests every check-in to our source repositories.

We’re making our installation of Jenkins available to everyone and welcome any feedback you have on the system. There’s also a link in the footer of our web site to help you find Jenkins in the future.

If you’re an open-source contributor, you’ll notice that Jenkins now tests each github pull request and posts a comment with results (as github user AmplabJenkins). This system will allow us to be more responsive to pull requests and handle the increased volume of development.

What you see now is just the beginning. We currently have automated tests for Spark, Mesos, and Tachyon, and we’ll be adding more BDAS components soon. In addition, thanks to generous support from Amazon Web Services and VMware, we’ll soon be running scaling and performance tests on clusters in-house and in the cloud. We’ve also had a number of companies offer to do coordinated testing using their infrastructure.

I’d also like to thank Thomas Marshall, Jey Kottalam, Haoyuan Li and Jon Kuroda for the work they did to bring up Jenkins and expand the test coverage for our stack. The AMPLab is committed to turning the research software developed at Berkeley into production-ready software.

AMP Camp Hands-on Big Data Mini Course now online

In this post, I will first summarize AMP Camp Two, the Big Data Training course we recently ran at the Strata Big Data California conference, and then introduce our newly released online Big Data Mini Course.

In February we had the rare opportunity to host AMP Camp Two as part of the O’Reilly Strata Conference on Big Data in Santa Clara, CA.

AMP Camp Two was a full-day event consisting of two tutorials. The first tutorial consisted of talks presenting an overview of the Berkeley Data Analytics Stack (BDAS), Spark, Spark Streaming, Shark, and machine learning algorithms built on Spark. In the second tutorial, we provided each attendee with an EC2 cluster running Spark, Shark, and Spark Streaming, and guided them through a set of hands-on exercises analyzing real Wikipedia and Twitter data.


[Photo: Attendees getting hands-on, analyzing real data with Spark and Shark in our second Strata tutorial]

The event was a great success! Both tutorials were packed and we received overwhelmingly positive feedback. Details about AMP Camp Two, including slides from the talks, are archived on the AMP Camp website.

Big Data Mini Course

As a follow-up to AMP Camp Two, we have posted an extensively revised and expanded hands-on AMP Camp Mini Course. It walks you through setting up a cluster on EC2 using your own Amazon credentials, using Spark and Shark to run ad-hoc analytics on real Wikipedia data, writing a Spark Streaming job that processes data collected via the Twitter API, and implementing a more advanced machine learning clustering algorithm. We hope you take the opportunity to learn more about some of the most exciting open-source data analytics tools.
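
For a flavor of the exercises, here is a minimal Spark snippet in the style of the course’s Wikipedia section, assuming the Spark shell (where a SparkContext `sc` and the RDD implicits are in scope); the dataset path and log format below are assumptions, and the course’s own instructions are authoritative.

```scala
// Hypothetical path to the Wikipedia traffic logs used in the exercises.
val pagecounts = sc.textFile("hdfs://.../wikistats/pagecounts")
// Assumed line format: "<project> <pageTitle> <numRequests> <bytes>"
val topEnPages = pagecounts
  .map(_.split(" "))
  .filter(fields => fields(0) == "en")            // English Wikipedia only
  .map(fields => (fields(1), fields(2).toInt))    // (title, request count)
  .reduceByKey(_ + _)                             // total requests per title
  .map { case (title, count) => (count, title) }
  .sortByKey(false)                               // sort descending by count
println(topEnPages.take(10).mkString("\n"))       // ten most-requested pages
```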

Carat Hits Half a Million Devices

A few weeks ago, the Carat Project celebrated our mobile energy diagnosis app having been installed on more than 500,000 devices. Of these devices, 59% were running iOS and 41% were running Android. Tens of thousands of these devices actively use Carat every week.

The project continues to generate excitement. For example, it was recently featured in a piece about mobile battery life on ABC affiliate 7News in Denver. (Click on the Video link in the top left of the article to view the piece that aired; the bit about Carat and the AMP Lab starts at 1:26.)

The data has a positive story to tell about how well Carat’s predictions match reality and improve users’ battery life. After receiving their first report, Carat users see an average improvement in battery life of 10% after 10 days and of more than 30% after 90 days. These numbers include users who ignore Carat’s recommendations, so the actual improvements may be even higher. As for the accuracy of the battery life improvements Carat projects (e.g., “Kill app X to get 45m ± 5m, with 95% confidence”), 95.4% fall within the specified bounds (i.e., when users kill app X, 95.4% of the time the actual improvement they see is between 40m and 50m).

We are working to extend Carat’s diagnosis to provide more information to users, including what device configurations or behaviors might be problematic, and to reduce the latency between installation and receiving personalized reports.

Spark 0.7.0 Released

We’re proud to announce the release of Spark 0.7.0, a new major version of Spark that adds several key features, including a Python API for Spark, an alpha of Spark Streaming, and numerous improvements across the board. This release is the result of the largest group of contributors yet behind a Spark release — 31 contributors in total, of which 20 were external to Berkeley. Head over to the release notes to read more about the new features, or download the code.
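
As a taste of the new streaming alpha, here is a word-count sketch against the 0.7.0 API, assuming a text source on localhost:9999 (e.g., started with `nc -lk 9999`); package and method names are as we recall them for the alpha and may shift in later releases.

```scala
import spark.streaming.{Seconds, StreamingContext}
import spark.streaming.StreamingContext._  // implicit pair-DStream operations

// One-second batches; "local[2]" leaves a core free for the receiver.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()   // emit the counts computed for each one-second batch
ssc.start()
```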

I’d also like to thank the people who contributed to the release: Mikhail Bautin, Denny Britz, Paul Cavallaro, Tathagata Das, Thomas Dudziak, Harvey Feng, Stephen Haberman, Tyson Hamilton, Mark Hamstra, Michael Heuer, Shane Huang, Andy Konwinski, Ryan LeCompte, Haoyuan Li, Richard McKinley, Sean McNamara, Lee Moon Soo, Fernand Pajot, Nick Pentreath, Andrew Psaltis, Imran Rashid, Charles Reiss, Josh Rosen, Peter Sankauskas, Prashant Sharma, Shivaram Venkataraman, Patrick Wendell, Reynold Xin, Haitao Yao, Matei Zaharia, and Eric Zhang. The growing number of external contributors is a trend we are very proud of at the AMPLab, as our goal is to continue building a leading open source data analytics stack over the next five years. We hope to see even more contributors to Spark in the future!

For Big Data, Moore’s Law Means Better Decisions

Data Drives Decisions

Today, more and more organizations collect more and more data, and they do so with one goal in mind: extracting value. In most cases, this value comes in the form of decisions. There are myriad examples of data driving decisions: (1) monitoring network traffic to detect and defend against a cyber attack, (2) using clinical and genomic data to provide personalized medical treatments, (3) mining system logs to diagnose SLA violations and optimize the performance of large-scale systems, and (4) analyzing user logs and feedback to decide what ads to show.

While making decisions based on this huge volume of data is a big challenge in itself, even more challenging is the fact that the data grows faster than Moore’s law. According to one recent report, data is expected to grow 64% every year, and some categories of data, such as the data produced by particle accelerators and DNA sequencers, grow much faster. This means that, in the future, we will need more resources (e.g., servers) just to make the same decision!

Approximate Answers, Sampling, and Moore’s Law

Our key insight in addressing the challenge of data growing faster than Moore’s law is that decisions do not always need exact answers. For example, to detect an SLA violation of a web service, we do not need to know the exact value of the quality metric (e.g., response time); we only need to know that this value exceeds the SLA. In many instances, approximate answers are good enough, as long as their errors are small and bounded.

And there is another strong reason why approximate answers make sense. Even if computations are exact, they cannot always guarantee perfect answers, simply because the data inputs are often noisy. Today, many inputs are either human-generated and uncurated (e.g., user logs, tweets, comments on social networks) or come from sensors that exhibit inherent measurement errors (e.g., thermometers, scales, DNA sequencers). This means that if we can keep the computation error much smaller than the input error, it won’t meaningfully affect the answer’s quality.

Accommodating approximate answers allows us to run computations on a sample of the input data, instead of on the complete data. This is important because the computation error typically depends on the sample size, not on the input size. In particular, the standard error of many computations is inversely proportional to √S, where S is the sample size (assuming the sample is i.i.d.). This means that the growth of the data is no longer a challenge: even if the data grows faster than Moore’s law, the error of the computation actually decreases.

This is simply because Moore’s law allows us to process bigger and bigger samples, which means smaller and smaller errors. In particular, Moore’s law allows us to halve the error every three years.
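
As a back-of-the-envelope check of that claim: if the error scales as the inverse square root of the sample size, and Moore’s law doubles the sample we can process every 18 months, then over three years the sample grows fourfold and the error halves.

```latex
\mathrm{err}(S) \propto \frac{1}{\sqrt{S}}, \qquad
S(t) = S_0 \cdot 2^{t/1.5} \quad (t \text{ in years})
\quad\Rightarrow\quad
\mathrm{err}\bigl(S(3)\bigr) \propto \frac{1}{\sqrt{4 S_0}}
  = \frac{1}{2} \cdot \frac{1}{\sqrt{S_0}}
```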

Thus, accommodating approximate answers can change the way we think about big data processing. Indeed, in this context, Moore’s law allows us to:

  1. Improve the accuracy of the computation;
  2. Compute the same result faster;
  3. Use the extra cycles to improve the analysis, possibly by developing and running new analytics tools, or using more features in the datasets.

Challenges in Approximate Big Data Processing

Of course, supporting error-bounded computations comes with its own challenges. Arguably the biggest one is computing the error bounds themselves. While this is relatively easy for aggregate operators such as sum, count, average, and percentiles, it is far from trivial for arbitrary computations. One possibility would be to use the bootstrap, a very general and powerful technique that can compute the error of any function that is Hadamard differentiable. However, this technique is very expensive (e.g., it requires running the computation 100-200 times on resamples of the same size as the original sample). Furthermore, verifying whether the function implemented by a computation is Hadamard differentiable is itself far from trivial.
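
To make the cost concrete, here is a minimal sketch of the basic bootstrap procedure the paragraph describes (our own illustration, not BlinkDB code): draw a few hundred resamples of the original sample with replacement, apply the computation to each, and read the error off the spread of the results.

```scala
import scala.util.Random

// Estimate the standard error of `estimator` on `sample` via the bootstrap.
def bootstrapStdError(sample: Array[Double],
                      estimator: Array[Double] => Double,
                      resamples: Int = 200): Double = {
  val rng = new Random(42)
  val estimates = (1 to resamples).map { _ =>
    // Resample with replacement, same size as the original sample.
    val resample = Array.fill(sample.length)(sample(rng.nextInt(sample.length)))
    estimator(resample)
  }
  val mean = estimates.sum / resamples
  math.sqrt(estimates.map(e => (e - mean) * (e - mean)).sum / (resamples - 1))
}

// Example: error bound on the mean of a (tiny) sample.
val se = bootstrapStdError(Array(1.0, 4.0, 2.0, 8.0, 5.0), s => s.sum / s.length)
```

Note the 200 invocations of the estimator: that per-query overhead is exactly what makes the plain bootstrap expensive at scale.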

Within the AMPLab we are aiming to address these challenges by developing new algorithms and systems. Recently, we developed Bag of Little Bootstraps (BLB), a new method for assessing the quality of an estimator that has convergence properties similar to the bootstrap’s but is far more scalable. Currently, we are developing new diagnosis techniques to verify whether the bootstrap or BLB is “working” for a given computation, data distribution, and sample size. Finally, we are designing and developing BlinkDB, a highly scalable approximate query engine that leverages these techniques, as part of the Berkeley Data Analytics Stack (BDAS).

MLbase: A User-friendly System for Distributed Machine Learning

Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming—many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives.

To address these issues we are building MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
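
To illustrate point (1), here is a Scala-flavored rendering of the kind of declarative task specification MLbase aims for. The names below (`doClassify`, `TrainedModel`) are hypothetical stand-ins, not MLbase’s real API; the point is that the user names the task and the data, never the algorithm.

```scala
// Hypothetical stub standing in for MLbase's optimizer-backed operator.
case class TrainedModel(predict: Array[Double] => Double, summary: String)

def doClassify(features: Array[Array[Double]],
               labels: Array[Double]): TrainedModel = {
  // MLbase's optimizer would search over algorithms and parameters here;
  // this stub just returns a majority-class baseline for illustration.
  val majority = if (labels.count(_ == 1.0) * 2 >= labels.length) 1.0 else 0.0
  TrainedModel(_ => majority, s"baseline: always predict $majority")
}

val features = Array(Array(0.1, 2.0), Array(1.3, 0.4), Array(0.9, 1.1))
val labels   = Array(1.0, 0.0, 1.0)
val model = doClassify(features, labels)  // the user never names the algorithm
println(model.summary)
```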

MLbase will ultimately provide functionality to end users for a wide variety of common machine learning tasks: classification, regression, collaborative filtering, and more general exploratory data analysis techniques such as dimensionality reduction, feature selection and data visualization. Moreover, MLbase provides a natural platform for ML researchers to develop novel methods for these tasks. Our vision paper describing MLbase has been accepted to the Conference on Innovative Data Systems Research (CIDR), and we will present MLbase at the NIPS workshop on Big Learning in December and at CIDR in January.  We also have several demos planned in the upcoming months. Please visit our project website for more details and/or to contact us.

The AMPLab is Hiring!

Open source has been a part of Berkeley culture since the 1970s, when Bill Joy assembled the original Berkeley Software Distribution (BSD). As a reader of this blog, you probably know first-hand the time and effort it takes to create quality open-source software.

Over the last year, the lab has seen exciting growth in the number of users and contributors. In order to keep code quality high, I’ve been hired to build a team of full-time engineers. We have two software engineering positions open immediately. Both positions require strong Linux skills and familiarity with EC2 and git. One position requires experience with one or more of Scala, Java, C++, Hadoop, Hive, and NoSQL databases, while the other will focus on automation, where knowledge of scripting, Maven, Jenkins, and rpm/deb packaging is important.

Berkeley has been good to me and I’m certain Berkeley will be good to you.

I come to Berkeley from Cloudera where I worked as a member of the Engineering team during the company’s first four formative years. I met the Cloudera founders through my contacts at Berkeley. In the past, I worked as a Software Engineer at Berkeley doing Grid and cluster research and founded the Ganglia project: a popular monitoring system installed on millions of computers world-wide. I credit the success of Ganglia to the Berkeley focus on long-term design and open-source collaboration.

The AMPLab is an open and collaborative space with a startup feel. The faculty in the lab have shunned private offices to be more accessible and engaged. You’ll work side-by-side with graduate students who are such prolific designers and engineers that it’s easy to forget they’re working on a Ph.D. and have a full course load. The lab has been an incubator for an impressive array of software projects: Spark, Shark, Apache Mesos, and SNAP, just to name a few.

As a member of the team, you’ll get an inside look at the new Big Data innovations our sponsors are working on. Additionally, Silicon Valley startups and companies regularly come to Berkeley for tech talks. You’ll not only be informed and intellectually stimulated; you’ll also have a closet full of the latest tech T-shirts.

The lab draws support from a five-year, $10M NSF “Expeditions in Computing” grant, announced by the White House as part of its “Big Data” research initiative; a 4.5-year, $5M DARPA XData contract; and over $7M (to date) from industry sources. These investments show the faith that both the private and public sectors have in the AMPLab to build a comprehensive software stack to meet the new Big Data challenges.

How to Apply

If you apply soon, you’ll be able to join us for our winter retreat at Squaw Valley Ski Resort in January. We’re staying slope-side so bring your snowboard or skis!

Visit jobs.berkeley.edu and search for “AMPLab” in the Basic Job Search form to find the positions. Please feel free to contact me at massie@cs.berkeley.edu, if you have any questions or issues applying.

A Snapshot of Database Research in the AMPLab

*** SPOILER ALERT:  If you are an ACM SIGMOD reviewer and are concerned about anonymous “double blind” reviewing, stop reading here***

As the dust settles after a hectic ACM SIGMOD paper submission deadline, I find myself reflecting on the variety, scope, and overall coolness of the Database Systems work that has been coming out of the lab at an increasing pace in recent months. Topics include:

  • scalable query processing and machine learning in (what used to be called) NoSQL systems,
  • crowdsourced data management,
  • consistency and performance guarantees in multi-tenant and multi-data-center environments, and
  • low latency processing of streaming and high-velocity data.

The activity level during the lead-up to the SIGMOD deadline (which, I’m sure, will correlate with a spike in our cloud computing bills) was fantastic. It reminded me of my own grad school days, when the deadline was the cause of a week of all-nighters and a lot of drama for a big group of us on the top floor of the CS building in Madison. In the AMPLab, we’ve routinely had flurries of activity around deadlines for major Systems and Machine Learning conferences. This was the biggest push we’ve had yet in the Database area, and the commitment of the students involved was truly remarkable.

If you don’t follow such things, SIGMOD is one of the top international conferences in the Database Systems research field, and it requires that papers be anonymized. Ultimately, fewer than 1 in 5 papers will be accepted, and the results aren’t announced until February and April, with the conference itself held in June. It’s a bit frustrating and (I believe) counter-productive to have such a long “quiet period” in a fast-moving field like ours, but those are the rules the community has chosen to play by, at least for the time being.

In any case, I wanted to take the opportunity to give an overview of some of our current database systems work, as it is happening.  Below, I list topics and say a bit about some of the specific projects.  I should mention that while the AMPLab was the hub of this activity, much of this work was done in collaboration with friends and colleagues in industry and academia from the East Coast,  Europe, and Asia.

New Architectures for High Performance Databases

Much has been written in the past couple of years about the relative advantages of traditional SQL-based parallel database systems and MapReduce (MR)-based platforms (which may or may not run SQL). The poor performance of Hadoop MR in many important scenarios has caused many observers to incorrectly conclude that such limitations are somehow fundamental to the MR paradigm. Recent work on the Shark system has shown how to get MPP database-level performance on an HDFS and MR platform (using Spark rather than Hadoop MR), thereby retaining the fault-tolerance and flexibility benefits of those platforms.

Another project in the lab is exploring policies and implementation techniques for aging cold data from in-memory database systems to disk or other secondary storage.  The idea is to exploit access patterns and application-level semantics in order to maximize the benefit of an in-memory approach.

Performance and Consistency Guarantees with Multitenancy and Big Data

Cloud computing and hosted solutions raise a number of challenges for data system design and several projects in the lab are addressing these challenges.   The PIQL project defined the notion of “Scale Independence” as a core concept for enabling database designs to scale over many orders of magnitude in terms of data size, user population and overall application popularity.

Another source of performance and predictability concerns in modern hosted data environments is multitenancy, where computing and storage resources are shared across a heterogeneous and constantly changing set of separate data-intensive applications. Among the many challenges of multi-tenant environments is the efficient placement of data and processing, particularly when fault tolerance is required (and when isn’t it?). The Mesos system addresses such issues for processing and storage. In other work, we are now moving up the stack, addressing multi-tenancy for long-lived data-intensive applications by focusing on tenant placement algorithms in large clusters.

The Durability of Consistency

After years of hearing about how consistency was old-fashioned, we are now seeing renewed interest in understanding consistency issues and finding ways to get consistency guarantees when they are needed, whether data is spread across racks of hodge-podge machines or across continents. Recent work in the MDCC project has shown how to “make the common case fast” by carefully designing a new consistency protocol. MDCC has also developed a new programming interface to transactional systems that gives application developers much more information and control, allowing them to build more dynamic and adaptive applications.

The PBS project has also made quite a splash by proposing a probabilistic view of consistency and showing how various design decisions impact the likelihood that an eventually consistent system will demonstrate anomalous behavior.
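
A toy Monte Carlo conveys the flavor of this probabilistic view (our own sketch, much simpler than PBS’s actual model): with N replicas, a write that has so far reached W of them, and a read that contacts R random replicas, we can estimate the chance the read misses the latest write. Classic quorum theory guarantees overlap only when R + W > N.

```scala
import scala.util.Random

// Estimate P(stale read) for an n-replica system where the latest write
// currently resides on `w` replicas and a read contacts `r` random replicas.
def staleReadProbability(n: Int, w: Int, r: Int, trials: Int = 100000): Double = {
  val rng = new Random(0)
  val stale = (1 to trials).count { _ =>
    val fresh = rng.shuffle((0 until n).toList).take(w).toSet // replicas with the write
    val read  = rng.shuffle((0 until n).toList).take(r)       // replicas contacted
    !read.exists(fresh.contains)                              // saw no fresh replica
  }
  stale.toDouble / trials
}

// n=5, w=2, r=2 violates R + W > N, so some reads are stale:
println(staleReadProbability(5, 2, 2))  // ~0.3, i.e., C(3,2)/C(5,2)
```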

Crowdsourced Query and Data Wrangling

With all the advances in scalable data management and machine learning happening in the lab, it’s easy to lose sight of the People part of the AMP agenda. Fortunately, a number of ongoing projects involving crowds and human computation have started to bear fruit. Current projects include advanced work on building hybrid human/machine entity resolution systems, the integration of crowdsourcing and active learning techniques, and a new architecture for having people help discover the underlying structure of queries and data.

Analytics and Learning over Streaming Data

We’ve also got a bunch of really interesting stuff going on around streaming data (another favorite topic of mine), but this will have to wait for another post.

The intent of this post was not to be comprehensive, but rather to give an idea of the range of data management work going on in the lab. Hopefully others will post updates on other projects and areas of interest. And of course, we will be discussing all of this work and more at our upcoming Winter Research Retreat at Lake Tahoe in January.

Low-Latency SQL at Scale: A Performance Analysis of Shark

One of the AMP Lab’s goals is to bring interactive performance, similar to that available in smaller-scale databases, to Big Data systems, while retaining the flexibility and scalability that makes them compelling. To this end, we have developed Shark, a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

Our new tech report, Shark: SQL and Rich Analytics at Scale, describes Shark’s architecture and demonstrates how Shark executes SQL queries from 10 to over 100 times faster than Apache Hive, approaching the performance of parallel RDBMSs, while offering improved fault-tolerance, schema flexibility, and machine-learning integration.

Shark outperforms Hive/Hadoop on both disk- and memory-based datasets. As a sample result, we present Shark’s performance versus Hive/Hadoop on two real SQL queries from an early industrial user and one iteration of a logistic regression classification algorithm (results are for a 100-node EC2 cluster):

[Figure: Shark vs. Hive/Hadoop query times in seconds on the two user queries and the logistic regression iteration]

The full report features several other benchmarks, including micro-benchmarks using TPC-H data and real queries from the aforementioned industrial user.

Several innovative optimizations are behind these performance results, including

  • In-Memory Columnar Storage, which stores data in a compressed, memory-efficient format (see the sketch after this list),
  • Partial DAG Execution, a technique that enables several runtime optimizations, including join algorithm selection and skew handling,
  • Fine-grained fault-tolerance, which allows Shark to recover from mid-query failures within seconds using Spark’s Resilient Distributed Dataset (RDD) abstraction, and
  • Efficient Parallel Joins, which are necessary when joining two or more large tables.
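
As a taste of the first of these optimizations, the sketch below marks a table as cached so Shark keeps it in the in-memory columnar format. It assumes the Shark shell, where a SharkContext `sc` is predefined and `sql` returns result rows as strings; the table and column names are hypothetical, and the `_cached` table-name suffix is Shark’s convention for in-memory tables.

```scala
// Build an in-memory, columnar copy of last month's logs (names are ours).
sc.sql("CREATE TABLE logs_cached AS SELECT * FROM logs WHERE day >= '2012-12-01'")

// Subsequent queries against logs_cached scan compressed in-memory columns.
val rows = sc.sql("SELECT count(*) FROM logs_cached WHERE status = 500")
println(rows.head)
```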

Looking ahead, we are currently implementing additional optimizations, such as byte-code compilation, which we believe will further improve performance.

Shark challenges the assumption that MapReduce-like systems must come at the cost of performance: it achieves performance competitive with traditional RDBMS systems, while retaining the fault tolerance, schema flexibility, and advanced programming models so core to the big data movement.

Shark is already in use today in a number of datacenters. If you want to try Shark, you can find it at http://shark.cs.berkeley.edu. Also, be sure to join the Shark Users mailing list to join the conversation and keep up with the latest developments.