AMP BLAB: The AMPLab Blog

Shark 0.2 Released and 0.3 Preview

Reynold Xin

I am happy to announce that the next major release of Shark, 0.2, is now available. Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 30 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.

We released Shark 0.2 on Oct 15, 2012. The new version is much more stable and also features significant performance improvements. The full release notes are posted on Shark’s github wiki, but here are some highlights:

  • Hive compatibility: Hive version support is bumped up to 0.9, and UDFs/UDAFs are fully supported and can be distributed to all slaves using the ADD FILE command.
  • Simpler deployment: Shark 0.2 works with Spark 0.6’s standalone deployment mode, which means you can run Shark in cluster mode without depending on Apache Mesos. We also simplified deployment: you can now set up a single-node Shark instance in 5 minutes and launch a cluster on EC2 in 20 minutes.
  • Thrift server mode: Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server, which is compatible with Hive’s Thrift server.
  • Performance improvements: We rewrote Shark’s join and group-by code, and many workloads see a 2x speedup.

In addition to the 0.2 release, we are working on the next major version, 0.3, expected to be released in November. Below is a preview of some of the features:

  • Columnar compression: We are adding fast columnar data compression. You can fit more data into your cluster’s memory without sacrificing query execution speed.
  • Memory Management Dashboard: We are working on a dashboard that shows key characteristics of the cluster, e.g. what tables are in memory versus on disk.
  • Automatic optimizations: Shark will automatically determine the right degree of parallelism and users will not have to worry about setting configuration variables.

 

Spark 0.6.0 Released

Matei Zaharia

I’m happy to announce that the next major release of Spark, 0.6.0, is now available. Spark is a fast cluster computing engine developed at the AMP Lab that can run 30x faster than Hadoop using in-memory computing. This is the biggest Spark release to date in terms of features, as well as the biggest in terms of contributors, with over a dozen new contributors from Berkeley and outside. Apart from the visible features, such as a standalone deploy mode and Java API, it includes a significant rearchitecting of Spark under the hood that provides up to 2x faster network performance and support for even lower-latency jobs.

The major focus points in this release have been accessibility (making Spark easier to deploy and use) and performance. The full release notes are posted online, but here are some highlights:

  • Simpler deployment: Spark now has a pure-Java standalone deploy mode that lets it run without an external cluster manager, as well as experimental support for running on YARN (Hadoop NextGen); see the sketch after this list.
  • Java API: exposes all of Spark’s features to Java developers in a clean manner.
  • Expanded documentation: a new documentation site, http://spark-project.org/docs/0.6.0/, contains significantly expanded docs, such as a quick start guide, tuning guide, configuration guide, and detailed Scaladoc help.
  • Engine enhancements: a new, custom communication layer and storage manager based on Java NIO provide improved performance for network-heavy operations.
  • Debugging enhancements: Spark’s logs now indicate which line of your code each operation corresponds to.
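
To give a flavor of the new standalone deploy mode, here is a minimal sketch in Scala. The master URL and input path are illustrative, and we assume 7077 as the standalone master’s default port; see the 0.6.0 docs for the exact setup.

import spark.SparkContext

// Connect to a standalone-mode master instead of a Mesos cluster.
val sc = new SparkContext("spark://master.example.com:7077", "LineCount")
// A trivial job: count the lines that mention "Spark" in a file on HDFS.
val lines = sc.textFile("hdfs://namenode.example.com/data/pages.txt")
println(lines.filter(_.contains("Spark")).count())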

As mentioned above, this release is also the work of an unprecedentedly large set of developers. Here are some of the people who contributed to Spark 0.6:

  • Tathagata Das contributed the new communication layer, and parts of the storage layer.
  • Haoyuan Li contributed the new storage manager.
  • Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.
  • Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.
  • Josh Rosen contributed the Java API, as well as several bug fixes.
  • Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.
  • Reynold Xin contributed numerous bug and performance fixes.
  • Imran Rashid contributed the new Accumulable class.
  • Harvey Feng contributed improvements to shuffle operations.
  • Shivaram Venkataraman improved Spark’s memory estimation and wrote a memory tuning guide.
  • Ravi Pandya contributed Spark run scripts for Windows.
  • Mosharaf Chowdhury provided several fixes to broadcast.
  • Henry Milner pointed out several bugs in sampling algorithms.
  • Ray Racine provided improvements to the EC2 scripts.
  • Paul Ruan and Bill Zhao helped with testing.

We’re very proud of this release, and hope that you enjoy it. You can grab the code at http://www.spark-project.org/release-0.6.0.html.

AMP Camp Follow-up and Preview of What’s Next

Andy Konwinski

In August, we hosted the first AMP Camp “Big Data bootcamp” and it was a huge success, with a sold-out auditorium and over 3,000 folks tuning in to watch via live streaming!

However, hosting the AMP Camp was only one part of our plan to engage you and your peers in the Big Data community. Many of you heard about BDAS, the Berkeley Data Analytics Stack (pronounced “bad-ass”), for the first time at AMP Camp. Well, we are busy working on new releases of the existing BDAS components (e.g., Spark 0.6.0 is just around the corner) as well as entirely new components that we will announce soon. Additionally, in the spirit of AMP Camp, we will continue releasing free online Big Data training materials, videos, and tutorials. And of course, we have already begun planning the next AMP Camp.

We archived videos of all of the AMP Camp talks and their corresponding slides; find links to both on the AMP Camp Agenda page. We also published the hands-on Big Data exercises from AMP Camp, which walk you through setting up and using the BDAS stack on EC2 to run a parallel machine learning algorithm (k-means clustering) on a real Wikipedia dataset.

During the actual event, we used a Piazza class to answer questions immediately as they were asked. The Piazza class from the first AMP Camp will remain available for archival purposes. If you are not already enrolled in the class (required in order to access the Q&A archive) and would like to be, please let us know by sending an email to ampcamp@cs.berkeley.edu.

Just how successful was the first AMP Camp? Check out some of the event statistics for yourself:

  • 27,594 people visited the AMP Camp site in the 4-day window surrounding the event
  • 4,969 people registered to view AMP Camp live streaming
  • Over 3,000 people tuned into the AMP Camp live stream
  • 1,277 people enrolled in the AMP Camp Piazza class
  • The average AMP Camp teacher/TA response time to questions asked in Piazza was 7 minutes

If you are interested in staying up to date with what we’re working on, join the AMP Mailing list. We look forward to being in touch with you again soon!

Can Social Networks Rapidly Transmit Knowledge?

Ken Goldberg

How many of our friends and neighbors are aware of California’s Proposition 30? Despite the high stakes and its potential impact on everyone from students and parents to business leaders, many Californians are not yet aware of it. Can social networks rapidly transmit knowledge about timely issues like this?

In 2009, the DOD launched an experiment to see if social networks could be mobilized to address time-critical tasks. Within three days, a team from MIT recruited thousands of people to help, and they successfully located the DOD’s 10 red weather balloons within nine hours. One key to their success was their recursive incentive mechanism. Such mechanisms are well known in marketing as pyramid schemes, but they are also useful for extracting information from social networks. A Nature article published last week shows that social media, in particular messages from close friends, can have a small but significant influence on voting behavior, which can make a difference in tight races. Polls show that voters who are aware of Proposition 30 are split close to 50/50.
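
For concreteness, the MIT team’s recursive incentive reportedly worked as follows: the finder of a balloon received $2,000, the person who recruited the finder $1,000, that person’s recruiter $500, and so on, halving at each level of the referral chain. A tiny sketch of that payout rule (ours, purely illustrative):

// Payouts along a referral chain, starting from the finder and halving upward.
def payouts(chainLength: Int, base: Double = 2000.0): Seq[Double] =
  (0 until chainLength).map(level => base / math.pow(2, level))

println(payouts(4))  // Vector(2000.0, 1000.0, 500.0, 250.0)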

Can an intangible incentive structure be designed to rapidly mobilize citizens to learn about a pressing issue? Working with the AMP Lab and the CITRIS Data and Democracy Initiative, we are exploring this question with a project that studies how activity spreads across populations.

The website for the Proposition 30 Awareness Project allows visitors to register their awareness, then receive a custom web link to share with their friends and family using email, Facebook or Twitter. Visitors can return at any time to monitor their growing influence graph and track their influence score. Influence is computed using a variant of the Kleinberg and Raghavan algorithm.

This experiment is the first step toward a general-purpose tool that will allow citizens to initiate their own awareness campaigns on any issue. The first example emphasizes awareness of Proposition 30 and does not advocate a position. It includes links to the California Voters Guide and to campaigns on both sides of the issue. Visitors may also indicate their position for or against the proposition and join an online discussion afterward.

We welcome you to participate in the study by visiting: http://tinyurl.com/prop30p

Getting Hands-On With Big Data

Patrick Wendell

One of my favorite things about working in the AMPLab is our focus on extending impact beyond the academic community and into the “real world”. This coming week, we will host the first UC Berkeley AMP Camp, a practitioner’s guide to cutting-edge large scale data management. The AMP Camp will feature hands-on tutorials on big data analysis using the AMPLab software stack, including Spark, Shark, and Mesos. These tools work hand-in-hand with technologies like Hadoop to provide high performance, low latency data analysis. AMP Camp will also include high level overviews of warehouse scale computing, presentations on several big data use-cases, and talks on related projects in the lab (see the full agenda).

AMP Camp will be live streamed directly from the UC Berkeley campus on Tuesday, August 21st and Wednesday, August 22nd. Participants will be able to engage in real-time tutorials and interact with instructors directly through Piazza. Attending the live-streamed AMP Camp is completely free and requires only a brief registration. Around 200 attendees will also participate on-site (unfortunately, on-site registration has sold out!).

Modern “big data” infrastructure increasingly consists of open-source projects with steep learning curves. Hands-on instruction is a critical prerequisite for understanding how to use these technologies and tie them together to extract value from massive data sets. Through AMP Camp and our broader community development initiatives, the AMPLab hopes to accelerate adoption of our Berkeley Data Analytics Stack (BDAS) and educate the community at large about data management.

We hope you all register for the AMP Camp live stream – see you on the 21st!

Carat Now Available for iOS and Android

Adam Oliner

Carat aims to detect energy bugs—app behavior that is consuming energy unnecessarily—using data collected from a community of mobile devices. As featured on TechCrunch, our mobile app generates personalized recommendations for improving battery life and is now available for iOS and Android devices.

Users install and run the Carat app on their mobile phone, tablet, or music player. The app intermittently takes measurements about the device and how it is being used. These measurements are sent to our servers, which perform a statistical analysis. The analysis, a Spark application running on Amazon Web Services, infers how devices are using energy and under what circumstances. After considering a few days’ worth of data, Carat sends personalized results back to users in the form of actionable advice about how to improve their battery life.
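
As a rough illustration of the kind of aggregation involved, here is a simplified Spark sketch (not Carat’s actual pipeline; the Sample fields, file path, and CSV format are made up, and sc is an existing SparkContext):

import spark.SparkContext._

case class Sample(appName: String, drainPerHour: Double)

// Load the community's measurement samples.
val samples = sc.textFile("hdfs://namenode.example.com/carat/samples.csv").map { line =>
  val Array(app, drain) = line.split(",")
  Sample(app, drain.toDouble)
}.cache()

// Average battery drain rate across all samples.
val overallAvg = samples.map(_.drainPerHour).reduce(_ + _) / samples.count()

// Average drain rate per app; apps far above the overall average are
// candidates for energy-bug or "energy hog" reports.
val perApp = samples.map(s => (s.appName, (s.drainPerHour, 1)))
  .reduceByKey { case ((d1, n1), (d2, n2)) => (d1 + d2, n1 + n2) }
  .mapValues { case (total, n) => total / n }
perApp.filter { case (_, avg) => avg > 2 * overallAvg }.collect().foreach(println)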

Using data from our beta deployments, we have already identified thousands of energy bugs manifesting in the wild. Carat not only detects such misbehavior but assists diagnosis by correlating it with device features like the operating system version and which resources are available (e.g., WiFi or GPS). Some of these bugs cost afflicted users hours of battery life, even compared to other users of the same app for whom the bug did not trigger. Users were told what misbehavior Carat found on their devices and how to fix it, such as by restarting the app or upgrading the OS.

We are excited to finally be live on both Apple’s App Store and Google’s Play Store. Please try out the app, leave positive ratings on the respective stores to help us spread the word, and send us any comments or questions. Thank you!

Shark Jumps Big Data (at SIGMOD)

Reynold Xin

Given all the excitement around Big Data recently, it’s not surprising that some analysts and bloggers have begun claiming that the Big Data meme has “Jumped the Shark”. In the AMPLab we believe this is far from true; in fact, we have exactly the opposite perspective – namely, that there remains a need for new data analytics frameworks that can scale to massive data sizes. For this reason, we have developed the Shark system for query processing on Big Data.

Shark (whose name comes from the combination of Spark and Hive) is a data warehouse system that is compatible with Apache Hive. Building on the AMPLab’s Spark system, Shark provides two chief advantages over Hive: performance gains, through both smart query optimization and caching of data in a cluster’s memory, and integration with Spark for iterative machine learning.

We will be doing a demo of the Shark system on a 100-node EC2 cluster this week at the SIGMOD conference in Phoenix.  In the demo we will interactively mine 16 months of Wikipedia hourly traffic logs.  If you are attending the conference, please drop by and say “Hi!”  (Update 5/25/12: Shark Wins Best Demo Award at SIGMOD 2012)

Performance

Analytical queries usually focus on a particular subset or time window. For example, a query might run over HTTP logs from the previous month, touching only the (small) dimension tables and a small portion of the fact table. These queries exhibit strong temporal locality, and in many cases, it is plausible to fit the working set into a cluster’s memory. Shark allows users to exploit this temporal locality by caching their working set of data, or in database terms, to create in-memory materialized views. Common data types can be cached in a columnar format (as arrays of Java primitives), which has little storage overhead and is very efficient for Java garbage collection, yet provides maximum performance (an order of magnitude faster than reading data from disk).
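
As a sketch of what this looks like in practice, we assume Shark’s convention that tables whose names end in _cached are kept in cluster memory (table and column names below are made up), and we use the sql2rdd API introduced later in this post for both statements; in your own setup the entry point (for example, a Shark command line) may differ.

// Materialize last month's working set as an in-memory, columnar table.
sql2rdd("CREATE TABLE logs_last_month_cached AS " +
        "SELECT * FROM http_logs WHERE log_date >= '2012-09-15'")

// Subsequent queries over the cached table are answered from cluster memory.
val topPages = sql2rdd("SELECT page_url, count(*) AS hits FROM logs_last_month_cached " +
                       "GROUP BY page_url ORDER BY hits DESC LIMIT 10")
println(topPages.count)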

Additionally, Shark employs a number of optimization techniques, such as limit pushdowns and hash-based shuffle, which can provide significant speedups in query processing.

Integration with Spark for Machine Learning

Consider the following program to run k-means clustering in Shark:

// k-means implemented in Spark.
// Point is assumed to provide random(), a distance function, and + and / operators.
def kmeans(points: RDD[Point], k: Int) = {
  // Initialize the centroids randomly.
  var centroids = (0 until k).map(i => (i, Point.random())).toMap
  // Index of the centroid closest to a given point.
  def closestCentroid(p: Point) = centroids.minBy { case (_, c) => p.distanceTo(c) }._1
  for (iter <- 1 to 10) {
    // Assign each point to its closest centroid, then recompute each
    // centroid as the mean of the points assigned to it.
    centroids = points.groupBy(p => closestCentroid(p))
      .map { case (id, pts) => (id, pts.reduce(_ + _) / pts.size) }
      .collectAsMap().toMap
  }
  centroids
}

// Use SQL to select the “young” users
val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
println(youngUsers.count)
// extractFeatures is a user-supplied function that maps a table row to a Point.
val featureMatrix = youngUsers.mapRows(extractFeatures)
kmeans(featureMatrix, k = 10)

We allow users to exploit Spark’s inherent efficiency at iterative algorithms by providing a set of APIs that turns a SQL query into an RDD (a collection of records distributed in a cluster’s memory). Using this API, users can use SQL to explore data while expressing more sophisticated machine learning algorithms in Spark. These machine learning algorithms run on the same set of distributed workers and can share the same distributed memory as the query processor. This enables much more efficient data pipelines and provides a unified system for data analysis using both SQL and sophisticated statistical learning functions.

Compatibility with Apache Hive

Shark achieves its compatibility with Hive by reusing Hive code as much as possible. It reuses Hive’s query language, metastore, serializers and deserializers, and user-defined function interfaces. It takes the logical query plan generated by the Hive parser and swaps out the execution engine: whereas Hive translates that plan into Hadoop MapReduce jobs, Shark generates its own physical plan consisting of operators written using Spark. This means that, using Shark, you can run your existing Hive QL queries on your Hive warehouse without modifying the data, while enjoying the benefits of caching, better optimization, and the Spark API for machine learning.

Again, we will be showcasing Shark at SIGMOD. Please check it out!

An NLP library for Matlab

faridani

Matlab is a great language for prototyping ideas. It comes with many libraries, especially for machine learning and statistics. But when it comes to processing natural language, Matlab is extremely slow. Because of this, many researchers use other languages to pre-process the text, convert it to numerical data, and then bring the resulting data into Matlab for further analysis.

I used to use Java for this. I would usually tokenize the text with Java, then save the resulting matrices to disk and read them in Matlab. After a while this procedure became cumbersome: I had to go back and forth between Java and Matlab, the process was prone to human error, and the codebase just looked ugly.

Recently, together with Jason Chen, I started putting together an NLP toolbox for Matlab. It is still a work in progress, but you can download the latest version from our github repository [link]. There is also an installation guide that helps you properly install it on your machine. I have built a simple map-reduce tool that allows you to utilize all of the cores on your CPU for many functions.

So far the toolbox has modules for text tokenization (Bernoulli, multinomial, tf-idf, and n-gram tools), text preprocessing (stop word removal, text cleaning, stemming), and some learning algorithms (linear regression, decision trees, support vector machines, and a naive Bayes classifier). We have also implemented evaluation metrics (precision, recall, F1 score, and MSE). The support vector machine tool is a wrapper around the famous LibSVM, and we are working on another wrapper for SVM-light. A part-of-speech tagger is coming very soon too.

I have been focusing on getting different parts running efficiently. For example, the tokenizer uses Matlab’s native hash map data structure (containers.Map) to pass over the corpus and tokenize it efficiently.

We are also adding examples and demos for this toolbox. The first example is a sentiment analysis tool that uses this library to predict whether a movie review is positive or negative. The code reaches an F1 score of 0.83; on a set of 200 movie reviews, it misclassified only 26 of them (87% accuracy).

Please try the toolbox, and note that it is still a work in progress: some functions are still slow, and we are working to improve them. I would love to hear what you think. If you want us to implement something that might be useful to you, just let us know.

Connecting Big Data around the World

Tim Kraska

The internet enables millions of users world-wide to create, modify, and share data through platforms like Twitter, Facebook, GMail, and many other services, generating gigantic data sets. This world-wide data access requires replicating the data across multiple data centers, not only to bring the data closer to the user for shorter latencies, but also to increase availability in case of a data center failure.
However, keeping replicas synchronized and consistent, so that a user’s data is never lost and always up to date, is expensive. Inter-data center network delays are in the hundreds of milliseconds and vary significantly. Synchronous wide-area replication with strong consistency has therefore been assumed infeasible for interactive applications, and current solutions either settle for asynchronous replication, which carries the risk of losing data in the event of failures, or for relaxed consistency, which can, for example, cause updates to appear and disappear from the application in an unpredictable fashion.

With MDCC (Multi-Data Center Consistency), we describe the first synchronous replication protocol that requires neither a master nor static partitioning and that provides strong consistency at a cost similar to eventually consistent protocols, using only a single round trip across data centers to apply an update in the normal operational case. That is, users not only get faster response times because data is located close to them; they also always experience the same consistency and application behavior, even in the presence of major failures. We further propose a new programming model that empowers application developers to handle the longer, less predictable latencies caused by inter-data center communication more effectively. Our evaluation uses the TPC-W benchmark, which simulates an Amazon-like web shop, with MDCC deployed across 5 geographically diverse data centers. It shows that MDCC achieves transaction throughput and latency similar to eventually consistent quorum protocols and sustains a data-center outage without a significant impact on response times, all while guaranteeing strong consistency.

For more information, please visit our MDCC website.

MDCC was developed by Tim Kraska, Gene Pang, Mike Franklin and Samuel Madden.

News from the First Two Spark User Meetups

Matei Zaharia

One of the neat things about doing research in big data has always been the strong open source culture in this field — many of the widely used software projects are open source, and if you release a new algorithm or tool, there’s a chance that someone will use it. About a year ago, we started making open source releases of Spark, one of the first parallel data processing frameworks in the AMP Lab stack. Spark promises 10-20x faster performance than existing tools thanks to its ability to perform computations in memory, as well as an easy-to-use programming interface in the Scala language.
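
For readers who haven’t tried it, here is a small taste of that interface: a sketch in which sc is an existing SparkContext and the HDFS path is illustrative.

// Load a log file once, keep it in cluster memory, and reuse it across queries.
val logs = sc.textFile("hdfs://namenode.example.com/logs/app.log").cache()
println(logs.filter(_.contains("ERROR")).count())   // first pass reads from disk and caches
println(logs.filter(_.contains("Timeout")).count()) // later passes are served from memory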

Ten months later, we’re excited to see how far the Spark community has grown. In particular, it has passed the threshold where we knew each user personally and reached a point where we primarily get questions, code contributions, and feature requests from users we don’t know. Most awesome are the ones that start with “I tried Spark and it rocked” before asking a question. This is great not just because it improves the software, but because it has often let us discover new use cases that we didn’t anticipate and has led to new research. For example, the Orchestra work at SIGCOMM 2011 was partly motivated by users running machine learning algorithms that required large broadcasts.

To keep in touch with the community, we’ve started hosting a regular Spark User Meetup, which is a mix of tutorials, presentations of upcoming features, presentations from users, and Q&A. The first two meetups were held at Klout and Conviva. I wanted to give a quick summary of the topics presented for those who missed them:

January Meetup

  • Matei Zaharia from Berkeley talked about the goals of the Spark project and gave a tutorial on how to set it up locally or on Amazon EC2. The goal was to show people where the various new features we are developing fit in and where to find help on how to use what’s already there. The slides are available online (PPTX).
  • Karthik Thiyagarajan from Quantifind explained how they are starting to use Spark for realtime exploration of time series data. Quantifind is a startup that offers predictive analytics — identifying trends from time series to make decisions. They use Spark as an almost realtime database service, where new data is ingested periodically and can be queried interactively from a web interface. Check out Karthik’s slides for more details. This is one of the use cases we’re really interested in exploring further.

February Meetup

  • This was the first unveiling of Shark, a port of Apache Hive onto Spark that we are developing. Hive is a popular large-scale data warehouse that provides a SQL interface for running queries on Hadoop MapReduce. With Shark, we can run the same queries over cached in-memory data in Spark, leading to up to 10x better performance for interactive data mining. One neat thing about the port is that it’s backwards-compatible with Hive, using the same language, user-defined functions, and metadata store, so it can run seamlessly on existing Hive data. Cliff Engle from the Shark team gave a talk (PPTX) on Shark’s design and some initial results.
  • Dilip Joseph from Conviva talked about their use of Spark for a variety of reporting and analytics applications. Conviva provides video streaming optimization and management systems that need to deliver high-quality live video to thousands of concurrent viewers. They use Spark in combination with Hadoop and Hive to analyze the large sets of resulting logs, compute statistics over the data, and identify problems or optimization opportunities. They were some of the first production users of Spark, and today, they run 30% of their reports in Spark. Check out Dilip’s blog post on the Conviva engineering blog for more details.

Both meetups had full rooms, with over 40 people attending, which is great. We’re hoping to hold the next meetup in the first or second week of April. Please sign up for the meetup.com group if you’re interested.

In other Spark news, a demo paper on Shark, the Hive-on-Spark port we are developing, was accepted at the SIGMOD conference, while a paper on Spark itself will appear at NSDI. Additionally, the Mesos cluster manager developed in our lab, which Spark runs over, called its first Apache release vote today to release version 0.9, a major milestone that contains new usability, fault tolerance, and stability features developed at Twitter. Finally, Spark and Shark talks were accepted at Scala Days 2012 (in London, England) and the Hadoop Summit (in Sunnyvale, CA). If you’re at those conferences and want to learn more about Spark, please drop by!