The BerkeleyX XSeries on Big Data is Complete!


This is a follow-up post to our earlier posts about two freely available Massive Open Online Courses (MOOCs) we offered this summer as part of the BerkeleyX Big Data XSeries. The courses were the result of a collaboration between professors Anthony D. Joseph (UC Berkeley) and Ameet Talwalkar (UCLA), with generous sponsorship by Databricks. We are pleased to report that we have completed the first successful runs of both courses, with highly positive student feedback along with enrollment, engagement, and completion rates that are two to five times the averages for MOOCs.

The first course, CS100.1x Introduction to Big Data with Apache Spark, introduced nearly 76,000 students to data science concepts and showed them how to use Spark to perform large-scale analyses through hands-on programming exercises with real-world datasets. Over 35% of the students were active in the course and the course completion rate was 11%. As an alternative to Honor Code completion certificates, we offered a $50 ID Verified Certificate option and more than 4% of the students chose this option. Over 83% of students enrolled in the Verified Certificate option completed the course.

The second course, CS190.1x Scalable Machine Learning, leveraged Spark to introduce students to the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines. This course also had notably high enrollment (50K), completion rate (15%), percentage of Verified Certificate students (6%), and completion rate for verified students (88%). Overall, 1,800 students earned verified certificates for both courses and received a BerkeleyX Big Data XSeries Certificate.  

Stay tuned for announcements about future runs of these and follow-up MOOCs!

AMP Camp 5

Ameet Talwalkar
AMP Camp 5 was a huge success! Over 200 people participated in this sold-out event, our live stream drew over 1,800 views from over 40 countries, and we received overwhelmingly positive feedback. In addition to learning about the Berkeley Data Analytics Stack (BDAS), participants were particularly interested in interacting with many of the lead developers of the BDAS software projects, who gave talks about their work and also served as teaching assistants during the hands-on exercises.


This 2-day event provided participants with hands-on experience using BDAS, the set of open-source projects including Spark, Spark SQL, GraphX, and MLlib/MLbase. For the fifth installment of AMP Camp, we expanded the curriculum to include the newest open-source BDAS projects, including Tachyon, SparkR, ML Pipelines, and ADAM, as well as a variety of research and use-case talks.



Details about AMP Camp 5, including slides and videos from the talks as well as all of the training material for the hands-on exercises, are available on the AMP Camp website.


Shark Jumps Big Data (at SIGMOD)

Reynold Xin

Given all the excitement around Big Data recently, it’s not surprising that some analysts and bloggers have begun claiming that the Big Data meme has “Jumped the Shark”. In the AMPLab we believe this is far from true; in fact, we have exactly the opposite perspective, namely, that there remains a need for new data analytics frameworks that can scale to massive data sizes. For this reason, we have developed the Shark system for query processing on Big Data.

Shark (whose name comes from the combination of Spark and Hive) is a data warehouse system that is compatible with Apache Hive. Building on AMPLab’s Spark system, Shark provides two chief advantages over Hive: performance gains, through both smart query optimization and caching data in a cluster’s memory, and integration with Spark for iterative machine learning.

We will be doing a demo of the Shark system on a 100-node EC2 cluster this week at the SIGMOD conference in Phoenix.  In the demo we will interactively mine 16 months of Wikipedia hourly traffic logs.  If you are attending the conference, please drop by and say “Hi!”  (Update 5/25/12: Shark Wins Best Demo Award at SIGMOD 2012)


Analytical queries usually focus on a particular subset or time window. For example, a query might run over HTTP logs from the previous month, touching only the (small) dimension tables and a small portion of the fact table. These queries exhibit strong temporal locality, and in many cases it is plausible to fit the working set into a cluster’s memory. Shark allows users to exploit this temporal locality by caching their working set of data, or in database terms, by creating in-memory materialized views. Common data types can be cached in a columnar format (as arrays of Java primitives), which has little storage overhead, is efficient for Java garbage collection, and provides maximum performance (an order of magnitude faster than reading data from disk).
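To see why a columnar layout as arrays of Java primitives is friendly to the JVM, compare caching the same records row-wise (one heap object per record) versus column-wise (one primitive array per column). The following is an illustrative sketch, not Shark’s actual internals; the record shape and names are made up:

```scala
// Illustrative sketch (not Shark's actual code): caching n records that
// each have an Int field and a Double field.

// Row-oriented layout: one heap object per record, so the garbage
// collector must track n separate objects.
case class Row(userId: Int, latency: Double)
val rows: Array[Row] = Array.tabulate(1000)(i => Row(i, i * 0.5))

// Column-oriented layout: one primitive array per column, so there are
// only two heap objects in total regardless of n, and no per-record boxing.
val userIds: Array[Int] = Array.tabulate(1000)(i => i)
val latencies: Array[Double] = Array.tabulate(1000)(i => i * 0.5)

// Scans over a single column read contiguous primitives.
val avgLatency = latencies.sum / latencies.length
```

The row layout is convenient for point lookups, but for analytical scans the columnar arrays keep garbage-collection pressure nearly constant and make column scans cache-friendly.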

Additionally, Shark employs a number of optimization techniques such as limit push downs and hash-based shuffle, which can provide significant speedups in query processing.
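The idea behind hash-based shuffling can be sketched in a few lines: each map-side record is routed to a reducer bucket by hashing its key, so all records with the same key land in the same partition. The partition count and data below are made up for illustration; Spark’s own HashPartitioner routes keys the same way:

```scala
// Toy sketch of hash partitioning for a shuffle (illustrative only).
val numPartitions = 4

// Route a key to a partition by its hash, normalized to be non-negative.
def partitionFor(key: String): Int =
  (key.hashCode % numPartitions + numPartitions) % numPartitions

// Map-side: bucket records by partition so each reducer fetches one bucket.
val records = Seq(("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4))
val buckets: Map[Int, Seq[(String, Int)]] =
  records.groupBy { case (key, _) => partitionFor(key) }

// All records sharing a key end up in the same bucket, which is what lets
// a reducer aggregate a key without seeing any other partition.
val applePartitions = records.filter(_._1 == "apple")
  .map { case (key, _) => partitionFor(key) }.distinct
```

Because routing is a pure function of the key, no sorting is needed to co-locate equal keys, which is where the speedup over sort-based approaches comes from for aggregations and joins.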

Integration with Spark for Machine Learning

Consider the following program to run k-means clustering in Shark:

// k-means implemented in Spark
def kmeans(points: RDD[Point], k: Int) = {
  // Initialize the centroids randomly.
  val centroids = new HashMap[Int, Point]
  for (i <- 0 until k) centroids(i) = Point.random()
  for (iter <- 1 to 10) {
    // Assign each point to its closest centroid, then recompute each
    // centroid as the mean of the points assigned to it.
    points.groupBy(closestCentroid)
      .map { case (id, pts) => (id, pts.sum / pts.size) }
      .collect().foreach { case (id, c) => centroids(id) = c }
  }
  centroids
}

// Use SQL to select the "young" users and build a feature matrix from them
val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
val featureMatrix = youngUsers.mapRows(extractFeatures)

We allow users to exploit Spark’s inherent efficiency at iterative algorithms by providing a set of APIs that turn a SQL query into an RDD (a collection of records distributed in a cluster’s memory). Using this API, users can explore data with SQL while expressing more sophisticated machine learning algorithms in Spark. These machine learning algorithms run on the same set of distributed workers and can share the same distributed memory as the query processor. This enables much more efficient data pipelines and provides a unified system for data analysis using both SQL and sophisticated statistical learning functions.

Compatibility with Apache Hive

Shark achieves its compatibility with Hive by reusing Hive code as much as possible. It reuses Hive’s query language, metastore, serializers and deserializers, and user-defined function interfaces. It takes the logical query plan generated by the Hive parser and swaps out the execution engine: where Hive’s execution engine uses Hadoop, Shark instead generates its own physical plan consisting of operators written using Spark. This means that with Shark, you can run your existing Hive QL queries on your Hive warehouse without modifying the data, while enjoying the benefits of caching, better optimization, and the Spark API for machine learning.
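The engine-swapping design described above can be sketched as a shared front end behind a common execution interface. This is a hypothetical illustration of the pattern, not Shark’s actual classes or planner output:

```scala
// Illustrative sketch (not Shark's real code): the parser/planner is shared,
// and only the execution backend behind a common interface is swapped.
trait ExecutionEngine {
  def execute(logicalPlan: String): String
}

object HadoopEngine extends ExecutionEngine {
  def execute(logicalPlan: String) = s"MapReduce stages for: $logicalPlan"
}

object SparkEngine extends ExecutionEngine {
  def execute(logicalPlan: String) = s"Spark operators for: $logicalPlan"
}

// The same logical plan (here a stand-in string for what Hive's parser
// would produce) can be handed to either backend unchanged.
def runQuery(engine: ExecutionEngine, hiveQl: String): String = {
  val logicalPlan = s"parsed($hiveQl)"
  engine.execute(logicalPlan)
}
```

Because the query language, metastore, and plan representation stay the same, existing queries and warehouse data need no changes when the backend switches.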

Again, we will be showcasing Shark at SIGMOD. Please check it out!

An AMP Blab about some recent system conferences – Part 1: SOSP 2011


I recently had the pleasure of visiting Portugal for SOSP/SOCC, and New York for Hadoop World. Below are some bits that I found interesting. This is the personal opinion of an AMP Lab grad student – in no way does it represent any official or unanimous AMP Lab position.

Part 1: Symposium on Operating System Principles (SOSP) 2011

A very diverse and high-quality technical program, as expected. You can find the proceedings and talk slides/videos online.

One high-level reaction I have from the conference is that AMP Lab’s observe-analyze-act design loop positions us well to identify emerging technology trends and to design systems with high impact under real-life scenarios. Our industry partnerships would also allow us to address engineering concerns beyond the laboratory, thus expediting bilateral knowledge transfer between academia and industry.

One best-paper award went to “A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications”, authored by our friends from the Univ. of Wisconsin, Professors Andrea and Remzi Arpaci-Dusseau, and their students. The paper did an elaborate study of Apple laptop file traces and found many pathological behaviors. For example, a file “write” actually writes the file multiple times, and a file “open” touches a great number of seemingly unrelated files.

Another best-paper award went to “Cells: A Virtual Mobile Smartphone Architecture” from Columbia. This study proposes and implements “virtual phones”, analogous to virtual machines, for example running a “work phone” and a “home phone” on the same physical device. The talk highlight was a demo of two versions of the Angry Birds game running simultaneously on the same phone.

The audience-choice best presentation award went to “Atlantis: Robust, Extensible Execution Environments for Web Applications”, joint work between MSR and Rutgers. The talk very humorously surveyed the defects of current Internet browsers and proposed an “exokernel browser” architecture in which web applications have the flexibility to define their own execution stack, e.g., markup languages, scripting environments, etc. As expected, the talk catalyzed very entertaining questioning from companies with business interests in the future of web browsers.

Also worthy of highlighting: the session on Security contained three papers; all three had Professor Nickolai Zeldovich on the author list, and all three were of high quality. I have not done a thorough historical search, but I’m sure it’s rare that a single author manages to fill a complete session at SOSP.

There was also a very lively discussion on ACM copyright policies during the SIGOPS working dinner. I personally believe it’s vital that we find policies that balance upholding the quality of research, preserving the strength of the research community, and facilitating the sharing of cutting-edge knowledge and insights.

My own talk on “Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis” went very well. The study performs an empirical analysis of large-scale enterprise storage traces, identifies distinct workloads, and discusses design insights targeted specifically at each workload. The rigorous trace analysis allowed us to identify simple, threshold-based storage-system optimizations with high confidence that the optimizations bring concrete benefits under realistic settings. A big thank-you to everyone at AMP Lab and our co-authors at NetApp for helping me prepare the talk!

Lisbon travel note 1: If history/food is dear to your heart, you will find it worthwhile to visit the Jerónimos Monastery, and try the Pasteis de Nata sold nearby. This is THE authentic egg tart, originated at the Monastery, and very good for a mid-day sugar-high. I had too many – I felt too happy after eating the first 10, lost count of how many more I ate, and skipped lunch and dinner for that day.