Shark Jumps Big Data (at SIGMOD)

Error: Unable to create directory uploads/2024/03. Is its parent directory writable by the server?

Given all the excitement around Big Data recently, it’s not surprising that some analysts and bloggers have begun claiming that the Big Data meme has “Jumped the Shark”.    In the AMPLab we believe this is far from being true and in fact, we have exactly the opposite perspective – namely, that there remains a need for new data analytics frameworks than can scale to massive data sizes.    For this reason, we have developed the Shark system for query processing on Big Data.

Shark (whose name comes from the combination of Spark and Hive) is a data warehouse system that is compatible with Apache Hive.  Building on AMPLab’s Spark system, Shark provides two chief advantages over Hive: performance gains through both smart query optimization and caching data in a cluster’s memory as well as integration with Spark for iterative machine learning.

We will be doing a demo of the Shark system on a 100-node EC2 cluster this week at the SIGMOD conference in Phoenix.  In the demo we will interactively mine 16 months of Wikipedia hourly traffic logs.  If you are attending the conference, please drop by and say “Hi!”  (Update 5/25/12: Shark Wins Best Demo Award at SIGMOD 2012)

Performance

Analytical queries usually focus on a particular subset or time window. For example, a query might run over HTTP logs from the previous month, touching only the (small) dimension tables and a small portion of the fact table. These queries exhibit strong temporal locality, and in many cases, it is plausible to fit the working set into a cluster’s memory. Shark allows users to exploit this temporal locality by caching their working set of data, or in database terms, to create in-memory materialized views. Common data types can be cached in a columnar format (as arrays of Java primitives), which has little storage overhead and very efficient for Java garbage collection, yet provides maximum performance (an order of magnitude faster than reading data from disk).

Additionally, Shark employs a number of optimization techniques such as limit push downs and hash-based shuffle, which can provide significant speedups in query processing.

Integration with Spark for Machine Learning

Consider the following program to run k-means clustering in Shark:

// k-means implemented in Spark
def kmeans(points: RDD[Point], k: Int) = {
  // Initialize the centroids.
  clusters = new HashMap[Int, Point]
  for (i <- 0 until k) centroids(i) = Point.random()
  for (i <- 1 until 10) {
    // Assign points to centroids and update centroids.
    clusters = points.groupBy(closestCentroid)
      .map{ (id, points) => (id, points.sum / points.size)
    }.collectMap()
  }
}

// Use SQL to select the “young” users
val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
println(youngUsers.count)
val featureMatrix = youngUsers.mapRows(extractFeatures)
kmeans(featureMatrix)

We allow users to exploit Spark’s inherent efficiency at iterative algorithms by providing a set of APIs that turns a SQL query into a RDD (a collection of records distributed in a cluster’s memory). Using this API, users can use SQL to explore data, while expressing more sophisticated machine learning algorithms in Spark. These machine learning algorithms run on the same set of distributed workers and can share the same distributed memory as the query processor. This enables much more efficient data pipelines and provides a unified system for data analysis using both SQL and sophisticated statistical learning functions.

Compatibility with Apache Hive

Shark achieves its compatibility with Hive by reusing Hive code as much as possible. It reuses Hive’s query language, metastore, serializers and deserializers, and user-defined function interfaces. It takes the logical query plan generated by the Hive parser and swaps out the execution engine. In Hive’s case, the execution engine uses Hadoop, so Shark instead generates its own physical plan consisting of operators written using Spark. This means that using Shark, you can run your existing Hive QL queries on your Hive warehouse without modifying the data, while enjoying the benefits of caching, better optimization and the Spark API for machine learning.

Again, we will be showcasing Shark at SIGMOD. Please check it out!

An AMP Blab about some recent system conferences – Part 1: SOSP 2011

Error: Unable to create directory uploads/2024/03. Is its parent directory writable by the server?

I recently had the pleasure of visiting Portugal for SOSP/SOCC, and New York for Hadoop World. Below are some bits that I found interesting. This is the personal opinion of an AMP Lab grad student – in no way does it represent any official or unanimous AMP Lab position.

Part 1: Symposium on Operating System Principles (SOSP) 2011

A very diverse and high quality technical program, as expected. You can find the proceedings and talk slides/videos at http://sigops.org/sosp/sosp11/current/index.html.

One high-level reaction I have from the conference is that AMP Lab’s observe-analyze-act design loop position us well to identify emerging technology trends, and design systems with high impact under real life scenarios. Our industry partnerships would also allow us to address engineering concerns beyond the laboratory, thus expedite bilateral knowledge transfer between academia and industry.

One best-paper award went to “A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications”, authored by our friends from Univ. of Wisconsin, Professors Andrea and Remzi Arpaci-Dusseau, as well as their students. The paper did an elaborate study of Apple laptop file traces, and found many pathological behavior. For example, a file “write” actually writes a file multiple times, a file “open” touches a great number of seemingly unrelated files.

Another best-paper award went to “Cells: A Virtual Mobile Smartphone Architecture” from Columbia. This study proposes and implements “virtual phones”, the same idea as virtual machines, for example running a “work phone” and a “home phone” on the same physical device. The talk highlight was a demo of two versions of the Angry Birds game running simultaneously on the same phone.

The audiences-choice best presentation award went to “Atlantis: Robust, Extensible Execution Environments for Web Applications”, a joint work between MSR and Rutgers. The talk very humorously surveyed the defects of current Internet browsers, and proposes an “exokernel browser” architecture in which web applications have the flexibility to define their own execution stack, e.g. markup languages, scripting environments, etc. As expected, the talk catalyzed very entertaining questioning from companies with business interests in the future of web browsers.

Also worthy of highlighting – the session on Security contained three papers, all three have Professor Nickolai Zeldovich on the author list, and all three of high quality. I have not done a thorough historical search, but I’m sure it’s rare that a single author manages to fill a complete session at SOSP.

There was also a very lively discussion on ACM copyright policies during the SIGOPS working dinner. I personally believe it’s vital that we find policies that balances concern about upholding the quality of research, preserving the strength of the research community, and facilitating the sharing of cutting edge knowledge and insights.

My own talk on “Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis” went very well. This is a study that performs an empirical analysis on large scale enterprise storage traces, identify different workloads, and discuss design insights specifically targeted at each workload. The rigorous trace analysis allow us to identify simple, threshold-based storage system optimizations, with high confidence that the optimizations bring concrete benefit under realistic settings. Big thank you to everyone at AMP Lab and our co-authors at NetApp for helping me prepare the talk!

Lisbon travel note 1: If history/food is dear to your heart, you will find it worthwhile to visit the Jerónimos Monastery, and try the Pasteis de Nata sold nearby. This is THE authentic egg tart, originated at the Monastery, and very good for a mid-day sugar-high. I had too many – I felt too happy after eating the first 10, lost count of how many more I ate, and skipped lunch and dinner for that day.