Technical Preview of Apache Spark 2.0: Easier, Faster, and Smarter

Reynold Xin

This is a guest blog post originally published on the Databricks blog.

 

For the past few months, we have been busy working on the next major release of the big data open source software we love: Apache Spark 2.0. Since Spark 1.0 came out two years ago, we have heard both praise and complaints. Spark 2.0 builds on what we have learned in the past two years, doubling down on what users love and improving on what users lament. This blog summarizes the three major themes of Spark 2.0: easier, faster, and smarter. Each theme deserves a deep-dive discussion, and we will follow up with in-depth posts over the next few weeks.

Prior to the general release, a technical preview of Apache Spark 2.0 is available on Databricks. This preview package is built using the upstream branch-2.0. Using the preview package is as simple as selecting the “2.0 (branch preview)” version when launching a cluster.

While the final Apache Spark 2.0 release is still a few weeks away, this technical preview is intended to provide early access to the features in Spark 2.0 based on the upstream codebase. This way, you can satisfy your curiosity to try the shiny new toy, while we get feedback and bug reports early, before the final release.

Now, let’s take a look at the new developments.

Spark 2.0: Easier, Faster, Smarter

Easier: SQL and Streamlined APIs

One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, with a focus on two areas: (1) standard SQL support and (2) a unified DataFrame/Dataset API.

On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces to Spark applications, these extended SQL capabilities drastically reduce the effort of porting legacy applications over to Spark.
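
As a hedged illustration of the expanded SQL support, here is a minimal sketch of a scalar subquery, one of the query shapes that previously had to be rewritten by hand (the table and column names are illustrative, and spark refers to the new SparkSession entry point described below):

// A scalar subquery running through Spark 2.0's new ANSI SQL parser.
// The employees table and its columns are hypothetical examples.
val aboveAverage = spark.sql("""
  SELECT name, salary
  FROM employees
  WHERE salary > (SELECT avg(salary) FROM employees)
""")
aboveAverage.show()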

On the programming API side, we have streamlined the APIs:

  • Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. This new combined Dataset interface is also the abstraction used for Structured Streaming. Since Python and R have no compile-time type-safety, the concept of Dataset does not apply to these languages' APIs. Instead, DataFrame remains the primary programming abstraction, analogous to the single-node data frame notion in these languages. Get a peek from a Dataset API notebook, or see the short sketch after this list.
  • SparkSession: a new entry point that replaces the old SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion for Spark is which “context” to use. Now you can use SparkSession, which subsumes both, as a single entry point, as demonstrated in this notebook. Note that the old SQLContext and HiveContext are still kept for backward compatibility.
  • Simpler, more performant Accumulator API: We have designed a new Accumulator API that has a simpler type hierarchy and supports specialization for primitive types. The old Accumulator API has been deprecated but is retained for backward compatibility.
  • DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.
  • Machine learning pipeline persistence: Users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
  • Distributed algorithms in R: Added support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means in R.
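
To make the list above concrete, here is a rough sketch of how SparkSession, the unified Dataset API, and the new accumulator helpers fit together (the file path, column names, and the Person case class are illustrative assumptions, not part of any shipped example):

import org.apache.spark.sql.SparkSession

// SparkSession is the single new entry point, subsuming SQLContext and HiveContext.
val spark = SparkSession.builder()
  .appName("Spark2Preview")
  .getOrCreate()
import spark.implicits._

// In Scala, DataFrame is now just an alias for Dataset[Row]:
// untyped, SQL-like methods...
val df = spark.read.json("/path/to/people.json")
val adults = df.filter("age >= 18").select("name", "age")

// ...and typed methods live on the same Dataset class.
case class Person(name: String, age: Long)
val people = df.as[Person]
val names = people.filter(_.age >= 18).map(_.name)

// The new accumulator API offers simple helpers specialized for primitive types.
val rowCount = spark.sparkContext.longAccumulator("rows seen")
people.foreach(_ => rowCount.add(1))
println(rowCount.value)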

Faster: Spark as a Compiler

According to our 2015 Spark Survey, 91% of users consider performance to be the most important aspect of Spark. As a result, performance optimization has always been a focus in our Spark development. Before we started planning for Spark 2.0, we asked ourselves a question: Spark is already pretty fast, but can we push the boundary and make it 10X faster?

This question led us to fundamentally rethink the way we build Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), the majority of CPU cycles are spent on useless work, such as making virtual function calls or reading and writing intermediate data to CPU cache or memory. Optimizing performance by reducing the number of CPU cycles wasted on this useless work has been a long-time focus of modern compilers.

Spark 2.0 ships with the second generation Tungsten engine. This engine builds upon ideas from modern compilers and MPP databases and applies them to data processing. The main idea is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. We call this technique “whole-stage code generation.”
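
As a quick, hedged illustration (assuming a SparkSession named spark), you can see which operators have been collapsed by inspecting the physical plan; in Spark 2.0, operators that participate in whole-stage code generation appear under a WholeStageCodegen node:

// Inspect the physical plan of a simple aggregation query.
// Operators fused by whole-stage code generation are typically marked with "*" in the plan.
spark.range(1000L * 1000)
  .filter("id > 100")
  .selectExpr("sum(id)")
  .explain()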

To give you a teaser, we measured the time (in nanoseconds) it takes to process a row on one core for some of the operators in Spark 1.6 vs. Spark 2.0; the table below demonstrates the power of the new Tungsten engine. Note that Spark 1.6 already includes an expression code generation technique that is also used in some state-of-the-art commercial databases today. As you can see, many of the core operators become an order of magnitude faster with whole-stage code generation.

You can see the power of whole-stage code generation in action in this notebook, in which we perform aggregations and joins on 1 billion records on a single machine.

cost per row (single thread)

primitive               Spark 1.6   Spark 2.0
filter                  15 ns       1.1 ns
sum w/o group           14 ns       0.9 ns
sum w/ group            79 ns       10.7 ns
hash join               115 ns      4.0 ns
sort (8-bit entropy)    620 ns      5.3 ns
sort (64-bit entropy)   620 ns      40 ns
sort-merge join         750 ns      700 ns
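
If you want a rough feel for these numbers on your own machine, here is a minimal sketch of the kind of single-machine micro-benchmark the notebook runs; the timing helper and row count are illustrative, not the exact notebook code:

// Time two of the primitives from the table above on 1 billion rows.
// spark is a SparkSession; absolute timings depend on your hardware.
val numRows = 1000L * 1000 * 1000

def benchmark(name: String)(body: => Unit): Unit = {
  val start = System.nanoTime()
  body
  println(f"$name: ${(System.nanoTime() - start) / 1e9}%.1f s")
}

benchmark("sum w/o group") {
  spark.range(numRows).selectExpr("sum(id)").collect()
}

benchmark("sum w/ group") {
  spark.range(numRows).selectExpr("id % 100 as k", "id").groupBy("k").sum("id").collect()
}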

How does this new engine work on end-to-end queries? We did some preliminary analysis using TPC-DS queries to compare Spark 1.6 and Spark 2.0:

Figure: preliminary TPC-DS query runtimes, Spark 2.0 vs. Spark 1.6

Beyond whole-stage code generation, a lot of work has also gone into improving the Catalyst optimizer for general query optimizations such as nullability propagation, as well as into a new vectorized Parquet decoder that has improved Parquet scan throughput by 3X.

Smarter: Structured Streaming

Spark Streaming has long led the big data space as one of the first attempts at unifying batch and streaming computation. Its first streaming API, called DStream and introduced in Spark 0.7, offered developers several powerful properties: exactly-once semantics, fault tolerance at scale, and high throughput.

However, after working with hundreds of real-world deployments of Spark Streaming, we found that applications that need to make decisions in real-time often require more than just a streaming engine. They require deep integration of the batch stack and the streaming stack, integration with external storage systems, as well as the ability to cope with changes in business logic. As a result, enterprises want more than just a streaming engine; instead they need a full stack that enables them to develop end-to-end “continuous applications.”

One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.

A number of problems exist with this single model. First, operating on data as it arrives can be very difficult and restrictive. Second, varying data distributions, changing business logic, and delayed data all add unique challenges. And third, most existing systems, such as MySQL or Amazon S3, do not behave like a stream, and many algorithms (including most off-the-shelf machine learning) do not work in a streaming setting.

Spark 2.0’s Structured Streaming API is a novel way to approach streaming. It stems from the realization that the simplest way to compute answers on streams of data is to not have to reason about the fact that the data is a stream. This realization came from our experience with programmers who already know how to program static data sets (aka batches) using Spark’s powerful DataFrame/Dataset API. The vision of Structured Streaming is to utilize the Catalyst optimizer to discover when it is possible to transparently turn a static program into an incremental execution that works on dynamic, infinite data (aka a stream). When data is viewed through this structured lens, as either a discrete table or an infinite table, streaming becomes much simpler.

As the first step towards realizing this vision, Spark 2.0 ships with an initial version of the Structured Streaming API, a (surprisingly small!) extension to the DataFrame/Dataset API. This unification should make adoption easy for existing Spark users, allowing them to leverage their knowledge of the Spark batch API to answer new questions in real time. Key features will include support for event-time-based processing, out-of-order/delayed data, sessionization, and tight integration with non-streaming data sources and sinks.
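
As a hedged sketch of what this looks like in practice (the input directory, schema, and sink are illustrative assumptions), a Structured Streaming query is written with the same DataFrame operations used on static data:

import org.apache.spark.sql.types._

// Describe the schema of the incoming JSON records (hypothetical fields).
val schema = new StructType()
  .add("user", StringType)
  .add("action", StringType)
  .add("time", TimestampType)

// Treat a directory of JSON files as an unbounded input table.
val events = spark.readStream
  .schema(schema)
  .json("/path/to/events")

// The same DataFrame operations used on static data apply unchanged.
val counts = events.groupBy("action").count()

// Continuously maintain the aggregate and expose it as an in-memory table for inspection.
val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("action_counts")
  .start()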

Streaming is clearly a pretty broad topic, so stay tuned to this blog for more details on Structured Streaming in Spark 2.0, including details on what is possible in this release and what is on the roadmap for the near future.

Conclusion

Users initially came to Spark for its ease of use and performance. Spark 2.0 doubles down on both while extending Spark to support an even wider range of workloads. We hope you will enjoy the work we have put in, and we look forward to your feedback.

Of course, until the upstream Apache Spark 2.0 release is finalized, we do not recommend fully migrating any production workload onto this preview package. This technical preview version is now available on Databricks. You can sign up for an account here.

Read More

If you missed our webinar for Spark 2.0: Easier, Faster, and Smarter, you can register and watch the recordings and download slides and attached notebooks.

You can also import the following notebooks and try them on Databricks Community Edition with the Spark 2.0 Technical Preview.

Succinct on Apache Spark: Queries on Compressed RDDs

Rachit Agarwal

tl;dr Succinct is a distributed data store that supports a wide range of point queries (e.g., search, count, range, random access) directly on a compressed representation of the input data. We are very excited to release Succinct as an Apache Spark package that enables search, count, range, and random access queries on compressed RDDs. This release allows users to use Apache Spark as a document store (with search on documents) similar to ElasticSearch, as a key-value interface (with search on values) similar to HyperDex, and as an experimental DataFrame interface (with search along columns in a table). When used as a document store, Apache Spark with Succinct is 2.75x faster than ElasticSearch for search queries while requiring 2.5x less storage, and over 75x faster than native Apache Spark.

Succinct on Apache Spark Overview

Search is becoming an increasingly powerful primitive in big data analytics and web services. Many web services support some form of search, including LinkedIn search, Twitter search, Facebook search, Netflix search, airlines, and hotels, as well as services specifically built around search: Google, Bing, and Yelp, to name a few. Apache Spark supports search via full RDD scans. While fast enough for small datasets, data scans become inefficient as datasets grow even moderately large. One way to avoid data scans is to build indexes, but indexes can significantly increase the memory overhead.

We are very excited to announce the release of Succinct as an Apache Spark package that achieves a unique tradeoff: storage overhead no worse (and often lower) than data-scan-based techniques, and query latency comparable to index-based techniques. Succinct on Apache Spark enables search (and a wide range of other queries) directly on a compressed representation of RDDs. What differentiates Succinct on Apache Spark is that queries are supported without storing any secondary indexes, without data scans, and without data decompression; all the required information is embedded within the compressed RDD, and queries are executed directly on it.

In addition, Succinct on Apache Spark supports random access of records without scanning the entire RDD, a functionality that we believe will significantly speed up a large number of applications.

An example

Consider a collection of Wikipedia articles stored on HDFS as a flat unstructured file. Let us see how Succinct on Apache Spark supports the above functionalities:

// Import relevant Succinct classes
import edu.berkeley.cs.succinct._ 

// Read an RDD as a collection of articles; sc is the SparkContext
val articlesRDD = sc.textFile("/path/to/data").map(_.getBytes)

// Compress the input RDD into a Succinct RDD, and persist it in memory
// Note that this is a time-consuming step (usually at 8GB/hour/core) since the data needs to be compressed.
// We are actively working on making this step faster.
val succinctRDD = articlesRDD.succinct.persist()

// SuccinctRDD supports a set of powerful primitives directly on compressed RDD
// Let us start by counting the number of occurrences of "Berkeley" across all Wikipedia articles
val count = succinctRDD.count("Berkeley")

// Now suppose we want to find all offsets in the collection at which “Berkeley” occurs; and 
// create an RDD containing all resulting offsets 
val offsetsRDD = succinctRDD.search("Berkeley")

// Let us look at the first ten results in the above RDD
val offsets = offsetsRDD.take(10)

// Finally, let us extract 20 bytes before and after one of the occurrences of “Berkeley”
val offset = offsets(0)
val data = succinctRDD.extract(offset - 20, 40)

Many more examples on using Succinct on Apache Spark are outlined here.

Performance

Figure: search query latency for Succinct on Apache Spark, ElasticSearch, and native Apache Spark (log-scale y-axis)

The figure compares the search performance of Apache Spark with Succinct against ElasticSearch and native Apache Spark. We use a 40GB collection of Wikipedia documents on a 4-server Amazon EC2 cluster with 120GB of RAM (so that all systems fit in memory). The search queries use words with a varying number of occurrences (1–10,000), distributed uniformly at random across 10 bins (1–1,000, 1,000–2,000, etc.). Note that the y-axis is on a log scale.

Interestingly, Apache Spark with Succinct is roughly 2.75x faster than ElasticSearch, even though ElasticSearch does not incur the overhead of Apache Spark’s job execution and has all the data fit in memory. Succinct achieves this speedup while requiring roughly 2.5x less memory than ElasticSearch (due to compression and to storing no additional indexes)! Succinct on Apache Spark is over two orders of magnitude faster than Apache Spark’s native RDDs because it avoids data scans. Random access on documents shows similar performance gains (with some caveats).

Below, we describe a few interesting use cases for Succinct on Apache Spark, including a number of interfaces exposed in the release. For more details on the release (and Succinct in general), its usage, and benchmark results, please see the Succinct webpage, the NSDI paper, or a more detailed technical report.

Succinct on Apache Spark: Abstractions and use cases

Succinct on Apache Spark exposes three interfaces, each of which may have several interesting use cases. We outline some of them below:

  • SuccinctRDD
    • Interface: Flat (unstructured) files
    • Example application: log analytics
    • Example: one can search across logs (e.g., errors for debugging) or perform random access (e.g., extract logs at certain timestamps); see the sketch after this list.
    • System with similar functionality: Lucene
  • SuccinctKVRDD

    • Interface: Semi-structured data
    • Example application: document stores, key-value stores
    • Example: 
      • (document stores) search across a collection of Wikipedia documents and return all documents that contain, say, string “University of California at Berkeley”. Extract all (or a part of) documents.
      • (key-value stores) search across a set of tweets stored in a key-value store for tweets that contain “Succinct”. Extract all tweets from the user “_ragarwal_”.
    • System with similar functionality: ElasticSearch
  • (An experimental) DataFrame interface
    • Interface: Search and random access on structured data like tables
    • Example applications: point queries on columnar stores
    • Example: given a table with schema {userID, location, date-of-birth, salary, ..}, find all users who were born between 1980 and 1985.
    • Caveat: we are currently working on some very exciting projects to support a number of additional SQL operators efficiently and directly on compressed RDDs.
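
As a brief sketch of the log-analytics use case, using only the SuccinctRDD calls shown in the example above (the log path and query strings are illustrative):

import edu.berkeley.cs.succinct._

// Build a compressed SuccinctRDD from raw log lines; sc is the SparkContext.
val logsRDD = sc.textFile("/path/to/logs").map(_.getBytes)
val succinctLogs = logsRDD.succinct.persist()

// Search the compressed logs for error records, without decompressing or scanning.
val errorOffsets = succinctLogs.search("ERROR")
println(s"Matches: ${errorOffsets.count()}")

// Random access: extract 100 bytes of context starting at the first match.
val firstOffset = errorOffsets.take(1).head
val context = new String(succinctLogs.extract(firstOffset, 100))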

When not to use Succinct on Apache Spark

A few workloads are not well suited to Succinct on Apache Spark: long sequential reads, and searches for strings that occur very frequently (you may not want to search for “a” or “the”). We outline the associated tradeoffs on the Succinct webpage as well.

Looking Ahead

We at AMPLab are working on several interesting projects to make Succinct on Apache Spark more memory-efficient, faster, and more expressive. To give you an idea of what is next, we are going to close this post with a hint about our next post: executing regular expression queries directly on compressed RDDs. Stay tuned!

The BerkeleyX XSeries on Big Data is Complete!

Anthony

This is a follow-up post to our earlier posts about two freely available Massive Open Online Courses (MOOCs) we offered this summer as part of the BerkeleyX Big Data XSeries. The courses were the result of a collaboration between professors Anthony D. Joseph (UC Berkeley) and Ameet Talwalkar (UCLA) with generous sponsorship by Databricks. We are pleased to report that we have completed the first successful runs of both courses, with highly positive student feedback along with enrollment, engagement, and completion rates that are two to five times the averages for MOOCs.

The first course, CS100.1x Introduction to Big Data with Apache Spark, introduced nearly 76,000 students to data science concepts and showed them how to use Spark to perform large-scale analyses through hands-on programming exercises with real-world datasets. Over 35% of the students were active in the course and the course completion rate was 11%. As an alternative to Honor Code completion certificates, we offered a $50 ID Verified Certificate option and more than 4% of the students chose this option. Over 83% of students enrolled in the Verified Certificate option completed the course.

The second course, CS190.1x Scalable Machine Learning, leveraged Spark to introduce students to the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines. This course also had notably high enrollment (50K), completion rate (15%), percentage of Verified Certificate students (6%), and completion rate for verified students (88%). Overall, 1,800 students earned verified certificates for both courses and received a BerkeleyX Big Data XSeries Certificate.  

Stay tuned for announcements about future runs of these and follow-up MOOCs!

BerkeleyX Data Science on Apache Spark MOOC starts today

Anthony

For the past several months, we have been working to produce two freely available Massive Open Online Courses (MOOCs). We are proud to announce that both MOOCs are launching this month on the BerkeleyX platform!

Today we launched the first course, CS100.1x, Introduction to Big Data with Apache Spark, a brand new five-week long course on Big Data, Data Science, and Apache Spark with nearly 57,000 students (UC Berkeley’s 2014 enrollment was 37,581 students).

The first course teaches students about Apache Spark and how to perform data analysis. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that use real-world data to teach students how to manipulate datasets using parallel processing with PySpark.

The second course, called Scalable Machine Learning, will begin on June 29th. It will introduce the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and will provide hands-on experience using Spark.

We would also like to thank the Spark community for their support.  Several community members are serving as teaching assistants and beta testers, and multiple study groups have been organized by community members in anticipation of these courses.

Both courses are available for free on the edX website, and you can sign up for them today:

  1. Introduction to Big Data with Apache Spark
  2. Scalable Machine Learning

For students who complete a course, the courses offer a choice of free Honor Code completion certificates or paid edX ID-Verified Certificates.

The courses are sponsored in part by the AMPLab and Databricks.

Spark 0.6.0 Released

Matei Zaharia

I’m happy to announce that the next major release of Spark, 0.6.0, is now available. Spark is a fast cluster computing engine developed at the AMP Lab that can run 30x faster than Hadoop using in-memory computing. This is the biggest Spark release to date in terms of features, as well as the biggest in terms of contributors, with over a dozen new contributors from Berkeley and outside. Apart from the visible features, such as a standalone deploy mode and Java API, it includes a significant rearchitecting of Spark under the hood that provides up to 2x faster network performance and support for even lower-latency jobs.

The major focus points in this release have been accessibility (making Spark easier to deploy and use) and performance. The full release notes are posted online, but here are some highlights:

  • Simpler deployment: Spark now has a pure-Java standalone deploy mode that lets it run without an external cluster manager, as well as experimental support for running on YARN (Hadoop NextGen).
  • Java API: exposes all of Spark’s features to Java developers in a clean manner.
  • Expanded documentation: a new documentation site, http://spark-project.org/docs/0.6.0/, contains significantly expanded docs, such as a quick start guide, tuning guide, configuration guide, and detailed Scaladoc help.
  • Engine enhancements: a new, custom communication layer and storage manager based on Java NIO provide improved performance for network-heavy operations.
  • Debugging enhancements: Spark’s logs now indicate which line of your code each operation corresponds to.

As mentioned above, this release is also the work of an unprecedentedly large set of developers. Here are some of the people who contributed to Spark 0.6:

  • Tathagata Das contributed the new communication layer, and parts of the storage layer.
  • Haoyuan Li contributed the new storage manager.
  • Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.
  • Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.
  • Josh Rosen contributed the Java API, as well as several bug fixes.
  • Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.
  • Reynold Xin contributed numerous bug and performance fixes.
  • Imran Rashid contributed the new Accumulable class.
  • Harvey Feng contributed improvements to shuffle operations.
  • Shivaram Venkataraman improved Spark’s memory estimation and wrote a memory tuning guide.
  • Ravi Pandya contributed Spark run scripts for Windows.
  • Mosharaf Chowdhury provided several fixes to broadcast.
  • Henry Milner pointed out several bugs in sampling algorithms.
  • Ray Racine provided improvements to the EC2 scripts.
  • Paul Ruan and Bill Zhao helped with testing.

We’re very proud of this release, and hope that you enjoy it. You can grab the code at http://www.spark-project.org/release-0.6.0.html.

Shark Jumps Big Data (at SIGMOD)

Reynold Xin

Given all the excitement around Big Data recently, it’s not surprising that some analysts and bloggers have begun claiming that the Big Data meme has “Jumped the Shark”. In the AMPLab we believe this is far from true; in fact, we have exactly the opposite perspective, namely that there remains a need for new data analytics frameworks that can scale to massive data sizes. For this reason, we have developed the Shark system for query processing on Big Data.

Shark (whose name comes from the combination of Spark and Hive) is a data warehouse system that is compatible with Apache Hive. Building on AMPLab’s Spark system, Shark provides two chief advantages over Hive: performance gains through both smart query optimization and caching of data in a cluster’s memory, and integration with Spark for iterative machine learning.

We will be doing a demo of the Shark system on a 100-node EC2 cluster this week at the SIGMOD conference in Phoenix.  In the demo we will interactively mine 16 months of Wikipedia hourly traffic logs.  If you are attending the conference, please drop by and say “Hi!”  (Update 5/25/12: Shark Wins Best Demo Award at SIGMOD 2012)

Performance

Analytical queries usually focus on a particular subset or time window. For example, a query might run over HTTP logs from the previous month, touching only the (small) dimension tables and a small portion of the fact table. These queries exhibit strong temporal locality, and in many cases it is plausible to fit the working set into a cluster’s memory. Shark allows users to exploit this temporal locality by caching their working set of data, or, in database terms, by creating in-memory materialized views. Common data types can be cached in a columnar format (as arrays of Java primitives), which has little storage overhead, is very efficient for Java garbage collection, and provides maximum performance (an order of magnitude faster than reading data from disk).

Additionally, Shark employs a number of optimization techniques such as limit push downs and hash-based shuffle, which can provide significant speedups in query processing.

Integration with Spark for Machine Learning

Consider the following program to run k-means clustering in Shark:

// k-means clustering implemented in Spark.
// Point and closestCentroid are assumed to be defined elsewhere.
def kmeans(points: RDD[Point], k: Int): Map[Int, Point] = {
  // Initialize the centroids randomly.
  var centroids = (0 until k).map(i => (i, Point.random())).toMap
  // Run a fixed number of iterations.
  for (iter <- 1 to 10) {
    // Assign each point to its closest centroid, then recompute each
    // centroid as the mean of the points assigned to it.
    centroids = points.groupBy(p => closestCentroid(p, centroids))
      .map { case (id, pts) => (id, pts.reduce(_ + _) / pts.size) }
      .collect().toMap
  }
  centroids
}

// Use SQL to select the "young" users
val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
println(youngUsers.count)
val featureMatrix = youngUsers.mapRows(extractFeatures)
kmeans(featureMatrix, 10)  // e.g. cluster into 10 groups

We allow users to exploit Spark’s inherent efficiency at iterative algorithms by providing a set of APIs that turns a SQL query into an RDD (a collection of records distributed in a cluster’s memory). Using this API, users can use SQL to explore data while expressing more sophisticated machine learning algorithms in Spark. These machine learning algorithms run on the same set of distributed workers and can share the same distributed memory as the query processor. This enables much more efficient data pipelines and provides a unified system for data analysis using both SQL and sophisticated statistical learning functions.

Compatibility with Apache Hive

Shark achieves its compatibility with Hive by reusing Hive code as much as possible. It reuses Hive’s query language, metastore, serializers and deserializers, and user-defined function interfaces. It takes the logical query plan generated by the Hive parser and swaps out the execution engine: whereas Hive executes the plan using Hadoop MapReduce, Shark generates its own physical plan consisting of operators written using Spark. This means that with Shark, you can run your existing Hive QL queries on your Hive warehouse without modifying the data, while enjoying the benefits of caching, better optimization, and the Spark API for machine learning.

Again, we will be showcasing Shark at SIGMOD. Please check it out!

News from the First Two Spark User Meetups

Matei Zaharia

One of the neat things about doing research in big data has always been the strong open source culture in this field — many of the widely used software projects are open source, and if you release a new algorithm or tool, there’s a chance that someone will use it. About a year ago, we started making open source releases of Spark, one of the first parallel data processing frameworks in the AMP Lab stack. Spark promises 10-20x faster performance than existing tools thanks to its ability to perform computations in memory, as well as an easy-to-use programming interface in the Scala language.

Ten months later, we’re excited to see how far the Spark community has grown. In particular, it has passed the threshold where we knew each user personally and reached a point where we primarily get questions, code contributions, and feature requests from users we don’t know. The most awesome of these are the ones that start with “I tried Spark and it rocked” before asking a question. This is great not just because it improves the software, but because it has often let us discover new use cases that we didn’t anticipate and has led to new research. For example, the Orchestra work at SIGCOMM 2011 was partly motivated by users running machine learning algorithms that required large broadcasts.

To keep in touch with the community, we’ve started hosting a regular Spark User Meetup, which is a mix of tutorials, presentations of upcoming features, presentations from users, and Q&A. The first two meetups were held at Klout and Conviva. I wanted to give a quick summary of the topics presented for those who missed them:

January Meetup

  • Matei Zaharia from Berkeley talked about the goals of the Spark project and gave a tutorial on how to set it up locally or on Amazon EC2. The goal was to show people where the various new features we are developing fit in and where to find help on how to use what’s already there. The slides are available online (PPTX).
  • Karthik Thiyagarajan from Quantifind explained how they are starting to use Spark for realtime exploration of time series data. Quantifind is a startup that offers predictive analytics — identifying trends from time series to make decisions. They use Spark as an almost realtime database service, where new data is ingested periodically and can be queried interactively from a web interface. Check out Karthik’s slides for more details. This is one of the use cases we’re really interested in exploring further.

February Meetup

  • This was the first unveiling of Shark, a port of Apache Hive onto Spark that we are developing. Hive is a popular large-scale data warehouse that provides a SQL interface for running queries on Hadoop MapReduce. With Shark, we can run the same queries over cached in-memory data in Spark, leading to up to 10x better performance for interactive data mining. One neat thing about the port is that it’s backwards-compatible with Hive, using the same language, user-defined functions, and metadata store, so it can run seamlessly on existing Hive data. Cliff Engle from the Shark team gave a talk (PPTX) on Shark’s design and some initial results.
  • Dilip Joseph from Conviva talked about their use of Spark for a variety of reporting and analytics applications. Conviva provides video streaming optimization and management systems that need to deliver high-quality live video to thousands of concurrent viewers. They use Spark in combination with Hadoop and Hive to analyze the large sets of resulting logs, compute statistics over the data, and identify problems or optimization opportunities. They were some of the first production users of Spark, and today, they run 30% of their reports in Spark. Check out Dilip’s blog post on the Conviva engineering blog for more details.

Both meetups had full rooms, with over 40 people attending, which is great. We’re hoping to hold the next meetup in the first or second week of April. Please sign up for the meetup.com group if you’re interested.

In other Spark news, a demo paper on Shark, the Hive-on-Spark port we are developing, was accepted at the SIGMOD conference, while a paper on Spark itself will appear at NSDI. Additionally, the Mesos cluster manager developed in our lab, which Spark runs over, called its first Apache release vote today to release version 0.9, a major milestone that contains new usability, fault tolerance, and stability features developed at Twitter. Finally, Spark and Shark talks were accepted at Scala Days 2012 (in London, England) and the Hadoop Summit (in Sunnyvale, CA). If you’re at those conferences and want to learn more about Spark, please drop by!