One of the neat things about doing research in big data has always been the strong open source culture in this field — many of the widely used software projects are open source, and if you release a new algorithm or tool, there’s a chance that someone will use it. About a year ago, we started making open source releases of Spark, one of the first parallel data processing frameworks in the AMP Lab stack. Spark promises 10-20x faster performance than existing tools thanks to its ability to perform computations in memory, as well as an easy-to-use programming interface in the Scala language.
Ten months later, we’re excited to see how far the Spark community has grown. In particular, it’s passed that threshold where we knew each user personally, to reach a point where we primarily get questions, code contributions, and feature requests from users that we don’t know. Most awesome of those are the ones that start with “I tried Spark and it rocked” before asking a question. This is great not just because it improves the software, but because it’s often let us discover new use cases that we didn’t anticipate, and led to new research. For example, the Orchestra work at SIGCOMM 2011 was partly motivated by users running machine learning algorithms that required large broadcasts.
To keep in touch with the community, we’ve started hosting a regular Spark User Meetup, which is a mix of tutorials, presentations of upcoming features, presentations from users, and Q&A. The first two meetups were held at Klout and Conviva. I wanted to give a quick summary of the topics presented for those who missed them:
- Matei Zaharia from Berkeley talked about the goals of the Spark project and gave a tutorial on how to set it up locally or on Amazon EC2. The goal was to show people where the various new features we are developing fit in and where to find help on how to use what’s already there. The slides are available online (PPTX).
- Karthik Thiyagarajan from Quantifind explained how they are starting to use Spark for realtime exploration of time series data. Quantifind is a startup that offers predictive analytics — identifying trends from time series to make decisions. They use Spark as an almost realtime database service, where new data is ingested periodically and can be queried interactively from a web interface. Check out Karthik’s slides for more details. This is one of the use cases we’re really interested in exploring further.
- This was the first unveiling of Shark, a port of Apache Hive onto Spark that we are developing. Hive is a popular large-scale data warehouse that provides a SQL interface for running queries on Hadoop MapReduce. With Shark, we can run the same queries over cached in-memory data in Spark, leading to up to 10x better performance for interactive data mining. One neat thing about the port is that it’s backwards-compatible with Hive, using the same language, user-defined functions, and metadata store, so it can run seamlessly on existing Hive data. Cliff Engle from the Shark team gave a talk (PPTX) on Shark’s design and some initial results.
- Dilip Joseph from Conviva talked about their use of Spark for a variety of reporting and analytics applications. Conviva provides video streaming optimization and management systems that need to deliver high-quality live video to thousands of concurrent viewers. They use Spark in combination with Hadoop and Hive to analyze the large sets of resulting logs, compute statistics over the data, and identify problems or optimization opportunities. They were some of the first production users of Spark, and today, they run 30% of their reports in Spark. Check out Dilip’s blog post on the Conviva engineering blog for more details.
Both meetups had full rooms with over 40 people from attending, which is great. We’re hoping to hold the next meetup in the first or second week of April. Please sign up for the meetup.com group if you’re interested.
In other Spark news, a demo paper on Shark, the Hive-on-Spark port we are developing, was accepted at the SIGMOD conference, while a paper on Spark itself will appear at NSDI. Additionally, the Mesos cluster manager developed in our lab, which Spark runs over, called its first Apache release vote today to release version 0.9, a major milestone that contains new usability, fault tolerance, and stability features developed at Twitter. Finally, Spark and Shark talks were accepted at Scala Days 2012 (in London, England) and the Hadoop Summit (in Sunnyvale, CA). If you’re at those conferences and want to learn more about Spark, please drop by!