Succinct on Apache Spark: Queries on Compressed RDDs

Rachit Agarwal

tl;dr Succinct is a distributed data store that supports a wide range of point queries (e.g., search, count, range, random access) directly on a compressed representation of the input data. We are very excited to release Succinct as an Apache Spark package that enables search, count, range, and random access queries on compressed RDDs. This release allows users to use Apache Spark as a document store (with search on documents) similar to ElasticSearch, as a key-value store (with search on values) similar to HyperDex, and via an experimental DataFrame interface (with search along columns in a table). When used as a document store, Apache Spark with Succinct is 2.75x faster than ElasticSearch for search queries while requiring 2.5x lower storage, and over 75x faster than native Apache Spark.

Succinct on Apache Spark Overview

Search is becoming an increasingly powerful primitive in big data analytics and web services. Many web services support some form of search, including LinkedIn search, Twitter search, Facebook search, Netflix search, airlines, and hotels, as well as services built specifically around search (Google, Bing, and Yelp, to name a few). Apache Spark supports search via full RDD scans. While fast enough for small datasets, data scans become inefficient as datasets become even moderately large. One way to avoid data scans is to implement indexes, but indexes can significantly increase the memory overhead.

We are very excited to announce the release of Succinct as an Apache Spark package that achieves a unique tradeoff: storage overhead no worse (and often lower) than data-scan-based techniques, and query latency comparable to index-based techniques. Succinct on Apache Spark enables search (and a wide range of other queries) directly on a compressed representation of the RDDs. What differentiates Succinct on Apache Spark is that queries are supported without storing any secondary indexes, without data scans, and without data decompression: all the required information is embedded within the compressed RDD, and queries are executed directly on the compressed RDD.

In addition, Succinct on Apache Spark supports random access of records without scanning the entire RDD, a functionality that we believe will significantly speed up a large number of applications.

An example

Consider a collection of Wikipedia articles stored on HDFS as a flat unstructured file. Let us see how Succinct on Apache Spark supports the above functionalities:

// Import relevant Succinct classes
import edu.berkeley.cs.succinct._ 

// Read an RDD as a collection of articles; sc is the SparkContext
val articlesRDD = sc.textFile("/path/to/data").map(_.getBytes)

// Compress the input RDD into a Succinct RDD, and persist it in memory
// Note that this is a time-consuming step (usually around 8GB/hour/core) since the data needs to be compressed.
// We are actively working on making this step faster.
val succinctRDD = articlesRDD.succinct.persist()

// SuccinctRDD supports a set of powerful primitives directly on the compressed RDD
// Let us start by counting the number of occurrences of "Berkeley" across all Wikipedia articles
val count = succinctRDD.count("Berkeley")

// Now suppose we want to find all offsets in the collection at which “Berkeley” occurs; and 
// create an RDD containing all resulting offsets 
val offsetsRDD = succinctRDD.search("Berkeley")

// Let us look at the first ten results in the above RDD
val offsets = offsetsRDD.take(10)

// Finally, let us extract 20 bytes before and after one of the occurrences of “Berkeley”
val offset = offsets(0)
val data = succinctRDD.extract(offset - 20, 40)

Many more examples on using Succinct on Apache Spark are outlined here.

Performance

[Figure: search latency of Succinct on Apache Spark, ElasticSearch, and native Apache Spark on a 40GB Wikipedia collection]

The figure compares the search performance of Apache Spark with Succinct against ElasticSearch and native Apache Spark. We use a 40GB collection of Wikipedia documents on a 4-server Amazon EC2 cluster with 120GB RAM (so that all systems fit in memory). The search queries use words with a varying number of occurrences (1–10,000), distributed uniformly at random across 10 bins (1–1,000, 1,000–2,000, etc.). Note that the y-axis is on a log scale.

Interestingly, Apache Spark with Succinct is roughly 2.75x faster than ElasticSearch, even though ElasticSearch does not bear the overhead of Apache Spark’s job execution and has all the data in memory. Succinct achieves this speedup while requiring roughly 2.5x less memory than ElasticSearch (due to compression, and due to storing no additional indexes)! Succinct on Apache Spark is over two orders of magnitude faster than Apache Spark’s native RDDs because it avoids data scans. Random access on documents shows similar performance gains (with some caveats).

Below, we describe a few interesting use cases for Succinct on Apache Spark, including a number of interfaces exposed in the release. For more details on the release (and Succinct in general), usage, and benchmark results, please see the Succinct webpage, the NSDI paper, or a more detailed technical report.

Succinct on Apache Spark: Abstractions and use cases

Succinct on Apache Spark exposes three interfaces, each of which may have several interesting use cases. We outline some of them below:

  • SuccinctRDD
    • Interface: Flat (unstructured) files
    • Example application: log analytics
    • Example: one can search across logs (e.g., errors for debugging), or perform random access (e.g., extract logs at certain timestamps).
    • System with similar functionality: Lucene
  • SuccinctKVRDD
    • Interface: Semi-structured data
    • Example application: document stores, key-value stores
    • Example: 
      • (document stores) search across a collection of Wikipedia documents and return all documents that contain, say, string “University of California at Berkeley”. Extract all (or a part of) documents.
      • (key-value stores) search across a set of tweets stored in a key-value store for tweets that contain “Succinct”. Extract all tweets from the user “_ragarwal_”.
    • System with similar functionality: ElasticSearch
  • (An experimental) DataFrame interface
    • Interface: Search and random access on structured data like tables
    • Example applications: point queries on columnar stores
    • Example: given a table with schema {userID, location, date-of-birth, salary, ..}, find all users who were born between 1980 and 1985.
    • Caveat: This interface is experimental; we are currently working on some very exciting projects to support a number of additional SQL operators efficiently, directly on compressed RDDs.
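To make the second interface above concrete, here is a toy, in-memory sketch of the SuccinctKVRDD semantics: search over values returns the keys of matching records, and get() provides random access to a single record. The object name, store contents, and method signatures below are made up for illustration; nothing here is compressed or distributed, whereas the real SuccinctKVRDD executes these queries directly on compressed data inside Spark.

```scala
// Illustrative toy model of key-value search semantics (not the Succinct implementation).
object KVSearchSketch {
  // Hypothetical miniature store: tweet id -> tweet text
  val store: Map[Long, String] = Map(
    1L -> "Succinct on Apache Spark is out!",
    2L -> "Traffic in the Bay Area today...",
    3L -> "Try Succinct for search on compressed RDDs"
  )

  // search(q): keys of all records whose value contains q
  def search(q: String): Seq[Long] =
    store.collect { case (k, v) if v.contains(q) => k }.toSeq.sorted

  // get(k): random access to a single record, without touching other records
  def get(k: Long): Option[String] = store.get(k)

  def main(args: Array[String]): Unit = {
    println(search("Succinct"))
    println(get(2L).getOrElse(""))
  }
}
```

The point of the interface is that both operations avoid scanning (or decompressing) the rest of the collection, which a plain RDD cannot do.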

When not to use Succinct on Apache Spark

There are a few workloads that are not suitable for Succinct on Apache Spark: long sequential reads, and searches for strings that occur very frequently (you may not want to search for “a” or “the”). We outline the associated tradeoffs on the Succinct webpage as well.

Looking Ahead

We at AMPLab are working on several interesting projects to make Succinct on Apache Spark more memory efficient, faster and more expressive. To give you an idea about what is next, we are going to close this post with a hint on our next post: executing Regular Expression queries directly on compressed RDDs. Stay tuned!

How to Build A Bad Research Center

Patterson

The AMPLab is part of a Berkeley tradition of creating 5-year multidisciplinary projects that build prototypes to demonstrate the project vision and depend on biannual retreats for feedback and open shared space to inspire collaboration.

After being involved in a dozen centers over nearly 40 years, I decided to capture my advice on building and running research centers. Following the precedent of my past efforts, “How to Give a Bad Talk” and “How to Have a Bad Career,” I just finished a short technical paper entitled “How to Build a Bad Research Center.”

As a teaser, below are my Eight Commandments to follow to build a bad research center:

  1. Thou shalt not mix disciplines in a center. It is difficult for people from different disciplines to talk to each other, as they don’t share a common culture or vocabulary.  Thus, multiple disciplines waste time, and therefore precious research funding. Instead, remain pure.
  2. Thou shalt expand centers. Expanse is measured geographically, not intellectually. For example, in the US the ideal is having investigators from 50 institutions in all 50 states, as this would make a funding agency look good to the US Senate.
  3. Thou shalt not limit the duration of a center. To demonstrate your faith in the mission of the center, you should be willing to promise to work on it for decades. (Or at least until the funding runs out.)
  4. Thou shalt not build a graven prototype. Integrating results in a center-wide prototype takes time away from researchers’ own, more important, private research.
  5. Thou shalt not disturb thy neighbors. Good walls make good researchers; isolation reduces the chances of being distracted from your work.
  6. Thou shalt not talk to strangers. Do not waste time convening meetings to present research to outsiders; following the 8th commandment, reviews of your many papers supply sufficient feedback.
  7. Thou shalt make decisions as a consensus of equals. The US Congress is a sterling example of making progress via consensus.
  8. Thou shalt honor thy paper publishers. Thus, to ensure center success, you must write, write, write and cite, cite, cite. If the conference acceptance rate is 1/X, then obviously you should submit at least X papers, for otherwise chances are that your center will not have a paper at every conference, which is a catastrophe.

Traffic jams, cell phones and big data

Timothy Hunter

(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly)

It is well known that big data processing is becoming increasingly important in many scientific fields, including astronomy, biomedicine, and climatology. In addition, newly created hybrid disciplines like biostatistics are even stronger indicators of this overall trend. Other fields, such as civil engineering and in particular transportation, are no exception to the rule, and the AMP Lab is actively collaborating with the department of Intelligent Transportation Systems at Berkeley to explore this new frontier.

It comes as no surprise to residents of California that street congestion is a major challenge that affects everyone. While congestion is well studied for highways, it remains an open question for urban streets (also called the arterial road network). So far, the most promising source of data is GPS readings from cellphones. However, a large volume of this very noisy data is required to maintain a good level of accuracy, and the rapid adoption of smartphones, all equipped with GPS, is changing the game. In this post, I introduce some ongoing efforts to combine Mobile Millennium, a state-of-the-art transportation framework, with the AMPLab software stack.

What does this GPS data look like? Here is an example in the San Francisco Bay area: a few hundred taxicabs relay their position every minute in real time to our servers.

The precise trajectories of the vehicles are unobserved and need to be reconstructed using a sophisticated map-matching pipeline implemented in Mobile Millennium. The result of this process is a set of timestamped trajectory segments, which serve as the basic observations for predicting traffic.


Our traffic estimation algorithms work by guessing a probability distribution of the travel time on each link of the road network. This process is iterated to improve the quality of estimates. This overall algorithm is intensive both in terms of computations and memory. Fortunately, it also fits into the category of “embarrassingly parallel” algorithms and is a perfect candidate for distributed computing.
Implementing a high-performance algorithm as a distributed system is not an easy task. Instead of implementing this by hand, our implementation relies on Spark, a programming framework in Scala. Thanks to the Spark framework, we were able to port our single-machine implementation to the EC2 cloud within a few weeks and achieve nearly linear scaling. In a future post, I will discuss some practical considerations we faced when integrating the AMPLab stack with the Mobile Millennium system.
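The embarrassingly parallel structure described above can be sketched in a few lines of Scala. In this toy version, each road link’s travel-time statistic is estimated independently in its own Future, standing in for the Spark cluster; the link names and observed travel times are made up for illustration, and only a per-link mean is computed rather than the full probability distribution used by Mobile Millennium.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object TravelTimeSketch {
  // Hypothetical observations: link id -> observed travel times (seconds)
  val observations: Map[String, Seq[Double]] = Map(
    "linkA" -> Seq(30.0, 34.0, 32.0),
    "linkB" -> Seq(55.0, 65.0),
    "linkC" -> Seq(12.0, 12.0, 12.0, 12.0)
  )

  // Each link is estimated independently of every other link, which is
  // exactly what makes the algorithm embarrassingly parallel. Here each
  // link's estimate runs in its own Future; in the real system, each
  // partition of links would be a Spark task.
  def estimateMeans(): Map[String, Double] = {
    val perLink = observations.map { case (link, times) =>
      Future(link -> times.sum / times.size)
    }
    Await.result(Future.sequence(perLink), 10.seconds).toMap
  }

  def main(args: Array[String]): Unit =
    println(estimateMeans())
}
```

Because no link’s estimate depends on another’s, scaling out is just a matter of distributing the links across machines, which is why the port to Spark required so little restructuring.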


An AMP Blab about some recent system conferences – Part 1: SOSP 2011

ychen

I recently had the pleasure of visiting Portugal for SOSP/SOCC, and New York for Hadoop World. Below are some bits that I found interesting. This is the personal opinion of an AMP Lab grad student – in no way does it represent any official or unanimous AMP Lab position.

Part 1: Symposium on Operating System Principles (SOSP) 2011

A very diverse and high quality technical program, as expected. You can find the proceedings and talk slides/videos at http://sigops.org/sosp/sosp11/current/index.html.

One high-level reaction I have from the conference is that the AMP Lab’s observe-analyze-act design loop positions us well to identify emerging technology trends and to design systems with high impact under real-life scenarios. Our industry partnerships would also allow us to address engineering concerns beyond the laboratory, thus expediting bilateral knowledge transfer between academia and industry.

One best-paper award went to “A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications”, authored by our friends from Univ. of Wisconsin, Professors Andrea and Remzi Arpaci-Dusseau, as well as their students. The paper did an elaborate study of Apple laptop file traces and found many pathological behaviors: for example, a file “write” actually writes the file multiple times, and a file “open” touches a great number of seemingly unrelated files.

Another best-paper award went to “Cells: A Virtual Mobile Smartphone Architecture” from Columbia. This study proposes and implements “virtual phones”, the same idea as virtual machines: for example, running a “work phone” and a “home phone” on the same physical device. The talk highlight was a demo of two versions of the Angry Birds game running simultaneously on the same phone.

The audience-choice best presentation award went to “Atlantis: Robust, Extensible Execution Environments for Web Applications”, a joint work between MSR and Rutgers. The talk very humorously surveyed the defects of current Internet browsers and proposed an “exokernel browser” architecture in which web applications have the flexibility to define their own execution stack, e.g., markup languages, scripting environments, etc. As expected, the talk catalyzed very entertaining questioning from companies with business interests in the future of web browsers.

Also worthy of highlighting: the session on Security contained three papers, all three with Professor Nickolai Zeldovich on the author list, and all three of high quality. I have not done a thorough historical search, but I’m sure it is rare that a single author manages to fill a complete session at SOSP.

There was also a very lively discussion on ACM copyright policies during the SIGOPS working dinner. I personally believe it is vital that we find policies that balance upholding the quality of research, preserving the strength of the research community, and facilitating the sharing of cutting-edge knowledge and insights.

My own talk on “Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis” went very well. This study performs an empirical analysis of large-scale enterprise storage traces, identifies different workloads, and discusses design insights specifically targeted at each workload. The rigorous trace analysis allows us to identify simple, threshold-based storage-system optimizations, with high confidence that the optimizations bring concrete benefits under realistic settings. A big thank you to everyone at the AMP Lab and our co-authors at NetApp for helping me prepare the talk!

Lisbon travel note 1: If history/food is dear to your heart, you will find it worthwhile to visit the Jerónimos Monastery, and try the Pasteis de Nata sold nearby. This is THE authentic egg tart, originated at the Monastery, and very good for a mid-day sugar-high. I had too many – I felt too happy after eating the first 10, lost count of how many more I ate, and skipped lunch and dinner for that day.

Scale-Independent Query Processing With PIQL

Michael Armbrust

(This is joint work with Kristal Curtis and Tim Kraska.)

The Internet is littered with stories of traditional relational databases failing to meet the performance needs of fast-growing internet sites. The story usually goes as follows: everything works great while the site is small; suddenly, the site becomes popular and queries start running slowly. As a result, the developers abandon the relational database and switch to writing queries imperatively against a distributed key/value store. Examples of sites that use distributed key/value stores under the covers include Digg, Facebook, and Twitter, along with many others.

One key driver of this NoSQL movement is the fact that the data independence provided by an RDBMS actually exacerbates the scaling problem by hiding potentially expensive queries behind the relative simplicity of high-level declarative queries.  In contrast, the simple get/put interface of most NoSQL stores provides predictable per-operation performance, independent of the size of the underlying database.

The NoSQL ‘solution,’ however, leads to its own set of problems.  The use of imperative functions instead of declarative queries means that changes to the data model often require time-consuming rewrites.  Additionally, developers have to manually parallelize key/value store requests or suffer delays due to sequential execution. In other words, the benefits of physical and logical data independence are lost.

Instead, we propose PIQL (pronounced pickle), the Performance-Insightful Query Language.  In addition to the physical and logical data independence provided by a traditional system, PIQL ensures the scale independence of all queries in an application at compile time.  A scale-independent query is guaranteed to perform only a bounded number of operations no matter how large the underlying database grows.

Some systems, for example Google AppEngine’s GQL, impose severe functional restrictions, such as removing joins, in order to ensure scalability. In contrast, PIQL employs language extensions, query compilation technology, and response-time estimation to provide scale independence over a larger and more powerful subset of SQL.
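The property that scale independence buys can be illustrated with a toy counting harness (all names here are hypothetical, and this is not the PIQL implementation): a bounded query touches a fixed number of records no matter how large the store grows, while a naive scan’s work grows linearly with the data.

```scala
object ScaleIndependenceSketch {
  // A toy store that counts how many records each query touches, to
  // contrast a scale-independent (bounded) query with a full scan.
  class CountingStore(records: Map[Int, String]) {
    var touched = 0

    // Scale-independent: touches at most keys.size records, however large
    // the store grows (think: a query with a compile-time bound on operations).
    def boundedGet(keys: Seq[Int]): Seq[String] = {
      touched += keys.size
      keys.flatMap(records.get)
    }

    // Not scale-independent: work grows with the size of the store.
    def scan(p: String => Boolean): Seq[String] = {
      touched += records.size
      records.valuesIterator.filter(p).toSeq
    }
  }

  def run(): (Int, Int) = {
    val big = (1 to 100000).map(i => i -> s"user$i").toMap
    val bounded = new CountingStore(big)
    bounded.boundedGet(Seq(7, 42, 99))   // touches exactly 3 records
    val scanning = new CountingStore(big)
    scanning.scan(_.endsWith("42"))      // touches all 100,000 records
    (bounded.touched, scanning.touched)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

PIQL’s contribution, per the text above, is to establish bounds like the first query’s at compile time for a rich subset of SQL, rather than forcing developers to hand-write the bounded access paths.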

We evaluated our ideas by building a prototype on top of SCADS, a distributed key/value store first proposed at CIDR 2009.  Using this prototype we constructed two benchmarks, one based on Twitter and the other on the user-facing queries from TPC-W.

The throughput of the system while running the TPC-W benchmark increases linearly as machines are added to the cluster.

As the above figure shows, systems built using PIQL scale linearly as machines are added while keeping response time constant.  If you’d like to learn more, you can check out our paper appearing in VLDB2012 or look at some of the queries written in our LINQ-like Scala DSL.