Succinct on Apache Spark: Queries on Compressed RDDs

tl;dr Succinct is a distributed data store that supports a wide range of point queries (e.g., search, count, range, random access) directly on a compressed representation of the input data. We are very excited to release Succinct as an Apache Spark package that enables search, count, range, and random access queries on compressed RDDs. This release allows users to use Apache Spark as a document store (with search on documents) similar to ElasticSearch, as a key-value store (with search on values) similar to HyperDex, and through an experimental DataFrame interface (with search along columns in a table). When used as a document store, Apache Spark with Succinct is 2.75x faster than ElasticSearch for search queries while requiring 2.5x less storage, and over 75x faster than native Apache Spark.

Succinct on Apache Spark Overview

Search is becoming an increasingly powerful primitive in big data analytics and web services. Many web services support some form of search, including LinkedIn search, Twitter search, Facebook search, Netflix search, airline and hotel search, as well as services built specifically around search (Google, Bing, and Yelp, to name a few). Apache Spark supports search via full RDD scans. While this is fast enough for small datasets, data scans become inefficient as datasets grow even moderately large. One way to avoid data scans is to build indexes, but indexes can significantly increase the memory overhead.

We are very excited to announce the release of Succinct as an Apache Spark package that achieves a unique tradeoff: storage overhead no worse (and often lower) than data-scan-based techniques, and query latency comparable to index-based techniques. Succinct on Apache Spark enables search (and a wide range of other queries) directly on a compressed representation of RDDs. What differentiates Succinct on Apache Spark is that queries are supported without storing any secondary indexes, without data scans, and without data decompression: all the required information is embedded within the compressed RDD, and queries are executed directly on it.

In addition, Succinct on Apache Spark supports random access of records without scanning the entire RDD, a functionality that we believe will significantly speed up a large number of applications.

An example

Consider a collection of Wikipedia articles stored on HDFS as a flat unstructured file. Let us see how Succinct on Apache Spark supports the above functionalities:

// Import relevant Succinct classes
import edu.berkeley.cs.succinct._ 

// Read an RDD as a collection of articles; sc is the SparkContext
val articlesRDD = sc.textFile("/path/to/data").map(_.getBytes)

// Compress the input RDD into a Succinct RDD, and persist it in memory
// Note that this is a time consuming step (usually at 8GB/hour/core) since data needs to be compressed. 
// We are actively working on making this step faster.
val succinctRDD = articlesRDD.succinct.persist()

// SuccinctRDD supports a set of powerful primitives directly on compressed RDD
// Let us start by counting the number of occurrences of "Berkeley" across all Wikipedia articles
val count = succinctRDD.count("Berkeley")

// Now suppose we want to find all offsets in the collection at which "Berkeley" occurs, and
// create an RDD containing all resulting offsets
val offsetsRDD = succinctRDD.search("Berkeley")

// Let us look at the first ten results in the above RDD
val offsets = offsetsRDD.take(10)

// Finally, let us extract 20 bytes before and after one of the occurrences of "Berkeley"
val offset = offsets(0)
val data = succinctRDD.extract(offset - 20, 40)

Many more examples of using Succinct on Apache Spark are outlined here.

Performance

[Figure: Search latency on a 40GB collection of Wikipedia documents for Succinct on Apache Spark, ElasticSearch, and native Apache Spark (log-scale y-axis).]

The figure compares the search performance of Apache Spark with Succinct against ElasticSearch and native Apache Spark. We use a 40GB collection of Wikipedia documents on a 4-server Amazon EC2 cluster with 120GB RAM (so that all systems fit in memory). The search queries use words with a varying number of occurrences (1–10,000), distributed uniformly at random across 10 bins (1–1,000, 1,000–2,000, etc.). Note that the y-axis is on a log scale.

Interestingly, Apache Spark with Succinct is roughly 2.75x faster than ElasticSearch. This is despite the fact that ElasticSearch does not incur the overhead of Apache Spark's job execution and has all of its data in memory. Succinct achieves this speedup while requiring roughly 2.5x less memory than ElasticSearch (due to compression, and due to storing no additional indexes)! Succinct on Apache Spark is over two orders of magnitude faster than Apache Spark's native RDDs because it avoids data scans. Random access on documents has similar performance gains (with some caveats).

Below, we describe a few interesting use cases for Succinct on Apache Spark, including a number of interfaces exposed in the release. For more details on the release (and Succinct in general), its usage, and benchmark results, please see the Succinct webpage, the NSDI paper, or a more detailed technical report.

Succinct on Apache Spark: Abstractions and use cases

Succinct on Apache Spark exposes three interfaces, each of which may have several interesting use cases. We outline some of them below:

  • SuccinctRDD
    • Interface: Flat (unstructured) files
    • Example application: log analytics
    • Example: one can search across logs (e.g., errors for debugging), or perform random access (e.g., extract logs at certain timestamps).
    • System with similar functionality: Lucene
  • SuccinctKVRDD

    • Interface: Semi-structured data
    • Example application: document stores, key-value stores
    • Example (a short code sketch follows this list):
      • (document stores) search across a collection of Wikipedia documents and return all documents that contain, say, the string “University of California at Berkeley”. Extract all (or parts of) the documents.
      • (key-value stores) search across a set of tweets stored in a key-value store for tweets that contain “Succinct”. Extract all tweets from the user “_ragarwal_”.
    • System with similar functionality: ElasticSearch
  • (An experimental) DataFrame interface
    • Interface: Search and random access on structured data like tables
    • Example applications: point queries on columnar stores
    • Example: given a table with schema {userID, location, date-of-birth, salary, ..}, find all users who were born between 1980 and 1985.
    • Caveat: We are currently working on some very exciting projects to support a number of additional SQL operators efficiently directly on compressed RDDs.
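
To make the SuccinctKVRDD use case concrete, here is a minimal sketch in the spirit of the SuccinctRDD example above. It assumes the package exposes a succinctKV conversion on key-value RDDs along with get (random access by key) and search (search on values) methods; please check the package documentation for the exact method names.

// Import the key-value flavor of Succinct
// Note: the method names below (succinctKV, get, search) are assumed to mirror
// the SuccinctRDD API shown earlier; see the package docs for the exact names.
import edu.berkeley.cs.succinct.kv._

// Build a key-value RDD of (tweetId -> tweet bytes); sc is the SparkContext
val tweetsRDD = sc.textFile("/path/to/tweets")
  .zipWithIndex()
  .map { case (tweet, id) => (id, tweet.getBytes) }

// Compress the input RDD into a SuccinctKVRDD and persist it in memory
val succinctTweetsRDD = tweetsRDD.succinctKV.persist()

// Random access: fetch the tweet stored under key 42
val tweet = succinctTweetsRDD.get(42L)

// Search on values: find the keys of all tweets that contain "Succinct"
val matchingKeys = succinctTweetsRDD.search("Succinct")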

When not to use Succinct on Apache Spark

There are a few applications that are not well suited to Succinct on Apache Spark: long sequential reads, and searches for strings that occur very frequently (you may not want to search for “a” or “the”). We outline the associated tradeoffs on the Succinct webpage as well.

Looking Ahead

We at AMPLab are working on several interesting projects to make Succinct on Apache Spark more memory efficient, faster and more expressive. To give you an idea about what is next, we are going to close this post with a hint on our next post: executing Regular Expression queries directly on compressed RDDs. Stay tuned!

The BerkeleyX XSeries on Big Data is Complete!

This is a follow-up post to our earlier posts about two freely available Massive Open Online Courses (MOOCs) we offered this summer as part of the BerkeleyX Big Data XSeries. The courses were the result of a collaboration between professors Anthony D. Joseph (UC Berkeley) and Ameet Talwalkar (UCLA) with generous sponsorship by Databricks. We are pleased to report that we have completed the first successful runs of both courses, with highly positive student feedback along with enrollment, engagement, and completion rates that are two to five times the averages for MOOCs.

The first course, CS100.1x Introduction to Big Data with Apache Spark, introduced nearly 76,000 students to data science concepts and showed them how to use Spark to perform large-scale analyses through hands-on programming exercises with real-world datasets. Over 35% of the students were active in the course and the course completion rate was 11%. As an alternative to Honor Code completion certificates, we offered a $50 ID Verified Certificate option and more than 4% of the students chose this option. Over 83% of students enrolled in the Verified Certificate option completed the course.

The second course, CS190.1x Scalable Machine Learning, leveraged Spark to introduce students to the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines. This course also had notably high enrollment (50K), completion rate (15%), percentage of Verified Certificate students (6%), and completion rate for verified students (88%). Overall, 1,800 students earned verified certificates for both courses and received a BerkeleyX Big Data XSeries Certificate.  

Stay tuned for announcements about future runs of these and follow-up MOOCs!

BerkeleyX Data Science on Apache Spark MOOC starts today

For the past several months, we have been working to produce two freely available Massive Open Online Courses (MOOCs). We are proud to announce that both MOOCs are launching this month on the BerkeleyX platform!

Today we launched the first course, CS100.1x, Introduction to Big Data with Apache Spark, a brand new five-week long course on Big Data, Data Science, and Apache Spark with nearly 57,000 students (UC Berkeley’s 2014 enrollment was 37,581 students).

The first course teaches students about Apache Spark and performing data analysis. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that use real-world data to teach students how to manipulate datasets using parallel processing with PySpark.

The second course, called Scalable Machine Learning, will begin on June 29th; it will introduce the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines and will provide hands-on experience using Spark.

We would also like to thank the Spark community for their support.  Several community members are serving as teaching assistants and beta testers, and multiple study groups have been organized by community members in anticipation of these courses.

Both courses are available for free on the edX website, and you can sign up for them today:

  1. Introduction to Big Data with Apache Spark
  2. Scalable Machine Learning

For students who complete a course, the courses offer the choice of a free Honor Code completion certificate or a paid edX ID Verified Certificate.

The courses are sponsored in part by the AMPLab and Databricks.

Moore’s Law B. 1965, D. 2015

Gordon Moore (Berkeley class of 1959) made the most incredible technology observation* on April 19, 1965, when he suggested that the number of transistors per integrated circuit would double every year, so that by 1975 there could be 65,000 transistors on a single chip when there were only 64 in 1965 (doubling 64 every year for ten years gives 64 × 2^10 = 65,536). It was a bold prediction of exponential growth. (See the related article in Barron's.)

Here is his second paragraph on the consequences of such growth:

“Integrated circuits will lead to such wonders as home computers—or at least terminals connected to a central computer—automatic controls for automobiles, and personal portable communications equipment. The electronic wristwatch needs only a display to be feasible today.”

Moore’s Law lasted for half a century and changed the world as Moore predicted in his 1965 article, but it has ended. Chip technology is still improving, but not at the exponential rate predicted by Moore’s Law; in fact, each generation is taking longer than the previous one. For example, the biggest Intel microprocessor in 2011 used 2.3 billion transistors, and four years later the largest one uses “only” 5.5 billion transistors. It is still an amazing technology, and it will continue to improve, but not at the breathtaking rate of the past 50 years.

In 2003, Gordon Moore said:

“No exponential is forever … but we can delay ‘forever’.”

Looks like forever was delayed another decade, but no more.

*Gordon E. Moore, “Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, April 19, 1965.

Introducing AMPCrowd: a simple service for declarative cross-crowd microtasking.

Crowdsourcing platforms like Amazon’s Mechanical Turk (AMT) make it possible to assign human workers small ‘microtasks’, such as labeling images or detecting duplicate products, in order to apply the power of human intellect at scale to problems that cannot be fully automated. These platforms often provide programmatic interfaces for microtask management upon which those of us researching the ‘P’ in ‘AMP’ rely heavily. Unfortunately, using those APIs to manage the lifecycle of human computation tasks can be a real hassle. For example, a standard workflow when using a crowd platform like AMT to detect duplicate products involves:

  • Designing a task interface in HTML, Javascript and CSS to allow users to select pairs of products that are identical.
  • Implementing and hosting a web service (with SSL support) to present the task interface to workers on the AMT website.
  • Using the AMT API to create a batch of new tasks and to send each task to multiple workers to ensure high quality results.
  • Using the AMT API to fetch results when the batch has been processed by the workers.
  • Writing custom code to merge the responses of multiple workers for the same task (either simple majority voting, or more sophisticated quality control).
  • Storing the results in a database for future access.

Though much of the code supporting this workflow can theoretically be reused in subsequent crowd-based projects, it seldom turns out that way in practice. In response, we’re releasing AMPCrowd: an extensible web service with a simple RESTful API for managing the crowd task lifecycle. AMPCrowd makes it easy to run crowd tasks on existing platforms, is extensible to new types of microtasks and new crowd platforms, and provides state-of-the-art quality control techniques with no user effort.

Using AMPCrowd, the workflow described above can be accomplished with a single declarative API call to create a new batch of tasks. The user specifies the type of the tasks (e.g. ‘duplicate detection’), the number of distinct worker responses to get for each task, the crowd platform on which to process the tasks (current options are AMT and a local interface for in-house workers), and a JSON dictionary with the data needed to render each task (e.g. the pairs of products to consider). AMPCrowd transparently handles the work of sending tasks to the selected crowd platform, fetching results as workers complete them, and running quality control algorithms to improve the resulting output. Those looking to get results in real-time can pass in a callback URL to receive push notifications as workers finish each individual task. Otherwise, the results are available in AMPCrowd’s PostgreSQL database.
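
To make the shape of such a call concrete, here is a minimal sketch that POSTs a batch-creation request to a locally running AMPCrowd instance. The endpoint path and JSON field names below are hypothetical placeholders chosen for illustration only; the actual request format is described in the AMPCrowd documentation.

// Hypothetical endpoint and field names, for illustration only.
import java.net.{HttpURLConnection, URL}

val payload = """{
  "task_type": "duplicate_detection",
  "num_assignments": 3,
  "crowd_platform": "amt",
  "tasks": [{"product_a": "iPhone 6 (16 GB)", "product_b": "Apple iPhone 6 16GB"}]
}"""

// POST the batch-creation request and print the response code
val conn = new URL("http://localhost:8000/crowds/batches/").openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(payload.getBytes("UTF-8"))
println("AMPCrowd responded with HTTP " + conn.getResponseCode)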

AMPCrowd’s modular design makes it easy to add both new task types and support for new crowd platforms. Adding a new task type is as simple as writing the HTML to display a single task. Adding support for a new crowd platform is a bit more involved, but can be done by implementing a common interface for creating and managing tasks in a standalone Django app, with no need to modify existing code.

AMPCrowd is implemented in python using the Django web framework, and a recent contribution from one of our collaborators (thanks @EntilZhaPR at Trulia!) provides Docker support, so the code can be deployed with a single command. Check out the repository at https://github.com/amplab/ampcrowd, or read our (sparse but growing) documentation at http://amplab.github.io/ampcrowd/.

Spark 0.6.0 Released

I’m happy to announce that the next major release of Spark, 0.6.0, is now available. Spark is a fast cluster computing engine developed at the AMP Lab that can run 30x faster than Hadoop using in-memory computing. This is the biggest Spark release to date in terms of features, as well as the biggest in terms of contributors, with over a dozen new contributors from Berkeley and outside. Apart from the visible features, such as a standalone deploy mode and Java API, it includes a significant rearchitecting of Spark under the hood that provides up to 2x faster network performance and support for even lower-latency jobs.

The major focus points in this release have been accessibility (making Spark easier to deploy and use) and performance. The full release notes are posted online, but here are some highlights:

  • Simpler deployment: Spark now has a pure-Java standalone deploy mode that lets it run without an external cluster manager, as well as experimental support for running on YARN (Hadoop NextGen); a short usage sketch follows this list.
  • Java API: exposes all of Spark’s features to Java developers in a clean manner.
  • Expanded documentation: a new documentation site, http://spark-project.org/docs/0.6.0/, contains significantly expanded docs, such as a quick start guide, tuning guide, configuration guide, and detailed Scaladoc help.
  • Engine enhancements: a new, custom communication layer and storage manager based on Java NIO provide improved performance for network-heavy operations.
  • Debugging enhancements: Spark’s logs now indicate which line of your code each operation corresponds to.
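
As a quick illustration of the new standalone deploy mode, the sketch below connects a job to a standalone master. It assumes a master is already running at spark://master:7077 and uses a hypothetical input path; everything else is the usual Spark API of this release.

// Assumes a standalone master at spark://master:7077 (hypothetical host) is already running.
import spark.SparkContext
import spark.SparkContext._

// Pass the standalone master URL directly to SparkContext, along with a job name,
// the Spark home directory, and any jars to ship to the cluster (none here).
val sc = new SparkContext("spark://master:7077", "StandaloneExample",
  System.getenv("SPARK_HOME"), Nil)

// Hypothetical input path; any HDFS or local file works here.
val lines = sc.textFile("hdfs://namenode:9000/path/to/input")
println("Line count: " + lines.count())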

As mentioned above, this release is also the work of an unprecedentedly large set of developers. Here are some of the people who contributed to Spark 0.6:

  • Tathagata Das contributed the new communication layer, and parts of the storage layer.
  • Haoyuan Li contributed the new storage manager.
  • Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.
  • Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.
  • Josh Rosen contributed the Java API, as well as several bug fixes.
  • Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.
  • Reynold Xin contributed numerous bug and performance fixes.
  • Imran Rashid contributed the new Accumulable class.
  • Harvey Feng contributed improvements to shuffle operations.
  • Shivaram Venkataraman improved Spark’s memory estimation and wrote a memory tuning guide.
  • Ravi Pandya contributed Spark run scripts for Windows.
  • Mosharaf Chowdhury provided several fixes to broadcast.
  • Henry Milner pointed out several bugs in sampling algorithms.
  • Ray Racine provided improvements to the EC2 scripts.
  • Paul Ruan and Bill Zhao helped with testing.

We’re very proud of this release, and hope that you enjoy it. You can grab the code at http://www.spark-project.org/release-0.6.0.html.