I’m happy to announce that the next major release of Spark, 0.6.0, is now available. Spark is a fast cluster computing engine developed at the AMP Lab that can run up to 30x faster than Hadoop using in-memory computing. This is the biggest Spark release to date in terms of both features and contributors, with over a dozen new contributors from Berkeley and elsewhere. Apart from the visible features, such as a standalone deploy mode and a Java API, it includes a significant rearchitecting of Spark under the hood that provides up to 2x faster network performance and support for even lower-latency jobs.
The major focus points in this release have been accessibility (making Spark easier to deploy and use) and performance. The full release notes are posted online, but here are some highlights:
- Simpler deployment: Spark now has a pure-Java standalone deploy mode that lets it run without an external cluster manager, as well as experimental support for running on YARN (Hadoop NextGen).
- Java API: a new API exposes all of Spark’s features to Java developers in a clean manner.
- Expanded documentation: a new documentation site, http://spark-project.org/docs/0.6.0/, contains significantly expanded docs, such as a quick start guide, tuning guide, configuration guide, and detailed Scaladoc help.
- Engine enhancements: a new, custom communication layer and storage manager based on Java NIO provide improved performance for network-heavy operations.
- Debugging enhancements: Spark’s logs now show which line of your code each operation corresponds to.
As mentioned above, this release is also the work of an unprecedentedly large group of developers. Here are some of the people who contributed to Spark 0.6:
- Tathagata Das contributed the new communication layer and parts of the storage layer.
- Haoyuan Li contributed the new storage manager.
- Denny Britz contributed the YARN deploy mode, key aspects of the standalone deploy mode, and several other features.
- Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.
- Josh Rosen contributed the Java API, as well as several bug fixes.
- Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.
- Reynold Xin contributed numerous bug and performance fixes.
- Imran Rashid contributed the new Accumulable class.
- Harvey Feng contributed improvements to shuffle operations.
- Shivaram Venkataraman improved Spark’s memory estimation and wrote a memory tuning guide.
- Ravi Pandya contributed Spark run scripts for Windows.
- Mosharaf Chowdhury provided several fixes to the broadcast implementation.
- Henry Milner pointed out several bugs in sampling algorithms.
- Ray Racine provided improvements to the EC2 scripts.
- Paul Ruan and Bill Zhao helped with testing.
We’re very proud of this release, and we hope that you enjoy it. You can download the code at http://www.spark-project.org/release-0.6.0.html.