Shark 0.2 Released and 0.3 Preview


I am happy to announce that the next major release of Shark, 0.2, is now available. Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer HiveQL queries up to 30 times faster than Hive, without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.

We released Shark 0.2 on October 15, 2012. The new version is much more stable and also features significant performance improvements. The full release notes are posted on Shark’s GitHub wiki, but here are some highlights:

  • Hive compatibility: Hive version support is bumped up to 0.9, and UDFs/UDAFs are fully supported and can be distributed to all slaves using the ADD FILE command (see the UDF sketch after this list).
  • Simpler deployment: Shark 0.2 works with Spark 0.6’s standalone deployment mode, which means you can run Shark in cluster mode without depending on Apache Mesos. We have also simplified deployment: you can now set up a single-node Shark instance in 5 minutes and launch a cluster on EC2 in 20 minutes.
  • Thrift server mode: Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server, which is compatible with Hive’s Thrift server.
  • Performance improvements: We rewrote Shark’s join and group-by code; many workloads see up to a 2x speedup.
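
To make the UDF workflow above concrete, here is a minimal sketch of a Hive-style UDF that Shark can use once it has been shipped to the cluster. The package, class, and function names (example.udfs.StripPunct, strip_punct) are hypothetical; the registration statements in the comments are standard HiveQL.

    package example.udfs;  // hypothetical package name

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A minimal Hive UDF that strips punctuation from a string column.
    // After compiling it into a jar, ship it and register it from the Shark CLI:
    //   ADD JAR /path/to/example-udfs.jar;
    //   CREATE TEMPORARY FUNCTION strip_punct AS 'example.udfs.StripPunct';
    //   SELECT strip_punct(comment) FROM reviews;
    public class StripPunct extends UDF {
      public Text evaluate(Text input) {
        if (input == null) {
          return null;
        }
        // Drop everything that is not a letter, digit, or whitespace.
        return new Text(input.toString().replaceAll("[^\\p{L}\\p{N}\\s]", ""));
      }
    }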

In addition to the 0.2 release, we are working on the next major version, 0.3, expected to be released in November. Below is a preview of some of the features:

  • Columnar compression: We are adding fast columnar data compression. You can fit more data into your cluster’s memory without sacrificing query execution speed (a rough illustration follows this list).
  • Memory Management Dashboard: We are working on a dashboard that shows key characteristics of the cluster, e.g. what tables are in memory versus on disk.
  • Automatic optimizations: Shark will automatically determine the right degree of parallelism and users will not have to worry about setting configuration variables.
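
As a rough illustration of why a columnar layout compresses well (this is not Shark’s actual implementation), the sketch below run-length encodes a single integer column: because a column stores all of its values together, long runs of repeated values, such as a low-cardinality status code, collapse into a handful of (value, count) pairs.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative run-length encoding of one integer column.
    public class RunLengthColumn {
      // Encode the column as (value, runLength) pairs.
      static List<int[]> encode(int[] column) {
        List<int[]> runs = new ArrayList<int[]>();
        int i = 0;
        while (i < column.length) {
          int value = column[i];
          int run = 1;
          while (i + run < column.length && column[i + run] == value) {
            run++;
          }
          runs.add(new int[] { value, run });
          i += run;
        }
        return runs;
      }

      public static void main(String[] args) {
        int[] column = { 7, 7, 7, 7, 3, 3, 9, 9, 9 };
        for (int[] run : encode(column)) {
          System.out.println("value=" + run[0] + " count=" + run[1]);
        }
      }
    }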

 

Spark 0.6.0 Released


I’m happy to announce that the next major release of Spark, 0.6.0, is now available. Spark is a fast cluster computing engine developed at the AMP Lab that can run 30x faster than Hadoop using in-memory computing. This is the biggest Spark release to date in terms of features, as well as the biggest in terms of contributors, with over a dozen new contributors from Berkeley and outside. Apart from the visible features, such as a standalone deploy mode and Java API, it includes a significant rearchitecting of Spark under the hood that provides up to 2x faster network performance and support for even lower-latency jobs.

The major focus points in this release have been accessibility (making Spark easier to deploy and use) and performance. The full release notes are posted online, but here are some highlights:

  • Simpler deployment: Spark now has a pure-Java standalone deploy mode that lets it run without an external cluster manager, as well as experimental support for running on YARN (Hadoop NextGen).
  • Java API: exposes all of Spark’s features to Java developers in a clean manner (see the sketch after this list).
  • Expanded documentation: a new documentation site, http://spark-project.org/docs/0.6.0/, contains significantly expanded docs, such as a quick start guide, tuning guide, configuration guide, and detailed Scaladoc help.
  • Engine enhancements: a new, custom communication layer and storage manager based on Java NIO provide improved performance for network-heavy operations.
  • Debugging enhancements: Spark’s logs now indicate which line of your program each operation corresponds to.
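
As a small taste of the new Java API together with the standalone deploy mode, here is a minimal sketch that sums line lengths of a text file. The master URL, input path, and class name are placeholders, and the snippet assumes the spark.api.java packages shipped with 0.6.0.

    import spark.api.java.JavaRDD;
    import spark.api.java.JavaSparkContext;
    import spark.api.java.function.Function;
    import spark.api.java.function.Function2;

    // Sums the lengths of all lines in a text file using the new Java API.
    public class JavaLineLengths {
      public static void main(String[] args) {
        // Point the context at a standalone master started with the new deploy
        // scripts; pass "local" instead to run everything on one machine.
        JavaSparkContext sc =
            new JavaSparkContext("spark://master-host:7077", "Line Lengths");

        JavaRDD<String> lines = sc.textFile("hdfs:///data/pages.txt");

        // Map each line to its length, then add the lengths up on the cluster.
        JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
          public Integer call(String line) {
            return line.length();
          }
        });

        int total = lengths.reduce(new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) {
            return a + b;
          }
        });

        System.out.println("Total characters: " + total);
      }
    }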

As mentioned above, this release is also the work of an unprecedentedly large set of developers. Here are some of the people who contributed to Spark 0.6:

  • Tathagata Das contributed the new communication layer, and parts of the storage layer.
  • Haoyuan Li contributed the new storage manager.
  • Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.
  • Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.
  • Josh Rosen contributed the Java API, as well as several bug fixes.
  • Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.
  • Reynold Xin contributed numerous bug and performance fixes.
  • Imran Rashid contributed the new Accumulable class.
  • Harvey Feng contributed improvements to shuffle operations.
  • Shivaram Venkataraman improved Spark’s memory estimation and wrote a memory tuning guide.
  • Ravi Pandya contributed Spark run scripts for Windows.
  • Mosharaf Chowdhury provided several fixes to broadcast.
  • Henry Milner pointed out several bugs in sampling algorithms.
  • Ray Racine provided improvements to the EC2 scripts.
  • Paul Ruan and Bill Zhao helped with testing.

We’re very proud of this release, and hope that you enjoy it. You can grab the code at http://www.spark-project.org/release-0.6.0.html.