Shark 0.2 Released and 0.3 Preview


I am happy to announce that the next major release of Shark, 0.2, is now available. Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer HiveQL queries up to 30 times faster than Hive, without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions.

We released Shark 0.2 on Oct 15, 2012. The new version is much more stable and also features significant performance improvements. The full release notes are posted on Shark's GitHub wiki, but here are some highlights:

  • Hive compatibility: Hive version support is bumped up to 0.9, and UDFs/UDAFs are fully supported and can be distributed to all slaves using the ADD FILE command.
  • Simpler deployment: Shark 0.2 works with Spark 0.6's standalone deployment mode, which means you can run Shark in cluster mode without depending on Apache Mesos. We also simplified deployment: you can now set up a single-node Shark instance in 5 minutes and launch a cluster on EC2 in 20 minutes.
  • Thrift server mode: Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server, which is compatible with Hive’s Thrift server.
  • Performance improvements: We rewrote Shark's join and group-by code, and many workloads can observe a 2X speedup.
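
As an illustration of the UDF support mentioned above, a custom function can be shipped to the cluster and registered using standard Hive commands. This is a minimal sketch; the jar path, class name, and function name below are hypothetical:

```sql
-- Distribute a jar containing a custom UDF to all slaves
-- (my_udfs.jar and com.example.hive.udf.Lower are hypothetical names)
ADD FILE /path/to/my_udfs.jar;

-- Register the UDF under a temporary name for this session
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';

-- Use it like any built-in function
SELECT my_lower(name) FROM users;
```

Because Shark supports Hive's query language and UDF interfaces, existing Hive UDF jars should work without recompilation.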

In addition to the 0.2 release, we are working on the next major version, 0.3, expected to be released in November. Below is a preview of some of the features:

  • Columnar compression: We are adding fast columnar data compression. You can fit more data into your cluster’s memory without sacrificing query execution speed.
  • Memory management dashboard: We are working on a dashboard that shows key characteristics of the cluster, e.g. which tables are in memory versus on disk.
  • Automatic optimizations: Shark will automatically determine the right degree of parallelism, so users will not have to worry about setting configuration variables.