New BDAS Features


In this post we briefly describe four of the newest features in the Berkeley Data Analytics Stack (BDAS):

  1. GraphX: large-scale interactive graph analytics
  2. BlinkDB: real-time analytics through bounded approximations
  3. MLbase: scalable machine learning library
  4. Tachyon: reliable data sharing at memory speed across cluster frameworks

GraphX

GraphX is a distributed interactive graph computation system integrated with Spark. GraphX exposes a new API that treats tables and graphs as composable objects, enabling users to easily construct graphs from tables, join graphs and tables, and apply iterative graph algorithms (e.g., PageRank and community detection) using Pregel-like operators. The GraphX system combines recent advances in graph processing systems with distributed join optimizations and lineage to efficiently execute distributed graph computation in the context of a fully fault-tolerant data-parallel platform. On top of GraphX we have built a collection of useful tools for graph analytics.
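
To make the "graph from a table, then iterate with Pregel-like operators" idea concrete, here is a minimal single-machine sketch in plain Python. It is not the GraphX API: the function names (build_graph, pagerank), the damping factor of 0.85, and the handling of vertices without out-edges are all assumptions made for illustration. Each loop iteration plays the role of one superstep: vertices send messages along their out-edges, then every vertex updates its value from the messages it received.

```python
from collections import defaultdict


def build_graph(edge_table):
    """Turn a table of (src, dst) rows into a vertex set and adjacency map."""
    out_edges = defaultdict(list)
    vertices = set()
    for src, dst in edge_table:
        out_edges[src].append(dst)
        vertices.add(src)
        vertices.add(dst)
    return vertices, out_edges


def pagerank(edge_table, num_supersteps=20, damping=0.85):
    """Pregel-style PageRank: message phase, then vertex-update phase."""
    vertices, out_edges = build_graph(edge_table)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(num_supersteps):
        # Message phase: each vertex splits its rank among its out-neighbors.
        inbox = defaultdict(float)
        dangling = 0.0  # rank held by vertices with no out-edges
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
            else:
                dangling += rank[v]

        # Vertex-update phase: combine incoming messages into a new rank.
        # Dangling rank is spread evenly so the ranks keep summing to 1.
        rank = {
            v: (1.0 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in vertices
        }
    return rank


if __name__ == "__main__":
    edges = [("a", "c"), ("b", "c"), ("c", "a")]
    for vertex, score in sorted(pagerank(edges).items()):
        print(vertex, round(score, 4))
```

A distributed system would partition the vertices and edges across machines and exchange the messages over the network; the sketch keeps everything in one process so the superstep structure is easy to see.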

BlinkDB

BlinkDB is a large-scale data warehouse system built on Shark and Spark that aims to achieve real-time (i.e., sub-second) response times for a variety of SQL-based aggregation queries (augmented by a time and/or error bound) on massive amounts of data. It does this by operating on statistical samples of the underlying datasets rather than on all the data. More precisely, BlinkDB lets the user trade the accuracy of the results against the time it takes to compute queries. The challenge is to ensure that query results are still meaningful, even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that indicate the quality of the sampled results.
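
The sampling-plus-bootstrap idea can be sketched in a few lines of plain Python. This is an illustration only, not BlinkDB code: the function name approximate_mean, its parameters, and the choice of a simple percentile bootstrap are assumptions. The function answers an AVG-style query from a random sample, then resamples that sample with replacement to estimate a confidence interval around the answer.

```python
import random
import statistics


def approximate_mean(data, sample_size, num_resamples=200,
                     confidence=0.95, seed=0):
    """Estimate the mean of `data` from a sample, with a bootstrap interval.

    Returns (estimate, lower, upper). `data` must be a sequence (e.g. a list).
    """
    rng = random.Random(seed)

    # Look at only a subset of the data (sampling without replacement).
    sample = rng.sample(data, sample_size)
    estimate = statistics.fmean(sample)

    # Bootstrap: resample the sample with replacement many times and
    # record the mean of each resample. The resamples are independent,
    # so a real system could compute them in parallel.
    boot_means = sorted(
        statistics.fmean(rng.choices(sample, k=sample_size))
        for _ in range(num_resamples)
    )

    # Percentile interval: cut off the tails of the bootstrap distribution.
    alpha = (1.0 - confidence) / 2.0
    lower = boot_means[int(alpha * num_resamples)]
    upper = boot_means[min(num_resamples - 1,
                           int((1.0 - alpha) * num_resamples))]
    return estimate, lower, upper


if __name__ == "__main__":
    data = list(range(100_000))
    for size in (100, 10_000):
        est, lo, hi = approximate_mean(data, size)
        print(f"sample={size}: mean ~ {est:.1f}, interval width {hi - lo:.1f}")
```

Running the usage example with a small and a large sample shows the trade-off described above: the larger sample costs more work but yields a narrower interval.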

MLbase

The MLbase project aims to provide fast, scalable, and easy-to-use machine learning on top of Spark, and is composed of three core components. MLlib is the first production-ready component of the MLbase project and is the standard library for machine learning on Spark. Released as part of Spark 0.8.0, MLlib includes fast, scalable algorithms for classification, regression, clustering, collaborative filtering, and convex optimization. The second component of MLbase, MLI, is an API for distributed machine learning. By offering an API that is familiar to machine learning developers, MLI provides a DSL for development of new machine learning algorithms on top of Spark. The final component of MLbase, the MLbase Optimizer, attempts to automate the process of model selection and ML pipeline construction, and is an area of active research within the AMPLab.

Tachyon

Tachyon is a fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. Tachyon achieves memory speed and fault tolerance by using memory aggressively and leveraging lineage information. Tachyon caches working-set files in memory and enables different jobs, queries, and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.
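
The combination of in-memory caching and lineage can be illustrated with a toy, single-process sketch in Python. This is not Tachyon's API: the LineageStore class, its method names, and the use of a dictionary as a stand-in for persistent storage are all hypothetical. The idea it shows is that a derived file is kept only in memory together with a record of how it was produced, so that if the in-memory copy is lost it can be recomputed from its inputs instead of being replicated up front.

```python
class LineageStore:
    """Toy in-memory file store that recovers lost files from lineage."""

    def __init__(self, durable_inputs):
        # Stand-in for persistent storage holding the original input files.
        self._durable = dict(durable_inputs)
        self._memory = {}    # cached file contents
        self._lineage = {}   # name -> (function, names of input files)
        self.recomputations = 0

    def write_derived(self, name, func, input_names):
        """Compute a file from other files and remember how it was made."""
        self._lineage[name] = (func, tuple(input_names))
        self._memory[name] = func(*[self.read(i) for i in input_names])

    def read(self, name):
        """Serve from memory if possible; otherwise reload or recompute."""
        if name in self._memory:
            return self._memory[name]
        if name in self._durable:
            data = self._durable[name]
        elif name in self._lineage:
            func, input_names = self._lineage[name]
            data = func(*[self.read(i) for i in input_names])
            self.recomputations += 1
        else:
            raise KeyError(name)
        self._memory[name] = data  # cache for later readers
        return data

    def lose_memory(self):
        """Simulate a failure that wipes all cached contents."""
        self._memory.clear()


if __name__ == "__main__":
    store = LineageStore({"logs": [3, 1, 2]})
    store.write_derived("sorted", sorted, ["logs"])
    store.write_derived("largest", lambda xs: xs[-1], ["sorted"])
    store.lose_memory()
    print(store.read("largest"), store.recomputations)
```

After the simulated failure, reading "largest" walks the lineage back to the durable input and rebuilds both derived files; subsequent reads are served from memory again.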

Learn more at Strata

At Strata 2014 we will be hosting AMP Camp 4 with hands-on exercises (available online as well) to help people get started using BDAS and the exciting new features we have been developing.

You can register here for AMP Camp 4 and the 2014 Strata Conference. Use the code AMP20 when registering to get 20% off your ticket price. The conference offers one day of tutorials (Feb 11) and two days of presentations (Feb 12-13). Please make sure to select the tutorials day if you wish to join us at the AMP Camp.