Fast and Interactive Analytics over Hadoop Data with Spark

The past few years have seen tremendous interest in large-scale data analysis, as data volumes in both industry and research continue to outgrow the processing speed of individual machines. Google’s MapReduce model and its open-source implementation, Hadoop, kicked off an ecosystem of parallel data analysis tools for large clusters, such as Apache’s Hive and Pig engines for SQL processing. However, these tools have so far been optimized for one-pass batch processing of on-disk data, which makes them slow for interactive data exploration and for the more complex multi-pass analytics algorithms that are becoming common. In this article, we introduce Spark, a new cluster computing framework that can run applications up to 40× faster than Hadoop by keeping data in memory, and can be used interactively to query large datasets with sub-second latency.