MLlib: Machine Learning in Apache Spark

Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar
Journal of Machine Learning Research, 17 (34), Apr. 2016.

Tags: Machine Learning, MLlib, spark

SparkNet

Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this …

SparkNet: Training Deep Networks on Spark

Philipp Moritz, Robert Nishihara, Ion Stoica, Michael Jordan
International Conference on Learning Representations (ICLR), May 2016.

Tags: deep learning, distributed machine learning, Machine Learning, spark

Succinct on Apache Spark: Queries on Compressed RDDs

Posted on November 5, 2015 by Rachit Agarwal

tl;dr Succinct is a distributed data store that supports a wide range of point queries (e.g., search, count, range, random …

Succinct: Enabling Queries on Compressed Data

Web applications and services today collect, store and analyze an immense amount of data. As data sizes continue to grow, the …

Splash: Efficient Stochastic Learning on Clusters

Splash is a general framework for parallelizing stochastic learning algorithms (SGD, Gibbs sampling, etc.) on multi-node clusters. It consists of a …

Spark SQL: Relational Data Processing in Spark

Michael Armbrust, Reynold Xin, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali Ghodsi, Matei Zaharia
ACM SIGMOD Conference 2015, May 2015.

Tags: Catalyst, Dataframes, JSON, Optimization, query processing, Shark, spark, SQL

Rethinking Data-Intensive Science Using Scalable Analytics Systems

Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Michael Franklin, Anthony Joseph, David Patterson
ACM SIGMOD Conference, May 2015.

Tags: genomics, spark

Discretized Streams: Fault-Tolerant Streaming Computation at Scale

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Ion Stoica, Scott Shenker
SOSP, Nov. 2013.

Tags: D-streams, spark, spark streaming

GraphX: Graph Processing in a Distributed Dataflow Framework

Joseph Gonzalez, Reynold Xin, Ankur Dave, Dan Crankshaw, Michael Franklin, Ion Stoica
OSDI, Oct. 2014.

Tags: Big Data, dataflow, Graphs, Graphx, query processing, spark

Fine-grained Partitioning for Aggressive Data Skipping

Liwen Sun, Michael Franklin, Sanjay Krishnan, Reynold Xin
ACM SIGMOD, Jun. 2014.

Tags: algorithms, Big Data, data warehouse, databases, partitioning, physical database design, spark

ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing

Matt Massie, Frank Austin Nothaft, Chris Hartl, Christos Kozanitis, Andre Schumacher, Anthony Joseph, David Patterson
Dec. 2013.

Tags: genomics, spark

Fast and Interactive Analytics over Hadoop Data with Spark

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
Usenix ;login:, Aug. 2012.

Tags: BDAS, Big Data, hadoop, spark

Shark: SQL and Rich Analytics at Scale

Reynold Xin, Joshua Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
ACM SIGMOD Conference, Jun. 2013.

Tags: Big Data, spark, SQL, Warehouse

Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (Best Demo Award)

Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Demonstration Paper, ACM SIGMOD Conference, May 2012.

Tags: BDAS, Best Paper Award, Big Data, Shark, spark, SQL

Traffic jams, cell phones and big data

Posted on January 18, 2012 by Timothy Hunter

(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly) It is well known that big data processing is becoming …

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Best Paper Award)

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
NSDI 2012, Apr. 2012.

Tags: BDAS, Best Paper Award, Big Data, spark, storage

Spark – Lightning-Fast Cluster Computing

Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and …