Time-Evolving Graph Processing at Scale

Anand Padmanabha Iyer, Li Erran Li, Tathagata Das, Ion Stoica
Graph Data-management Experiences & Systems (GRADES), Jun. 2016.

Tags: Big Data, graph, graph analytics, graph processing

CoCoA: A Framework for Distributed Optimization

A major challenge in many large-scale machine learning tasks is to solve an optimization objective involving data that is distributed … Continue reading →

Succinct on Apache Spark: Queries on Compressed RDDs

Posted on November 5, 2015 by Rachit Agarwal

tl;dr Succinct is a distributed data store that supports a wide range of point queries (e.g., search, count, range, random … Continue reading →

Succinct: Enabling Queries on Compressed Data

Web applications and services today collect, store and analyze an immense amount of data. As data sizes continue to grow, the … Continue reading →

KeystoneML

KeystoneML is a research project exploring techniques to simplify the construction of large scale, end-to-end, machine learning pipelines. KeystoneML is designed around … Continue reading →

Splash: Efficient Stochastic Learning on Clusters

Splash is a general framework for parallelizing stochastic learning algorithms (SGD, Gibbs sampling, etc.) on multi-node clusters. It consists of a … Continue reading →

Velox: Models in Action

To support complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused … Continue reading →

The Power of Choice in Data-Aware Cluster Scheduling

Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael Franklin, Ion Stoica
OSDI'14, Oct. 2014.

Tags: Approximate Query Processing, Big Data, dataflow, scheduling

GraphX: Graph Processing in a Distributed Dataflow Framework

Joseph Gonzalez, Reynold Xin, Ankur Dave, Dan Crankshaw, Michael Franklin, Ion Stoica
OSDI, Oct. 2014.

Tags: Big Data, dataflow, Graphs, Graphx, query processing, spark

A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data

Jiannan Wang, Sanjay Krishnan, Michael Franklin, Ken Goldberg, Tim Kraska, Tova Milo
SIGMOD, Jun. 2014.

Tags: Big Data, crowdsourcing, Data Cleaning, query processing, Sampling

Fine-grained Partitioning for Aggressive Data Skipping

Liwen Sun, Michael Franklin, Sanjay Krishnan, Reynold Xin
ACM SIGMOD, Jun. 2014.

Tags: algorithms, Big Data, data warehouse, databases, partitioning, physical database design, spark

Alluxio (formerly Tachyon), a Memory Speed Virtual Distributed Storage System

As datasets continue to grow, storage and networking pose the most challenging bottlenecks for many workloads. To address the bottleneck, … Continue reading →

Concurrency Control for Machine Learning

Many machine learning (ML) algorithms iteratively transform some global state (e.g., model parameters or variable assignment) giving the illusion of … Continue reading →

A General Bootstrap Performance Diagnostic

Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan
ACM KDD 2013, Aug. 2013.

Tags: Big Data, BlinkDB, Bootstrap, Error Bars

GraphX: A Resilient Distributed Graph System on Spark

Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica
GRADES (SIGMOD workshop), Jun. 2013.

Tags: Big Data, graph

Generalized Scale Independence Through Incremental Precomputation

Michael Armbrust, Eric Liang, Tim Kraska, Armando Fox, Michael Franklin, David Patterson
ACM SIGMOD Conference, Jun. 2013.

Tags: Big Data, Materialized Views, PIQL, SCADS, scale independence

Fast and Interactive Analytics over Hadoop Data with Spark

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
Usenix ;login:, Aug. 2012.

Tags: BDAS, Big Data, hadoop, spark

Shark: SQL and Rich Analytics at Scale

Reynold Xin, Joshua Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
ACM SIGMOD Conference, Jun. 2013.

Tags: Big Data, spark, SQL, Warehouse

MLbase: Distributed Machine Learning Made Easy

Implementing and consuming Machine Learning techniques at scale are difficult tasks for ML Developers and End Users. MLbase is a platform … Continue reading →

AMP Camp Follow-up and Preview of What’s Next

Posted on September 21, 2012 by Andy Konwinski

In August, we hosted the first AMP Camp “Big Data bootcamp” and it was a huge success, with a sold-out … Continue reading →

DNA Processing Pipeline

Another effort related to genomics underway at the AMP Lab involves developing a variant calling pipeline. Variant calling is the … Continue reading →

DNA Sequence Alignment with SNAP

As the cost of DNA sequencing continues to drop faster than Moore’s Law, there is a growing need for tools … Continue reading →

Cancer Tumor Genomics: Fighting the Big C with the Big D

It may have been true once that expertise in computer science was needed only by computer scientists. But Big Data … Continue reading →

A Scalable Bootstrap for Massive Data

Ariel Kleiner, Ameet Talwalkar, Purna Sarkar, Michael Jordan
Journal of the Royal Statistical Society, Series B, Dec. 2011.

Tags: Big Data

Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (Best Demo Award)

Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Demonstration Paper, ACM SIGMOD Conference, May. 2012.

Tags: BDAS, Best Paper Award, Big Data, Shark, spark, SQL

Real Life Datacenter Workloads

How do we ensure that AMP Lab works on important and immediate problems? One of many ways is to look … Continue reading →

Traffic jams, cell phones and big data

Posted on January 18, 2012 by Timothy Hunter

(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly) It is well known that big data processing is becoming … Continue reading →

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Best Paper Award)

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
NSDI 2012, Apr. 2012.

Tags: BDAS, Best Paper Award, Big Data, spark, storage

BLB: Bootstrapping Big Data

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving very large … Continue reading →

Divide-and-Conquer Matrix Factorization

Lester Mackey, Ameet Talwalkar, Michael Jordan
Neural Information Processing Systems (NIPS), Jan. 2012.

Tags: Big Data, matrix factorization

AMP Lab – UC Berkeley
