Tag Archives:
CoCoA: A Framework for Distributed Optimization
A major challenge in many large-scale machine learning tasks is to solve an optimization objective involving data that is distributed … Continue reading
Succinct on Apache Spark: Queries on Compressed RDDs
tl;dr Succinct is a distributed data store that supports a wide range of point queries (e.g., search, count, range, random … Continue reading
Succinct: Enabling Queries on Compressed Data
Web applications and services today collect, store and analyze an immense amount of data. As data sizes continue to grow, the … Continue reading
KeystoneML
KeystoneML is a research project exploring techniques to simplify the construction of large scale, end-to-end, machine learning pipelines. KeystoneML is designed around … Continue reading
Splash: Efficient Stochastic Learning on Clusters
Splash is a general framework for parallelizing stochastic learning algorithms (SGD, Gibbs sampling, etc.) on multi-node clusters. It consists of a … Continue reading
Velox: Models in Action
To support complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused … Continue reading
Alluxio (formerly Tachyon), a Memory Speed Virtual Distributed Storage System
As datasets continue to grow, storage and networking pose the most challenging bottlenecks for many workloads. To address the bottleneck, … Continue reading
GraphX: Large-Scale Graph Analytics
Increasingly, data-science applications require the creation, manipulation, and analysis of large graphs ranging from social networks to language … Continue reading
Concurrency Control for Machine Learning
Many machine learning (ML) algorithms iteratively transform some global state (e.g., model parameters or variable assignment) giving the illusion of … Continue reading
MLbase: Distributed Machine Learning Made Easy
Implementing and consuming Machine Learning techniques at scale are difficult tasks for ML Developers and End Users. MLbase is a platform … Continue reading
AMP Camp Follow-up and Preview of What’s Next
In August, we hosted the first AMP Camp “Big Data bootcamp” and it was a huge success, with a sold-out … Continue reading
DNA Processing Pipeline
Another effort related to genomics underway at the AMP Lab involves developing a variant calling pipeline. Variant calling is the … Continue reading
DNA Sequence Alignment with SNAP
As the cost of DNA sequencing continues to drop faster than Moore’s Law, there is a growing need for tools … Continue reading
Cancer Tumor Genomics: Fighting the Big C with the Big D
It may have been true once that expertise in computer science was needed only by computer scientists. But Big Data … Continue reading
Real Life Datacenter Workloads
How do we ensure that AMP Lab works on important and immediate problems? One of many ways is to look … Continue reading
Traffic jams, cell phones and big data
(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly) It is well known that big data processing is becoming … Continue reading
BLB: Bootstrapping Big Data
The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving very large … Continue reading