How do we ensure that AMP Lab works on important and immediate problems? One of many ways is to look at real-life workloads from our industry partners and their customers.
AMP Lab is fortunate to have access to the activity logs of real-life, front-line systems of up to thousands of nodes serving hundreds of petabytes of data. As of early 2012, these logs cover Hadoop, Dryad, enterprise network storage, and other similar systems, from Cloudera, Facebook, Google, Microsoft, NetApp, and Twitter.
We view it as part of our academic contribution to:
- Scientifically understand these workloads,
- Improve large scale systems according to empirical behavior,
- Share our insights with the research community,
- Help our industry partners innovate on design and performance, and ultimately
- Train ourselves to be knowledgeable in using big data to improve society at large.
Selected publications heavily influenced by real-life workloads (in reverse publication order):
- Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads (in press)
- Understanding TCP Incast and Its Implications for Big Data Workloads
- PACMan: Coordinated Memory Caching for Parallel Jobs
- Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis
- Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis
- The Case for Evaluating MapReduce Performance Using Workload Suites
- Disk-Locality in Datacenter Computing Considered Irrelevant
- Design and Evaluation of a Real-Time URL Spam Filtering Service
- Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
- Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
- Reining in the Outliers in MapReduce Clusters using Mantri
- Privacy Settings in Context: A Case Study using Google Buzz