Primary goals: provide better support for parallel and high-performance applications; enable more predictable performance; scale the operating system to a large number of cores. Akaros is based on a few related ideas: allow processes to have control of their resources … Continue reading →
The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving very large datasets, the computation of bootstrap-based quantities can be extremely computationally demanding. As an alternative, we introduce the Bag of Little … Continue reading →
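For readers unfamiliar with the underlying idea, the following is a minimal sketch of the ordinary bootstrap, not the Bag of Little Bootstraps method the excerpt goes on to introduce. It estimates the standard error of an estimator by resampling the data with replacement. The function name `bootstrap_standard_error` and its parameters are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_standard_error(data, estimator, num_resamples=1000, seed=0):
    """Estimate the standard error of `estimator` with the plain bootstrap.

    Hypothetical helper for illustration only; this is the classic bootstrap,
    not the Bag of Little Bootstraps.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    n = len(data)
    estimates = []
    for _ in range(num_resamples):
        # Draw n points with replacement from the observed data.
        resample = rng.choices(data, k=n)
        estimates.append(estimator(resample))
    # The spread of the resampled estimates approximates the estimator's
    # sampling variability.
    return statistics.stdev(estimates)


if __name__ == "__main__":
    gen = random.Random(42)
    sample = [gen.gauss(0.0, 1.0) for _ in range(500)]
    se = bootstrap_standard_error(sample, statistics.mean)
    print(f"bootstrap standard error of the mean: {se:.4f}")
```

Each resample evaluates the estimator on as many points as the original dataset, and this is repeated `num_resamples` times, which is why the cost grows quickly on very large datasets.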
It may have been true once that expertise in computer science was needed only by computer scientists. But Big Data has shown us that’s no longer the case. The war against cancer is increasingly moving into cyberspace, and it is … Continue reading →
An energy bug is a system behavior that causes unexpectedly heavy use of energy and which is not intrinsic to providing the desired functionality. We aim to identify and help diagnose energy bugs in mobile devices by performing distributed, low-overhead sampling, aggregating these data, and applying statistical methods to identify the apps, … Continue reading →
CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages … Continue reading →
Divide-Factor-Combine (DFC) is a parallel divide-and-conquer framework for noisy matrix factorization problems, e.g., matrix completion and robust matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization … Continue reading →
Another effort related to genomics underway at the AMP Lab involves developing a variant calling pipeline. Variant calling is the process of translating the output of DNA sequencing machines, short reads, to a summary of the unique characteristics of the … Continue reading →
As the cost of DNA sequencing continues to drop faster than Moore’s Law, there is a growing need for tools that can efficiently analyze larger bodies of sequence data. By mid-2013, sequencing a human genome is expected to cost $1000, … Continue reading →
MDCC (Multi-Data Center Consistency) is a project to efficiently achieve stronger consistency for databases deployed across several different data centers. MDCC has two main components: a new service-level-objective-aware programming model that empowers the developer with more information about the … Continue reading →
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications. Mesos is open source in the Apache Incubator. More information … Continue reading →
PIQL is a SQL-like language that uses a new scale-independent optimization strategy to execute relational queries while maintaining the performance predictability and scalability provided by distributed key/value stores. Scale-independent optimization guarantees that all queries will perform a bounded number of storage operations … Continue reading →
How do we ensure that AMP Lab works on important and immediate problems? One of many ways is to look at real-life workloads from our industry partners and their customers. AMP Lab is fortunate to have under our analysis … Continue reading →
We have ported Apache Hive, a large-scale data warehouse solution, to run queries on Spark, a high-speed in-memory cluster computing framework. The resulting system, Shark (Hive on Spark), can answer Hive QL queries 40 times faster than Hive without modification … Continue reading →
Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into … Continue reading →
The Sparrow project introduces a distributed cluster scheduling architecture that supports ultra-high-throughput, low-latency task scheduling. By supporting very low-latency tasks (and their associated high rate of task turnover), Sparrow enables a new class of cluster applications that analyze … Continue reading →