Turning up the volume on big data
Working at the intersection of three massive trends (powerful machine learning, cloud computing, and crowdsourcing), the AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data. We are creating a new generation of analytics tools to answer deep questions over dirty and heterogeneous data by extending and fusing machine learning, warehouse-scale computing, and human computation. Together with our application and industrial partners, we validate these ideas on real-world problems including participatory sensing, urban planning, and personalized medicine.
Scale, immediacy, and continuous improvement
Machine learning (ML) provides computational procedures for turning data into information and knowledge. While it is useful to view ML as a toolbox that can be deployed for many data-centric problems, our long-term goal is more ambitious: we want to develop ML as a full-fledged engineering discipline. To accomplish this, we expand the classic notion of resource management beyond computation and space to include the data itself. Our approach is to integrate confidence estimation into all aspects of the data analytics process while increasing scale and efficiency.
The datacenter as a computer
Many claim that the “datacenter is the new computer,” but in reality datacenters do not provide the key services one expects of a modern computer: high-level analytics languages for data manipulation, programming languages for a wide range of tasks, file systems for efficient and scalable data access, operating systems for isolation and resource management, and debuggers to find and correct programming errors. We are developing datacenter-scale implementations of these components, so that using a datacenter for analytics becomes as easy as using a computer today.
Leveraging human intelligence and activity
People will play a key role in data-intensive applications, not simply as passive consumers of results but as active providers and gatherers of data and as solvers of ML-hard problems that algorithms cannot solve on their own. With crowdsourcing, people can be viewed as highly valuable but unreliable and unpredictable resources, in terms of both latency and answer quality. They must be incentivized appropriately to provide quality answers despite varying expertise, varying diligence, and even malicious behavior. The AMPLab is addressing these issues in all phases of the analytics lifecycle.
Events
- AMPLab Open House at EECS BEARS Conf. 2/23/12 (Registration Required)
- [AMPLab Cloud Seminar] 2/27/12 11am, Chris Olston, Google on Programming and Debugging Large-Scale Data Workflows
- [Database Seminar] 2/21/12 11:30 Peter Bailis, UC Berkeley on PBS: Probabilistically Bounded Staleness
Featured Project
Performance-Insightful Query Language (PIQL)
Web-scale systems are susceptible to “Success Disasters”, where rapid growth in popularity and data volume can break a previously working data architecture. Unfortunately, a primary culprit is the data independence provided by declarative systems such as relational databases. High-level programming interfaces, while facilitating agile development, have the negative effect of hiding latent scalability bugs from the programmer. This obfuscation of performance problems is one factor that has led some developers to eschew relational databases in favor of imperative code written against distributed key/value stores. Doing so, however, results in the loss of the many important benefits of data independence and abstraction.
As an alternative, we propose PIQL (pronounced "pickle"), a declarative system that provides a new form of data independence we call scale independence. The PIQL compiler calculates an upper bound on the number of key/value store operations that will be performed for any query. Coupled with a Service Level Objective (SLO) compliance prediction model and PIQL’s scalable database architecture, these bounds make it easy for developers to build success-tolerant systems that support an arbitrarily large number of users while still meeting their performance goals.
The PIQL language and its implementation are described in this paper, which will appear in VLDB 2012.
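As a rough illustration of what bounding the number of key/value store operations can mean, the sketch below walks a toy query plan and multiplies out per-step limits. It is not PIQL's actual compiler or cost model: the Step class, the max_fanout field, the bound_operations function, and the assumption that each returned row costs one key/value operation are all hypothetical and chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Step:
    """One step of a hypothetical query plan over a key/value store."""
    name: str
    # Most rows this step can return per input row; None means no limit is known.
    max_fanout: Optional[int]


def bound_operations(plan: Sequence[Step]) -> Optional[int]:
    """Return an upper bound on key/value operations, or None if unbounded.

    Assumption (not from the source): every row a step returns costs one
    key/value operation, and the query starts from a single lookup key.
    """
    total_ops = 0
    rows_in = 1
    for step in plan:
        if step.max_fanout is None:
            # No static limit on this step, so no bound exists for the query.
            return None
        rows_out = rows_in * step.max_fanout
        total_ops += rows_out
        rows_in = rows_out
    return total_ops


# Hypothetical "timeline" query: one user, at most 50 subscriptions,
# at most 10 recent posts per subscription.
timeline = [
    Step("lookup user", 1),
    Step("subscriptions (limit 50)", 50),
    Step("recent posts (limit 10)", 10),
]
bounded = bound_operations(timeline)

# The same query without a limit on subscriptions has no static bound.
unbounded = bound_operations([Step("lookup user", 1), Step("all subscriptions", None)])
```

Under these assumptions the bound depends only on the declared limits, not on how many users or posts the store holds, which is the intuition behind scale independence. A query for which the sketch returns None is one whose cost could grow with data volume.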