Turning up the volume on big data

Scale, immediacy, and continuous improvement

The datacenter as a computer

Leveraging human intelligence and activity

Working at the intersection of three massive trends (powerful machine learning, cloud computing, and crowdsourcing), the AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data. We are creating a new generation of analytics tools to answer deep questions over dirty and heterogeneous data by extending and fusing machine learning, warehouse-scale computing, and human computation. With our application and industrial partners, we validate these ideas on real-world problems including participatory sensing, urban planning, and personalized medicine.

Machine learning (ML) provides computational procedures for turning data into information and knowledge. While it is useful to view ML as a toolbox that can be deployed for many data-centric problems, our long-term goal is more ambitious: we want to develop ML as a full-fledged engineering discipline. To accomplish this, we expand the classic notion of resource management beyond computation and space to include the data itself. Our approach is to integrate confidence estimation into all aspects of the data analytics process while increasing scale and efficiency.

Many claim that the “datacenter is the new computer,” but in reality datacenters do not provide the key services one expects of a modern computer: high-level analytics languages for data manipulation, programming languages for a wide range of tasks, file systems for efficient and scalable data access, operating systems for isolation and resource management, and debuggers to find and correct programming errors. We are developing datacenter-scale implementations of these components, so that using a datacenter for analytics becomes as easy as using a single computer is today.

People will play a key role in data-intensive applications, not simply as passive consumers of results but as active providers and gatherers of data and as solvers of ML-hard problems that algorithms cannot solve on their own. With crowdsourcing, people can be viewed as highly valuable but unreliable and unpredictable resources, in terms of both latency and answer quality. They must be incentivized appropriately to provide quality answers despite varying expertise and diligence, and even malicious behavior. The AMPLab is addressing these issues in all phases of the analytics lifecycle.

Events

  • AMPLab Open House at EECS BEARS Conf. 2/23/12 (Registration Required)

  • [AMPLab Cloud Seminar] 2/27/12, 11am: Chris Olston (Google) on Programming and Debugging Large-Scale Data Workflows

  • [Database Seminar] 2/21/12, 11:30: Peter Bailis (UC Berkeley) on PBS: Probabilistically Bounded Staleness
