Crowdsourcing and bursts in human dynamics

Crowdsourcing labor markets like Amazon Mechanical Turk are now being used by a number of companies. For example, CastingWords uses Mechanical Turk to transcribe audio and video files, and CardMunch uses the crowd to enter the contents of business cards into iPhone address books.

But how long will it take CardMunch to convert a batch of 200 cards into a digital address book, and how long should CastingWords wait to get back the transcriptions of 100 audio files from Mechanical Turk?

This is a stochastic process. First, it depends on how many workers (a.k.a. Turkers) are available on Mechanical Turk. Second, it depends on what other competing tasks are available on the market. Nobody wants to transcribe a long audio file for $6 if they can work on easier $6 tasks. In other words, everybody has some kind of utility function in mind that they would like to maximize. Some want to earn more money in a shorter time; others may want to work on interesting jobs. Regardless of the form of this utility function, workers are less likely to work on your task if there are more appealing tasks on the market. That's what we are as humans: selfish utility maximizers!

An interesting property of the Mechanical Turk market is that workers select their tasks from two main queues: “Most recently posted” and “Most HITs available”. This leads to an interesting pattern in task completion times: they are distributed according to a power law. While many tasks get completed relatively quickly, some others starve. When you post a task with 100 subtasks, say 100 business cards that need to be digitized, the task appears at the top of the “new tasks” list, more people notice it, and more people work on it. You may get 70 of your subtasks completed in a few minutes. But as other requesters post their tasks on the market, your task slides down the new-tasks list and fewer people see it. The completion rate drops dramatically. By the time your task falls to the fifth page of the list, you will be lucky if people notice it at all. This is why many requesters consider canceling their task at this point and reposting it, just to appear on the first page of the task list again.

Wherever we see such priority queues (in this case, the “new tasks” list), we see similar behavior in task completion times. In fact, we all have an intuitive understanding of the effect of a priority queue: we click on the top link on the first page of Google search results far more often than a link that appears on the 10th page. See Barabasi's article in Nature on this phenomenon (and the supplementary material on this website if you are interested in the underlying queuing theory).
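To make the queue effect concrete, here is a minimal simulation sketch in Python. It is not a model fit to real Mechanical Turk data; the posting rate, base completion rate, and visibility decay are made-up parameters chosen only to illustrate how a batch that sinks down the “new tasks” list completes many subtasks early and then starves.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_batch(n_subtasks=100, new_tasks_per_hour=20.0, base_rate=40.0, max_hours=500):
    """Toy model: a batch's hourly completion rate is proportional to its visibility,
    which decays as newer tasks are posted above it.  All parameters are illustrative."""
    completion_times = []
    remaining = n_subtasks
    for hour in range(1, max_hours + 1):
        rank = 1.0 + new_tasks_per_hour * (hour - 1)   # how far the batch has sunk in the list
        rate = base_rate / rank                         # expected completions this hour
        done = min(remaining, rng.poisson(rate))
        completion_times.extend([hour] * done)
        remaining -= done
        if remaining == 0:
            break
    return completion_times, remaining                  # 'remaining' subtasks starved

times, starved = simulate_batch()
print(f"completed {len(times)} subtasks, {starved} starved after 500 hours")
```

Because the completion rate decays roughly like 1/t as the batch sinks, the simulated completion times are heavy-tailed: most completions arrive in the first hour, while a sizable fraction of subtasks never finish within the simulated horizon.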

Human behavior is hard to predict, and to answer our completion-time question we need to unlock this fascinating long-tail complexity. This is an ongoing research effort in the community. So far I have approached the problem from two different perspectives: first, using a statistical technique called survival analysis (specifically, the Cox proportional hazards model); and second, modeling Turkers as utility maximizers. Interested readers can refer to this paper. I'd like to highlight that this work is still at an early stage, and I will write more about extensions of it in future blog posts.
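For readers who want to experiment with the survival-analysis perspective themselves, here is a minimal sketch using the open-source lifelines package. The data is synthetic and the single covariate (the HIT reward) is hypothetical; this is not the model or feature set from the paper.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
reward = rng.choice([0.05, 0.10, 0.25], size=n)              # hypothetical covariate: HIT reward in USD
hours = rng.exponential(scale=2.0 / (reward * 10), size=n)   # assume higher reward -> faster pickup
observed = hours < 24.0                                       # we stop watching after a day (censoring)

hits = pd.DataFrame({
    "hours_to_completion": np.minimum(hours, 24.0),
    "completed": observed.astype(int),                        # 0 = still open when we stopped watching
    "reward_usd": reward,
})

cph = CoxPHFitter()
cph.fit(hits, duration_col="hours_to_completion", event_col="completed")
cph.print_summary()   # hazard ratios indicate how the covariate speeds up or slows down completion
```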

Using program analysis to tame big data configuration

One of the major problems with big data systems is that they can be devilishly hard to manage and configure. One approach we’re exploring in the AMP lab is to use program analysis as a tool to automatically reason about programs and infer what configuration options they have. The analysis takes a Java program as input and spits out a list of all the configuration options, where in the code they are read, and what type they likely have.
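The actual analysis works on the program itself rather than on its text, but as a rough illustration of the kind of report it produces, here is a toy Python script that greps Java sources for Hadoop-style Configuration getters (get, getInt, getBoolean, and so on) and guesses each option's type from the getter name. Unlike the real static analysis, a textual scan like this misses options whose names are built up programmatically.

```python
import re
import sys
from pathlib import Path

# Map Hadoop-style Configuration getters to the type they imply.
GETTER_TYPES = {
    "get": "string", "getInt": "int", "getLong": "long",
    "getFloat": "float", "getBoolean": "boolean", "getStrings": "string list",
}
CALL = re.compile(r'\.(get(?:Int|Long|Float|Boolean|Strings)?)\(\s*"([^"]+)"')

def scan(java_root):
    """Yield (option name, inferred type, file, line) for every literal config read found."""
    for path in Path(java_root).rglob("*.java"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for getter, option in CALL.findall(line):
                yield option, GETTER_TYPES.get(getter, "unknown"), path, lineno

if __name__ == "__main__":
    for option, typ, path, lineno in scan(sys.argv[1]):
        print(f"{option:50s} {typ:12s} {path}:{lineno}")
```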

I've been spending my summer at Cloudera [a lab affiliate] applying this work in the real world. It has been successful, and aspects of it are being used in several ways. Perhaps most importantly, we use it to analyze customer configurations for problems. We've also used it to find configuration-related bugs in Hadoop, such as undocumented and incorrectly documented options. It was also used to provide a central listing of options to guide developers of Cloudera's add-on management tools.

A full description is available on the Cloudera blog. See also the paper “Static Extraction of Program Configuration Options”, presented at ICSE '11 this past May.

The importance of provenance for cancer genomics

(Co-written with Michael Armbrust)

In order to motivate our research agenda, the AMP Lab has partnered with several other groups that have “big data” problems, including traffic prediction, urban simulation, and cancer genomics.  For example, we are collaborating with Taylor Sittler, a computer-savvy pathology resident at UCSF, on problems such as providing efficient access to and storage of sequencing data, as well as extracting insights from a variety of clinical datasets.  While this collaboration is still in the early phases, students in the lab have already started to port segments of the sequencing pipeline to take advantage of the computation speed provided by Spark, and we are planning to offer a graduate seminar on the topic this fall.

Genomics data comes from a vast number of sources, varies substantially in schema and semantics, and is being produced at an ever-increasing rate (for example, by TCGA and the 1000 Genomes Project).  Additionally, the time to answer questions about what treatment a patient should receive must be very short if patients are to be given effective care.  These factors make working with genomic data a great fit for the goals of the lab.

We recently came across an article in the New York Times about a disappointment in research on cancer genomics.  Some researchers at Duke University created tests to determine which treatments a particular patient should receive based on the tumor’s genetics.  Statisticians asked to check into the data analysis found glaring errors.  However, no one paid much attention to their findings until the community learned that the research’s lead author had falsified his resume.  The research was later discredited.  Tragically, a patient treated according to the tests died; her family ended up suing Duke.

For the AMP Lab, we have several takeaways from this article.  We can't expect our application partners to blindly trust the results of any big data analysis.  The article quotes a senior scientist who states that because the workflows used in genomic analysis are complex and involve different types of expertise (biology and systems), no single person can be expected to understand the whole thing.  Another scientist, one of the statisticians involved in discrediting the Duke work, points out that intuition often fails when looking at such large and diverse datasets.

Therefore, simply providing an answer (e.g., treat patient X with drug Y) is insufficient; we also need to support our conclusions.  In this case, the Duke scientists were dealing with a rather small amount of data; in fact, it could all fit in a spreadsheet, which made it fairly easy for the statisticians asked to check the work to find errors.  Spreadsheets are actually a success story in this case, as formulae make it easy to see how answers are derived from base data.

However, with big, heterogeneous data and complex workflows, it won't be so easy to find errors.  Thus, one of the goals of the AMP Lab is to develop tools that help our users debug dataflows and record result provenance, while remaining as straightforward to use as if the data were small (i.e., if it could fit in a spreadsheet).
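As a sketch of what “spreadsheet-style” provenance could look like for a dataflow, here is a toy Python wrapper, entirely our own illustration rather than an AMP Lab tool, that records for every derived value the operation and inputs that produced it, so an analyst can walk an answer back to the base data.

```python
from dataclasses import dataclass, field

@dataclass
class Tracked:
    """A value plus the recipe that produced it (like a spreadsheet cell and its formula)."""
    value: object
    op: str = "input"
    inputs: tuple = field(default_factory=tuple)

    def lineage(self, depth=0):
        """Walk the derivation tree from this result back to the base data."""
        yield "  " * depth + f"{self.op} -> {self.value!r}"
        for parent in self.inputs:
            yield from parent.lineage(depth + 1)

def apply(op_name, fn, *args):
    """Run one workflow step and remember where its output came from."""
    return Tracked(fn(*(a.value for a in args)), op_name, args)

# Toy workflow: normalize two measurements, then compute a (hypothetical) treatment score.
raw_a = Tracked(12.0, "input:sample_A")
raw_b = Tracked(3.0, "input:sample_B")
norm = apply("normalize", lambda a, b: a / b, raw_a, raw_b)
score = apply("signature_score", lambda x: x * 0.5, norm)

print("\n".join(score.lineage()))
```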

AMP – A holistic approach to making sense of Big Data

Big Data is all the rage right now – articles in newspapers and the popular press, a steady stream of books about “Super-crunchers” and “Numerati”, conferences on Big Data, new and newly-acquired companies, and the appearance of a class of workers called Data Scientists are just some of the more visible manifestations of this trend.

Why all the excitement?   First of all, the amount of data available to be analyzed continues to grow as more and more human activity moves online.   Second, continued cost improvements in processing power and storage, advances in scalable software architectures, and the emergence of cloud computing are enabling organizations of all types to consider trying to work with this data.  This combination of data and analytics capacity holds the promise of enabling data-driven decision making – a holy grail of information technology from its earliest days.

Much of the industry and research activity around Big Data is focused on scalability to address the increasing volume of data.   But data size is a problem for which there are known solutions.   The real challenges in data analytics come from other causes.   Some of the hardest problems stem from data heterogeneity.  As more and more information is collected from more and more places, making sense of that information in a unified way becomes increasingly difficult.

Another source of problems comes not from the data but from the queries or questions asked over that data.  Queries are often exploratory in nature and may be under- or over-specified.  Furthermore, as one looks for more and more needles in bigger and bigger haystacks, it becomes possible to find evidence to support all sorts of hypotheses, even if that evidence is the result of randomness rather than actual causality.   Finally, as we cast a wider net for both data and queries, we must deal with increasing ambiguity and must support decision making and predictions over incomplete, uncertain data.
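The “evidence from randomness” point is easy to demonstrate. The hypothetical sketch below correlates a purely random outcome against 1,000 purely random features; by chance alone, roughly 5% of them will look “significant” at the conventional p < 0.05 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
outcome = rng.normal(size=500)            # a "label" that is pure noise
features = rng.normal(size=(500, 1000))   # 1,000 candidate "signals", also pure noise

# Correlate every feature with the outcome and collect the p-values.
pvals = np.array([stats.pearsonr(features[:, j], outcome)[1] for j in range(1000)])
print((pvals < 0.05).sum(), "of 1000 random features appear 'significant' at p < 0.05")
```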

Addressing these problems requires more than simply scaling up existing systems and data analysis algorithms.  Instead, what is needed is a rethinking of how analytics platforms are built.   Stepping back a bit, the key resources we have for making sense of data are Algorithms, in the form of statistical machine learning and other data analytics methods; Machines, in the form of cluster and cloud computing; and People, who not only provide data and pose questions, but who can also help with understanding the data and can address the “ML-Hard” questions that are not currently solvable by computers alone.

[Figure: AMP Dimensions]

Most existing data-centric systems emphasize one of the three key resources.   The premise behind the AMPLab is that a more holistic approach is needed.   In other words, we want to attack the Big Data analytics problem by developing systems that more closely blend the resources of Algorithms, Machines, and People.

One way to think of this is as a “co-design” approach.   Rather than inventing new algorithms or designing new computing platforms in isolation, we are considering all of these resources together.   For example, the CrowdDB system invokes crowdsourcing to find missing database data, perform entity resolution, and make subjective comparisons.   CrowdDB is a combination of M (a traditional relational database system) and P (human input, obtained through Amazon Mechanical Turk and our own mobile platform), with a bit of A thrown in to do data cleaning and quality assessment.   As another example, recent progress by our machine learning group on parallelizing the Bootstrap method for confidence estimation is an instance of designing a new Algorithmic approach to cope with the emerging Machine environment of cluster computing.
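To give a flavor of the A-meets-M direction, here is a minimal sketch of bootstrap confidence estimation with the resamples farmed out to worker processes. This is plain Python/NumPy written for illustration, not the lab's implementation; the point is that the replicates are independent, which is exactly what makes the method amenable to cluster computing.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def bootstrap_replicate(args):
    """One bootstrap resample: draw n points with replacement and recompute the statistic."""
    data, seed = args
    rng = np.random.default_rng(seed)
    sample = rng.choice(data, size=len(data), replace=True)
    return sample.mean()

def parallel_bootstrap_ci(data, n_replicates=1000, workers=4, alpha=0.05):
    """Embarrassingly parallel bootstrap: replicates run on separate cores; the same
    structure maps onto independent tasks on separate cluster nodes."""
    jobs = [(data, seed) for seed in range(n_replicates)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        estimates = list(pool.map(bootstrap_replicate, jobs))
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

if __name__ == "__main__":
    data = np.random.default_rng(0).exponential(scale=2.0, size=10_000)
    print("95% CI for the mean:", parallel_bootstrap_ci(data))
```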

The challenge of this type of integrated work is to get people whose expertise typically lies along one of the A, M, and P dimensions to work together effectively on big problems.   We believe that the AMPLab is uniquely positioned to enable this collaboration, and as such, we look forward to contributing to the progress in making sense of Big Data.

Welcome to AMP Blab

Welcome to the AMPLab blog.  We'll be using this space to describe our ongoing research and results; to give reports on events in our lab, department, and elsewhere at Berkeley; and to discuss trends and news items related to Big Data in general.   We'll have contributions from all the researchers in the lab, and hopefully from some of our application and industrial partners as well.   We look forward to being a part of the growing, global conversation about data analysis in research and in practice.