An update on the “P” in AMP: Clams, Coins, and Careful Cleaning

Daniel Haas

Many of the projects in the AMPLab drive systems for large-scale data analysis to ever-increasing sophistication, speed, and scale. No matter how efficient the algorithm or how powerful the system that runs it, however, the human analyst remains a fundamental in-the-loop component of a data processing pipeline. End-to-end data analysis involves frequent iteration through human-mediated processing steps such as formulating queries or machine learning workflows, enriching and cleaning data, and manually examining, visualizing, and evaluating the output of the analysis. In some of our recent work in support of the “P” in AMP, the goal is to make such operations scalable (via approximation, machine learning, and the involvement of crowd workers on platforms like Amazon’s Mechanical Turk) and low-latency (via adapting techniques inspired by the existing distributed systems literature for human workers).

One such effort focuses on data labeling, a necessary but often slow process that impedes the development of interactive systems for modern data analysis. Despite rising demand for manual data labeling, there is a surprising lack of work addressing its high and unpredictable latency. In our latest paper (which will appear in VLDB 2016), we introduce CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling. We offer a taxonomy of the sources of labeling latency and study several large crowd-sourced labeling deployments to understand their empirical latency profiles. Driven by these insights, we comprehensively tackle each source of latency, both by developing novel techniques such as straggler mitigation and pool maintenance and by optimizing existing methods such as crowd retainer pools and active learning. We evaluate CLAMShell in simulation and on live workers on Amazon’s Mechanical Turk, demonstrating that our techniques can provide an order of magnitude speedup and variance reduction over existing crowdsourced labeling strategies.
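
To give a feel for why straggler mitigation helps, here is a minimal simulation sketch. It is not CLAMShell's implementation: it assumes a made-up latency model (10% of workers are stragglers) and models mitigation simply as sending the same task to several workers and keeping the first answer. All names and numbers are hypothetical.

```scala
import scala.util.Random

object StragglerSketch {
  // Hypothetical latency model (seconds): most workers answer quickly,
  // but roughly one in ten is a straggler who takes minutes.
  def sampleLatency(rng: Random): Double =
    if (rng.nextDouble() < 0.1) 60.0 + rng.nextDouble() * 240.0 // straggler
    else 2.0 + rng.nextDouble() * 8.0                            // typical worker

  // Latency of one task when it is sent to `replicas` workers at once
  // and the first response wins.
  def taskLatency(rng: Random, replicas: Int): Double = {
    require(replicas >= 1, "need at least one worker per task")
    Seq.fill(replicas)(sampleLatency(rng)).min
  }

  // Mean and variance of task latency over `tasks` simulated tasks.
  def simulate(tasks: Int, replicas: Int, seed: Long): (Double, Double) = {
    val rng = new Random(seed)
    val xs = Seq.fill(tasks)(taskLatency(rng, replicas))
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    (mean, variance)
  }
}
```

Comparing `StragglerSketch.simulate(5000, 1, 42L)` with `StragglerSketch.simulate(5000, 3, 42L)` should show both the mean and the variance dropping under this toy model, at the cost of paying for redundant work.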

A related effort looks at the problem of identifying and maintaining good, fast workers from a theoretical perspective. The CLAMShell work showed that organizing workers into pools of labelers can dramatically speed up labeling time—but how does one decide which workers should be in the pool? Because workers hired on marketplaces such as Amazon’s Mechanical Turk may vary widely in skill and speed, identifying high-quality workers is an important challenge. If we model each worker’s performance (e.g. accuracy or speed) on a task as drawn from some unknown distribution, then quickly selecting a good worker maps well onto the most biased coin problem, an interesting statistical problem that asks how many total coin flips are required to identify a “heavy” coin from an infinite bag containing both “heavy” and “light” coins. The key difficulty of this problem lies in distinguishing whether the two kinds of coins have very similar weights, or whether heavy coins are just extremely rare. In this work, we prove lower bounds on the problem complexity and develop novel algorithms to solve it, especially in the setting (like on crowdsourcing markets) where we have no a priori knowledge of the distribution of coins. Our next steps will be to run our new algorithms on crowd workers and demonstrate that we can identify good workers efficiently.
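
For intuition, the sketch below implements a naive baseline for the most biased coin problem. It is not one of the algorithms from this work: it draws a coin, flips it a fixed number of times, and accepts it if the empirical heads rate clears a threshold. The bag composition, flip budget, and threshold are illustrative assumptions, and choosing a good threshold requires exactly the prior knowledge about the coin distribution that the crowdsourcing setting lacks.

```scala
import scala.util.Random

object BiasedCoinSketch {
  // A coin is represented by its probability of landing heads.
  // The "infinite bag" is modeled by sampling a fresh coin on every draw.
  final case class Bag(heavyBias: Double, lightBias: Double, heavyFraction: Double) {
    def draw(rng: Random): Double =
      if (rng.nextDouble() < heavyFraction) heavyBias else lightBias
  }

  // Draw coins until one passes the test. Returns the bias of the accepted
  // coin and the total number of flips spent across all coins tried.
  @annotation.tailrec
  def findHeavy(bag: Bag, flipsPerCoin: Int, threshold: Double,
                rng: Random, flipsSoFar: Long = 0L): (Double, Long) = {
    val bias = bag.draw(rng)
    val heads = (1 to flipsPerCoin).count(_ => rng.nextDouble() < bias)
    val used = flipsSoFar + flipsPerCoin
    if (heads.toDouble / flipsPerCoin >= threshold) (bias, used)
    else findHeavy(bag, flipsPerCoin, threshold, rng, used)
  }
}
```

With a bag such as `Bag(0.9, 0.5, 0.05)`, the total flip count is dominated by the light coins that have to be tested and rejected, which is why rare heavy coins make the problem expensive.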

Finally, we are investigating systems for data analysis with crowdsourcing directly built in. The SampleClean project uses a combination of statistics and the crowd to support a usable data cleaning system that links crowd workers to the BDAS stack via our declarative crowd framework AMPCrowd. Our recent demo paper (in VLDB 2015) describes the system, and our most recent work focuses on data cleaning in the presence of privacy concerns. Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, for consistent result estimation, all existing mechanisms for differential privacy assume that the underlying dataset is clean. Raw data are often dirty, and this places the burden of potentially expensive data cleaning on the data provider. In a project called PrivateClean, we explore differential privacy mechanisms that are compatible with downstream data cleaning. The database gives the analyst a randomized view of the data, which she can not only read but also modify, and PrivateClean reconciles the interaction between the randomization and data manipulation. Our work on PrivateClean is under revision for SIGMOD 2016.

In summary, the “P” in AMP is about providing experts with practical tools to make the manual parts of data analysis more efficient at scale. If you’re interested in learning more, please reach out.

Do Computer Scientists Hold the Key to Treating Cancer?


An ACM-sponsored blog post with this title was published in the Huffington Post today. The version published there was, by accident, not the latest draft; the latest draft is below.

This ancient assassin, first identified by a pharaoh’s physician, has been killing people for more than 4600 years. As scientists found therapies for other lethal diseases—such as measles, influenza, and heart disease—cancer moved up this deadly list and will soon be #1; 40% of Americans will face cancer during their lifetimes, with half dying from it. Most of us ignore cancer until someone close is diagnosed, but instead society could zero in on this killer by recording massive data to discover better treatments before a loved one is in its cross hairs.

We now know that cancer has many subtypes, but they all involve unlimited cell growth caused by problems in DNA. Some people are born with precarious DNA, and others acquire it later. When a cell divides, sometimes it miscopies a small amount of its DNA, and these errors can overwhelm a cell’s defenses to cause cancers. Thus, you can get it without exposure to carcinogens. Cigarettes, radiation, asbestos, and so on simply increase the copy error rate. Speaking figuratively, every time a cell reproduces, we roll the dice on cancers, and risky behavior increases cancers’ chances.

Most cancer studies today use partial genomic information and have under 1000 patients. One wonders whether their conclusions wouldn’t improve if they used complete genomes and increased the number of patients by factors of 10-100.

Given cancer’s gravity and nature, shouldn’t scientists be able to decode full genomes inexpensively to fight this dreaded disease better informed? Now they can! The plot below shows the dropping cost of sequencing a genome since 2001.

Moore’s Law, which drives the information technology revolution, delivered a 100-fold improvement in 15 years, yet the cost to sequence a genome has dropped 100,000-fold to $1000 per genome, which many consider the tipping point of affordability.

Dropping cost of sequencing

This graph should be a call to arms for computer scientists, as the war on cancer will require Big Data. If the 1.7 million Americans who will get cancer in 2016 were to have their healthy cells and tumor cells sequenced, it would yield 1 exabyte (1 billion times 1 billion bytes) of raw data. The UC Berkeley AMPLab, collaborating with Microsoft Research and UC Santa Cruz, joined the battle in 2011, which we announced in a New York Times essay. We have been championing cloud computing and open-source software ever since.

The good news is that the same technology that can decode cancer tumors can identify unknown pathogens, allowing software our collaboration developed to help save a life. A teenager went to medical specialists repeatedly and was eventually hospitalized for five weeks without a successful diagnosis. He was placed in a medically induced coma after developing brain seizures. In desperation, the doctors sent a spinal fluid sample to UCSF for genetic sequencing and analysis. Our program first filtered out the human portion of the DNA data, which was 99.98% of the original three million pieces of data, and then sequenced the remaining pathogen. In just two days total, UCSF identified a rare infectious bacterium. After the boy was treated with antibiotics, he awoke and was discharged. Although our software is only part of this process, previously doctors had to guess the causative agent before testing for a contagious disease. Other hospitals and the Centers for Disease Control now use this procedure.

The bad news that we must change is that genetic repositories are still a factor of 10-100 short of having enough cancer patients to draw statistically significant results. The reason we need so many patients is that there are many cancer subtypes and that the tumors of each subtype are notoriously diverse; most are unique, so it takes numerous samples to make real progress. Here are obstacles to collecting that valuable data, despite the storage itself already being affordable and getting cheaper:

  • Who would pay? Like the chicken versus the egg debate, we don’t yet have conclusive data that show how genetic information leads to effective therapies for most cancers. Thus, despite lower costs, insurance companies won’t pay for sequencing. Although many believe it would yield bountiful insights, we can’t prove it.
  • If funding were found, would the hospitals share data? Some hospitals don’t share data to attract more patients and researchers, and many researchers consider the data private property (at least until they publish, and often after) despite funding from our tax dollars. For example, even the editor-in-chief of the New England Journal of Medicine recently referred to interested outsiders as

    “research parasites” who should not “use the data to try to disprove what the original investigators had posited.”

  • Even if hospitals and researchers were willing, would they be allowed to share data? While a cancer database will likely lead to breakthroughs, and cancer patients often are eager to donate their data to help others, medical ethicists worry more about patient privacy. Consequently, cancer studies regularly restrict data access to the official investigators of the research grant.

As Francis Collins, Director of the National Institutes of Health, said at the Davos meeting about accelerating progress on cancer:

“We need that Big Data to be accessible. It’s not enough to say that we are in a Big Data era for cancer. We also need to be in a Big Data access era.”

To make genomic data more accessible, the Global Alliance for Genomics and Health was founded in 2013 “to enable the responsible, voluntary, and secure sharing of genomic and clinical data.” While 375 organizations from 37 countries have joined, and its working groups are active, progress has been slow in actually getting organizations to share. Perhaps the main impact thus far is that the community now largely believes that such data will eventually be shared.

To make it so, not only does society need to find the funding and cut through the red tape to populate a million cancer genome repository, but we need to draft experts to design and build open-source software that leverages advances in cloud computing, machine learning, and privacy protection to make it useful. Recruitment should be easy, as there’s no more inspiring endeavor than helping save lives, including some you may know.

And the quicker it happens, the better we can fortify ourselves against this ancient assassin.

BRAS update: Air quality graphs and twitter feed


A couple of months ago, I posted about the Bombay Real time Air quality Sensing (BRAS) project, which is using a Development Impact Lab (DIL) Explore grant to set up an air quality monitoring network in Bombay.

Thanks to some great work by Shubham Jain, we now have the SDB timeseries stack (Giles -> UPMU plotter -> BtrDB) installed and displaying historical air quality data from 9 locations in Bombay. The Dylos devices that we have sent over are still being calibrated, so these are the readings from the official network set up by the Indian Government (SAFAR). The readings are from the really expensive BAM monitors, so we are considering using them as ground truth for field calibration of Dylos sensors.

The official SAFAR website does not offer historical readings, but our server makes it possible to track and perform analysis on long-term trends.


Although we have only been reading their streams for the past few months, we can already see some big spikes. Some of the spikes can be explained – the spike in almost every stream on Wed, 11th Nov was probably due to Diwali – while some of them, like the giant spike in the BKC stream on the night of Sat, 7th Nov or the one in the Mazagaon stream on 28th Nov, or the one in the Bhandup stream on 17th Jan, have no current explanation.


We also notice that the streams appear to be periodically turned off, for example, from the night of Wed, 11th Nov to the night of Tuesday, 17th Nov. Shubham verified that there was no error in our script during that time; the SAFAR server was in fact reporting the exact same value for that entire period. We see similar gaps around the New Year (27th Dec to 10th Jan).
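
Gaps like these can be flagged automatically by looking for long runs of identical readings. The following is a minimal sketch of that idea, not the script we actually run; the `Reading` type and the run-length threshold are assumptions made for illustration.

```scala
object FlatlineSketch {
  // One reading: a timestamp (e.g. epoch seconds) and the reported value.
  final case class Reading(timestamp: Long, value: Double)

  // Return (start, end) timestamps of every run in which the reported value
  // stays exactly the same for at least `minLength` consecutive readings.
  def stuckRuns(readings: Seq[Reading], minLength: Int): Seq[(Long, Long)] = {
    // Group consecutive equal values; each run is built newest-first.
    val runs = readings.foldLeft(List.empty[List[Reading]]) {
      case (current :: done, r) if current.head.value == r.value =>
        (r :: current) :: done
      case (acc, r) =>
        List(r) :: acc
    }
    runs.reverse
      .filter(_.size >= minLength)
      .map(run => (run.last.timestamp, run.head.timestamp))
  }
}
```

A real stream would need a threshold long enough that a sensor legitimately reporting a steady value for a short while is not flagged.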


Shubham has also hooked these streams up to Twitter, similar to the US embassy Twitter feed in Beijing. The feed reports both the raw value and the AQI value as per the Indian AQI standards developed by the Indian Central Pollution Control Board (CPCB). As we can see, the air quality in all areas of Bombay has consistently ranged between Poor and Very Poor. Please consider subscribing to the feed and also sharing it with your friends and family. It would be great to get followers from Bombay since they are the ones who are most directly affected. I would also like to put in a plug for the official SAFAR mobile app (Android), which provides both current readings and predictions for the air quality in upcoming days.


Finally, Shubham has made the data that we have collected available through a REST API. This basically opens up the SAFAR data and makes it available for use by the general community.

Shubham is now focusing on calibrating the low-cost Dylos monitors that we sent over. We propose to perform the initial calibration in a controlled setting under the guidance of Prof. Virendra Sethi from the IIT Bombay Environmental Engineering Department. Depending on what Prof. Sethi recommends, we may deploy the units at the same locations as the SAFAR units so that we can compare against the readings from the more expensive machines.
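
One simple way to use co-located readings is to fit a line that maps the low-cost sensor's raw output onto the reference monitor's values. The sketch below assumes such a linear relationship purely for illustration; the actual calibration procedure has not been decided, and the function names are hypothetical.

```scala
object CalibrationSketch {
  // Ordinary least-squares fit of: reference = slope * raw + intercept.
  // `raw` holds low-cost sensor readings, `reference` the paired readings
  // from the co-located reference monitor.
  def fitLinear(raw: Seq[Double], reference: Seq[Double]): (Double, Double) = {
    require(raw.size == reference.size && raw.size >= 2, "need paired readings")
    val n = raw.size.toDouble
    val meanX = raw.sum / n
    val meanY = reference.sum / n
    val sxy = raw.zip(reference).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = raw.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0.0, "raw readings must not all be identical")
    val slope = sxy / sxx
    (slope, meanY - slope * meanX)
  }

  // Apply a fitted calibration to a new raw reading.
  def calibrate(rawValue: Double, slope: Double, intercept: Double): Double =
    slope * rawValue + intercept
}
```

The residuals of such a fit would give a first estimate of the error parameters for each unit.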

Finally, once we have determined the error parameters, we will do an initial deployment at IIT Bombay, and begin contacting colleges in Bombay to see which of them would like to join the air quality monitoring revolution. Stay tuned for further details…

This work was supported by the Development Impact Lab at UC Berkeley (USAID Cooperative Agreement AID-OAA-A-13-00002 and AID-OAA-A-12-00011), part of USAID’s Higher Education Solutions Network.

The New Yorker uses NEXT to crowd source the next caption contest winner

Kevin Jamieson

Each week, the New Yorker magazine runs a cartoon caption contest where readers are invited to submit a caption for a cartoon. And each week, Bob Mankoff, cartoon editor of the magazine, and his staff sort through thousands of submissions to find the funniest entry. To speed this process up and further engage readers, Mankoff and the New Yorker enlisted the help of NEXT, a research project that uses crowdsourced data and active learning algorithms, to help choose the winner.

Bob Mankoff explains how NEXT has changed how the winner is chosen:

Help choose this week’s winner:

NEXT is a cloud-based machine learning system I developed in my PhD work at the University of Wisconsin – Madison with Prof. Rob Nowak and Lalit Jain that I continue to work on in my postdoc in the AMPLab. The adaptive data-collection algorithms in NEXT decide which New Yorker cartoon captions to ask participants to judge based on the observation that even after a small number of judgments, there are some captions that are clearly not funny. Consequently, our algorithms automatically stop requesting judgments for the unpromising entries and focus on trying out the ones that might get a laugh. With active learning algorithms like this, the winner can be determined from far fewer total judgments and with greater certainty than using standard crowdsourcing methods that collect an equal number of judgments for every caption (regardless of how good or bad).
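
The flavor of this idea can be sketched with a textbook successive-elimination loop. This is a toy simulation and not the algorithm NEXT runs: judgments are simulated as coin flips with a hidden "funniness" probability per caption, and the confidence radius is a standard Hoeffding-style bound chosen for illustration.

```scala
import scala.util.Random

object CaptionEliminationSketch {
  // Returns (index of the winning caption, judgments spent on each caption).
  // `funniness(i)` is the hidden probability that a judge finds caption i funny.
  def run(funniness: Vector[Double], delta: Double,
          maxRounds: Int, rng: Random): (Int, Vector[Int]) = {
    val k = funniness.size
    val sums = Array.fill(k)(0.0)
    val counts = Array.fill(k)(0)
    var active = (0 until k).toSet
    var t = 0
    while (active.size > 1 && t < maxRounds) {
      t += 1
      // Collect one more simulated judgment for every caption still in contention.
      for (i <- active) {
        sums(i) += (if (rng.nextDouble() < funniness(i)) 1.0 else 0.0)
        counts(i) += 1
      }
      // Hoeffding-style confidence radius; every active caption has t judgments.
      val radius = math.sqrt(math.log(4.0 * k * t * t / delta) / (2.0 * t))
      val bestLower = active.map(i => sums(i) / counts(i) - radius).max
      // Stop asking about captions that are confidently worse than the leader.
      active = active.filter(i => sums(i) / counts(i) + radius >= bestLower)
    }
    val winner = active.maxBy(i => sums(i) / counts(i))
    (winner, counts.toVector)
  }
}
```

In this toy setting, clearly unfunny captions are dropped after relatively few judgments, so the total is far below what asking for `maxRounds` judgments on every caption would cost.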

To learn more about NEXT and how it can help with your application, visit our project page:

Looking Back at AMP Year 5

Michael Franklin

It’s that time when we take a look back at what we’ve done over the past year and look forward to next year.  For AMPLab, it’s a particularly good point at which to do this because we are wrapping up the 5th year of what we originally intended to be a 5 year project.   As you may know, in 2012 we received a 5-year “Expeditions in Computing” Award from the National Science Foundation, so we extended our project to run until the end of 2016.   Thus, we’re heading into the final full year of the AMPLab project, and we’re working on planning what we want to do next.   We’ll be saying more about those plans as they develop, and those of you attending our Winter retreat meeting in Lake Tahoe next month will contribute to discussions on the “Next Lab”.   In the meantime, let’s take a quick look at what AMPLab accomplished in 2015.

BDAS Software – We continued to innovate around our core platform, the Berkeley Data Analytics System (BDAS).  This past year we released important new components of BDAS such as the KeystoneML machine learning pipeline system, the Succinct compressed storage and search system, Splash for parallelizing stochastic learning algorithms, and SampleClean/AMPCrowd for human-in-the-loop data cleaning and labeling.  We’ve also made improvements to GraphX, MLlib, and Tachyon, among others.

Students – Our students and their research continued to be recognized with awards at top conferences and elsewhere.  One result that we are extremely proud of is that 2 of the 3 winners of this year’s ACM Dissertation Award were from AMPLab: Matei Zaharia and John Duchi.  These awards recognize outstanding Ph.D. work in Computer Science from all of the CS departments worldwide.  It is rare to have two winners from the same department. It is unheard of to have two winners from the same lab.  We also had some best paper awards and a CACM Research Highlight selection as noted in the news section of the AMPLab web site.  And of course, we continued to publish papers in the top conferences in Systems, Databases, Machine Learning, Networking, etc.   Have a look at our Publications page to see a pretty impressive list.  Our graduates received job offers from all the top Academic institutions in CS and of course, are in tremendous demand by companies of all sizes.

Industry Impact – AMPLab-born software is driving innovation in the Big Data industry.  Spark, Mesos, and Tachyon all have large groups of contributors and are used in production across a wide range of industries.  A recent salary survey by O’Reilly indicated that knowledge of Spark provided the highest increase in median salary across all of the Big Data technologies they studied, providing a bigger boost even than getting a Ph.D. (we try not to emphasize this point with our students!).  AMPLab sponsors have made large bets based on our software, including recent announcements from IBM, AWS, SAP and others.  Another interesting factoid: at the start of the year, there were an impressive 43 Apache Spark meetup groups listed around the world, with a total of over 12,000 members.   As of this writing (less than a year later), there are 132 groups with 53,520 members.  There are now Spark meetups on every continent except (as far as we know) Antarctica.

Big Data and Data Science – AMPLab also played an important role in new research initiatives on campus and nationally.  For example, Berkeley recently was named to host the NSF West Big Data Innovation Hub and AMPLab will anchor a large part of Berkeley’s involvement in the hub.  Also, Lawrence Berkeley National Lab, in conjunction with Cray, is integrating BDAS with more traditional High Performance Computing infrastructure.  The convergence of Big Data and HPC is a key pillar of the National Strategic Computing Initiative recently announced by President Obama.

AMPCamp and Beyond – We continue to host successful outreach events such as AMPCamp 6, which was held earlier in November and AMPCamps in Shanghai (hosted by Intel) and Beijing (hosted by Microsoft).  AMPLab faculty have spoken at Davos, published in Science, and have opined on Big Data topics in a host of major media outlets.

The above is just a sampling of what we did during 2015. Please visit the AMPLab web site to dig deeper and keep up with the latest developments.

Best wishes from AMPLab for a happy, healthy and productive New Year.

Passing the Baton


I’m turning 68 today; 4 of my 6 Ph.D. students are graduating; and the start of the 5-year successor to the AMP and ASPIRE Labs is imminent. Thus, I’ve decided that the time is right to retire from UC Berkeley to allow a recent PhD with a fresh perspective to fill my position in the next great project.

While I’m open to interesting new challenges, starting in July my tentative plan (which sounds pretty good) is graduating my remaining students while continuing to coach interested faculty, revise textbooks, consult, travel, play soccer on Sundays, and attend the free faculty lunch on Mondays.

Reflecting upon the past 4 decades, when I joined UC Berkeley in 1976, we clearly trailed what was then called the “Big 3”: Stanford, MIT, and CMU.

Today, we don’t.

I believe we’ve leapfrogged the competition because of:
• Our practice of attracting and nurturing great young faculty;
• Our radical teamwork on research projects, which no other top program enjoys; and
• Our tradition of working in ways that are best for the department versus for our area or for ourselves.

Not only have we advanced without sacrificing undergraduate education, we’ve actually increased both the number of students per capita and the quality of their education, thereby fulfilling the UC mission of helping Californians to achieve the American dream. That mission enabled my 2 sons, my 2 sisters, and me to earn 9 degrees from 5 UC campuses.

The extraordinary UC Berkeley students, faculty, and staff clearly made my career. I thank them for 40 years of inspiration and collaboration.

Succinct on Apache Spark: Queries on Compressed RDDs

Rachit Agarwal

tl;dr Succinct is a distributed data store that supports a wide range of point queries (e.g., search, count, range, random access) directly on a compressed representation of the input data. We are very excited to release Succinct as an Apache Spark package that enables search, count, range and random access queries on compressed RDDs. This release allows users to use Apache Spark as a document store (with search on documents) similar to ElasticSearch, a key-value interface (with search on values) similar to HyperDex, and an experimental DataFrame interface (with search along columns in a table). When used as a document store, Apache Spark with Succinct is 2.75x faster than ElasticSearch for search queries while requiring 2.5x lower storage, and over 75x faster than native Apache Spark.

Succinct on Apache Spark Overview

Search is becoming an increasingly powerful primitive in big data analytics and web services. Many web services support some form of search, including LinkedIn search, Twitter search, Facebook search, Netflix search, airlines, and hotels, as well as services specifically built around search, such as Google, Bing, and Yelp, to name a few. Apache Spark supports search via full RDD scans. While fast enough for small datasets, data scans become inefficient as datasets become even moderately large. One way to avoid data scans is to implement indexes, but indexes can significantly increase the memory overhead.
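
To make the scan-versus-index tradeoff concrete, here is a small plain-Scala sketch that runs on an in-memory string and needs neither Spark nor Succinct. It is only an illustration and says nothing about how Succinct itself is implemented: the scan keeps no extra state but touches the whole text per query, while the suffix-array index answers queries by binary search but stores one extra integer per character.

```scala
object ScanVsIndexSketch {
  // Scan-based search: check every position of the text for each query.
  def scanSearch(text: String, query: String): Seq[Int] = {
    require(query.nonEmpty, "query must not be empty")
    (0 to text.length - query.length).filter(i => text.startsWith(query, i))
  }

  // Index-based search uses a suffix array: all suffix start offsets,
  // sorted by the suffix they point to. This is the extra memory.
  // (Building it by sorting substrings is simple but not efficient.)
  def buildSuffixArray(text: String): Array[Int] =
    text.indices.toArray.sortBy(i => text.substring(i))

  def indexSearch(text: String, sa: Array[Int], query: String): Seq[Int] = {
    require(query.nonEmpty, "query must not be empty")
    // Compare the suffix at `pos`, truncated to the query length, with the query.
    def cmp(pos: Int): Int =
      text.substring(pos, math.min(text.length, pos + query.length)).compareTo(query)
    // Binary search for the first suffix-array slot where `pred` holds.
    def lowerBound(pred: Int => Boolean): Int = {
      var lo = 0
      var hi = sa.length
      while (lo < hi) {
        val mid = (lo + hi) / 2
        if (pred(cmp(sa(mid)))) hi = mid else lo = mid + 1
      }
      lo
    }
    val start = lowerBound(_ >= 0) // first suffix that is >= query
    val end = lowerBound(_ > 0)    // first suffix that is > query
    sa.slice(start, end).sorted.toSeq
  }
}
```

Both functions return the same offsets for a given query; the difference lies in how much work each query costs and how much memory the index occupies.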

We are very excited to announce the release of Succinct as an Apache Spark package that achieves a unique tradeoff: storage overhead no worse (and often lower) than data-scan based techniques, and query latency comparable to index-based techniques. Succinct on Apache Spark enables search (and a wide range of other queries) directly on a compressed representation of the RDDs. What differentiates Succinct on Apache Spark is that queries are supported without storing any secondary indexes, without data scans, and without data decompression: all the required information is embedded within the compressed RDD, and queries are executed directly on the compressed RDD.

In addition, Succinct on Apache Spark supports random access of records without scanning the entire RDD, a functionality that we believe will significantly speed up a large number of applications.

An example

Consider a collection of Wikipedia articles stored on HDFS as a flat unstructured file. Let us see how Succinct on Apache Spark supports the above functionalities:

// Import relevant Succinct classes
import edu.berkeley.cs.succinct._ 

// Read an RDD as a collection of articles; sc is the SparkContext
val articlesRDD = sc.textFile("/path/to/data").map(_.getBytes)

// Compress the input RDD into a Succinct RDD, and persist it in memory
// Note that this is a time consuming step (usually at 8GB/hour/core) since data needs to be compressed. 
// We are actively working on making this step faster.
val succinctRDD = articlesRDD.succinct.persist()

// SuccinctRDD supports a set of powerful primitives directly on compressed RDD
// Let us start by counting the number of occurrences of "Berkeley" across all Wikipedia articles
val count = succinctRDD.count("Berkeley")

// Now suppose we want to find all offsets in the collection at which "Berkeley" occurs, and
// create an RDD containing all resulting offsets
val offsetsRDD = succinctRDD.search("Berkeley")

// Let us look at the first ten results in the above RDD
val offsets = offsetsRDD.take(10)

// Finally, let us extract 20 bytes before and after one of the occurrences of “Berkeley”
val offset = offsets(0)
val data = succinctRDD.extract(offset - 20, 40)

Many more examples on using Succinct on Apache Spark are outlined here.



The figure compares the search performance of Apache Spark with Succinct against ElasticSearch and native Apache Spark. We use a 40GB collection of Wikipedia documents over a 4-server Amazon EC2 cluster with 120GB RAM (so that all systems fit in memory). The search queries use words with varying number of occurrences (1–10,000) with uniform random distribution across 10 bins (1–1000, 1000-2000, etc). Note that the y-axis is on log scale.

Interestingly, Apache Spark with Succinct is roughly 2.75x faster than ElasticSearch. This holds even though ElasticSearch does not incur the overhead of Apache Spark’s job execution and has all of its data in memory. Succinct achieves this speedup while requiring roughly 2.5x lower memory than ElasticSearch (due to compression, and due to storing no additional indexes)! Succinct on Apache Spark is over two orders of magnitude faster than Apache Spark’s native RDDs due to avoiding data scans. Random access on documents has similar performance gains (with some caveats).

Below, we describe a few interesting use cases for Succinct on Apache Spark, including a number of interfaces exposed in the release. For more details on the release (and Succinct in general), usage, and benchmark results, please see the Succinct webpage, the NSDI paper, or a more detailed technical report.

Succinct on Apache Spark: Abstractions and use cases

Succinct on Apache Spark exposes three interfaces, each of which may have several interesting use cases. We outline some of them below:

  • SuccinctRDD
    • Interface: Flat (unstructured) files
    • Example application: log analytics
    • Example: one can search across logs (e.g., errors for debugging), or perform random access (e.g., extract logs at certain timestamps).
    • System with similar functionality: Lucene
  • SuccinctKVRDD
    • Interface: Semi-structured data
    • Example application: document stores, key-value stores
    • Example: 
      • (document stores) search across a collection of Wikipedia documents and return all documents that contain, say, string “University of California at Berkeley”. Extract all (or a part of) documents.
      • (key-value stores) search across a set of tweets stored in a key-value store for tweets that contain “Succinct”. Extract all tweets from the user “_ragarwal_”.
    • System with similar functionality: ElasticSearch
  • (An experimental) DataFrame interface
    • Interface: Search and random access on structured data like tables
    • Example applications: point queries on columnar stores
    • Example: given a table with schema {userID, location, date-of-birth, salary, ..}, find all users who were born between 1980 and 1985.
    • Caveat: We are currently working on some very exciting projects to support a number of additional SQL operators efficiently directly on compressed RDDs.

When not to use Succinct on Apache Spark

There are a few applications that are not suitable for Succinct on Apache Spark: long sequential reads, and searches for strings that occur very frequently (you may not want to search for “a” or “the”). We outline the associated tradeoffs on the Succinct webpage as well.

Looking Ahead

We at AMPLab are working on several interesting projects to make Succinct on Apache Spark more memory efficient, faster and more expressive. To give you an idea about what is next, we are going to close this post with a hint on our next post: executing Regular Expression queries directly on compressed RDDs. Stay tuned!

A new “Hub” for Big Data Innovation

Michael Franklin

The National Science Foundation has launched a new program creating four regional “Big Data Innovation Hubs”.   The goal is to foster collaborations among academic, government, industrial, and non-profit organizations to attack important societal problems where Big Data can help.   The regional aspect of the program is intended to focus collaborations on issues most relevant to particular parts of the country.  The 13-state West region Hub will initially focus on putting together efforts on managing natural resources and hazards, precision and personalized medicine, and urban environments.   We also will leverage and extend existing strengths in Big Data technologies and Data-Driven Science.

Berkeley has been chosen as one of the 3 co-leads of the West Region Hub and will be the home of the Hub’s Executive Director.   Dr. Meredith Lee recently joined Berkeley to serve in this role.   Meredith has tremendous expertise for this role, having most recently worked in Washington DC as a Science & Technology Policy Fellow at the U.S. Department of Homeland Security (DHS), focusing on topics such as graph analytics, risk assessment, machine learning, data visualization, and distributed computing.  There she worked with DHS organizations such as the Federal Emergency Management Agency (FEMA), U.S. Coast Guard, and the Transportation Security Administration (TSA) as well as a number of the key groups in Washington that coordinate and promote research in Data Science and Big Data.   Meredith is the perfect person to lead this new effort.

Existing Berkeley research centers such as AMPLab and the Berkeley Institute for Data Science will participate in Hub efforts and I have signed on as the Principal Investigator from Berkeley.   The Hub will provide opportunities to bring in people from around campus to participate in new collaborations.   We will be posting more information and announcing workshops and other events as we work with our colleagues at SDSC, UW and elsewhere to get this project off the ground.

More information can be found in this UC Berkeley Press Release.

Enabling air quality analysis using Berkeley software


Bombay Real time Air quality Sensing (BRAS) – a joint project between The Indian Institute of Technology, Bombay and the AMPLab – has recently been awarded one of 10 DIL Explore Grants, out of a field of 55 applications.

The Development Impact Lab (DIL) at Berkeley is a global consortium of research institutes, non-governmental organizations (NGOs), and industry partners committed to advancing international development through science and technology innovations. The Explore grants are intended to support early stage exploratory partnership building and research that combines technology innovation with social and economic research to solve international development challenges.

We plan to use this grant to set up a real-time air quality monitoring network in Bombay, raise awareness of air quality with the data it collects, and kickstart a discussion around the policy changes required for improvement.

Air quality in developing economies has long been a source of concern, with a particular focus on countries like India and China, which together are home to over 35% of the world’s population. This concern has led to individual data collection efforts such as the US Embassy Twitter feed for air quality in Beijing or Joshua Apte’s measurement of air quality while riding in an autorickshaw in New Delhi (if you have spent any time in India, Joshua’s YouTube video is simultaneously familiar and horrifying). These individual efforts are now being scaled up with the deployment of governmental (SAFAR, China’s official network) or crowdsourced (aqicn, world air quality) air quality monitoring networks, but much work remains to be done in both data collection and analysis.

In particular, we want to:

  • experiment with the accuracy/cost tradeoff of lower-cost sensors so that the sensors can be deployed over a wider area
  • make the data easily accessible via both visualizations and open APIs, to allow third parties to perform their own analyses on the data. Note that this is sadly lacking in current solutions.
  • engage with local colleges in three areas:
    • for easy sensor deployment – they have access to power, Ethernet, and students who can debug deployments
    • for raising awareness – students can spread information about their local air quality on social media
    • for kickstarting discussions around policy changes – students can perform analyses of data at their campus, compare their campus with other campuses, and run local challenges for improving air quality. These challenges can help motivate long-term policy changes.

The air quality monitors were dispatched from Berkeley at the end of summer. Thanks to Kattt, Boban, Carlyn and Jonahluis for helping with the ordering, packaging and shipping.


They were then stuck in Indian customs for almost a month before they made it to IIT Bombay. Here are some unboxing pics.

[Photos: unboxing the air quality monitors at IIT Bombay]

A final-year undergraduate student at IIT Bombay, Shubham Jain, is working on this as his BTech capstone project, under the guidance of Principal Investigator Prof. Bhaskaran Raman. He is currently able to read the PM2.5 data from the sensors using a Python script running on a Raspberry Pi.


Next, he is working on reading humidity and temperature streams, storing all of them in a timeseries sensor database, and converting the PM2.5 values into AQI values. The timeseries stack that he is using is the XBOS stack (Deckard -> UPMU plotter -> Giles -> BTrDB), which was developed by the UC Berkeley Software Defined Buildings group and provides scalable storage, aggregation, and visualization of scalar values from sensors, along with a rich metadata query interface.
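
For readers curious about what the PM2.5-to-AQI step involves, here is a minimal sketch in Python. It is not the project's script. It assumes the US EPA scale with the PM2.5 breakpoints from the 2012 revision (the post does not say which AQI scale the project will use, and India has its own national index), and the function name pm25_to_aqi is made up for illustration. The EPA index is defined over 24-hour average concentrations, so feeding it instantaneous readings is a simplification.

```python
from decimal import Decimal, ROUND_DOWN

# US EPA PM2.5 breakpoints (24-hour average, in µg/m³) from the 2012 revision.
# Assumption: the post does not say which AQI scale the project uses.
PM25_BREAKPOINTS = [
    # (conc_low, conc_high, aqi_low, aqi_high)
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration):
    """Convert a PM2.5 concentration in µg/m³ to an AQI value (hypothetical helper)."""
    if concentration < 0:
        raise ValueError("PM2.5 concentration cannot be negative")
    # EPA convention: truncate (not round) the concentration to one decimal place.
    truncated = float(
        Decimal(str(concentration)).quantize(Decimal("0.1"), rounding=ROUND_DOWN)
    )
    for conc_low, conc_high, aqi_low, aqi_high in PM25_BREAKPOINTS:
        if conc_low <= truncated <= conc_high:
            # Linear interpolation between the two ends of the matching band.
            slope = (aqi_high - aqi_low) / (conc_high - conc_low)
            return round(slope * (truncated - conc_low) + aqi_low)
    raise ValueError("PM2.5 concentration is beyond the top of the AQI table")


if __name__ == "__main__":
    for reading in (8.0, 35.9, 120.0):
        print(reading, "->", pm25_to_aqi(reading))
```

The breakpoint table is kept as plain data so that a different scale could be swapped in without touching the interpolation code.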

We will post regular status updates as he makes more progress…

Silicon Valley is Migrating North, Part II


My Sept 21 blog post was based on 25 of the 50 projected next “unicorns” (startups with billion-dollar valuations) in the Bay Area, according to a recent NY Times article. It provides visual evidence that promising new startups are locating much closer to Berkeley than to Palo Alto, an indicator of the (lesser-known) entrepreneurial success of UC Berkeley alumni. Berkeley even has its own incubator now.

A colleague pointed me to Fortune magazine’s recent list of already existing unicorns. There are 138 on that list, with 51 (37% overall) in the Bay Area and 30 (22%) just in San Francisco. They were founded between 1994 and 2015. To show change over time, I divided them into two equal time periods by year of founding: between 1994 and 2004 (12 total and 3 in SF), and between 2005 and 2015 (39 total and 27 in SF). I then mapped the two lists and their geographic centers (see below). Note the number in the South Bay barely changed between decades (10 to 12), but the North Bay grew nearly 10X (3 to 28).

If I weight by valuations, the data are even more skewed; of all unicorns founded since 2005, those in San Francisco represent about 33% of the value in the world and nearly 90% of the value in the Bay Area. In the prior decade they were just 7% and 25%, respectively.

Whether we compare the number of unicorns or their valuations, and the past to the present or the present to the future, all signs point to Silicon Valley migrating north. San Francisco is its new heartland, plausibly in part because UC Berkeley is just a short BART trip across the bay.

Fortune’s list of Bay Area startup unicorns founded 1994-2004

Fortune’s list of startup unicorns founded 2005-2015