Silicon Valley is Migrating North


Everyone knows that many entrepreneurs went to Stanford, but few know that UC Berkeley alumni are just as entrepreneurial.

For example, many would be surprised to learn that most startups are now locating much closer to Berkeley than to Palo Alto. A recent article in the New York Times listed the top 50 startups worldwide most likely to become the next “unicorns” (companies with billion-dollar valuations). Impressively, not only are half of them in the Bay Area, but a third are in San Francisco alone. (The rest: New York City 8, China 4, E.U. 3, Boston 2, Chicago 2, India 2, Southern California 2, Arlington 1, and Cape Town 1.)

I thought these potential unicorns might predict where future successful startups will locate, so to see where Silicon Valley is headed, I mapped the 25 Bay Area startups and their geographic center. It looks like the next heart of Silicon Valley is San Francisco. Histories of how regions like Silicon Valley are created have often pointed to the critical role of great universities. So what great technology research university is closest to San Francisco? UC Berkeley, which is just a 20-minute BART trip away.

Location of the 25 promising startups in the San Francisco Bay Area, and their geographic center, according to the 8/23/15 NY Times list of potential unicorns

Getting a dozen 20-year-olds to work together for fun and … social good


In Spring 2015, we ran an experimental course that explored the intersection of three trends related to the AMPLab mission and its role in educating students at the world’s best public university.

First, as the boundaries of computing grow, the scale and complexity of the problems that we tackle grow as well. This means that the most impactful work is done collaboratively and, as we have seen in the AMPLab, across previously separate disciplines.

Second, as David Patterson pointed out in his NYT op-ed, there is a need for long-term, sustained system building that goes beyond one-night hackathons to tackle large-scale social problems.

Third, as David Corman from the NSF pointed out in his keynote at PerCom 2015, many students are now entering EECS departments with significant prior programming expertise, and we need to develop instructional techniques that can build on that base while challenging them and deepening their understanding.

In our course, we attempted to address all of these issues by scaling up the “undergraduate assistant” concept 10-fold. We ran a small (~11-person) undergraduate research seminar that allowed students with skills across a variety of disciplines to work in small groups on various aspects of our e-mission research project.

In contrast to other undergraduate project classes, the students worked on three aspects of a single large-scale system, and the goal was to build a real, end-to-end system rather than a prototype.

The Center for Information Technology Research in the Interest of Society (CITRIS) ran a nice story on the class this week.
http://citris-uc.org/one-ph-d-students-mission-to-reduce-our-carbon-footprint/

Did it work? In many ways, yes. The exercise of preparing for the class – planning a syllabus, making the code ready for others to contribute, turning on automated tests – was a forcing function to help clarify the focus of the project. The students liked the class (rating: 6.6/7) and felt that it was useful (rating: 6.2/7). We built a system that works together end to end and generates results.

But it was definitely not perfect. The end-to-end system is startup/prototype quality: good enough to play around with possibilities, but not yet good enough to use for a real-world evaluation.

The two biggest challenges were a reflection of those faced in any team setting, but magnified by the differences between the academic and industry environments.

First, this was just one class on the students’ packed schedules, and they had to prioritize it accordingly. That made coordinating dependencies between components tricky – if student A was responsible for module A, student B for module B, and module B depended on module A, then when student A had a midterm one week, she was not going to work on module A until the midterm was over. And by the time she could get back to module A, it might be midterm time for student B.

Second, we had a strong Lake Wobegon effect – everybody needed to get an A. In a normal class setting, having some students turn in B or C level work does not affect others. But here, a student who tried to use B or C level work had to spend time cleaning it up before she could use it, which led to resentment. This is true of all project classes, but was amplified here because all project teams were working on a single system.

Added to these were a steep learning curve, different skill levels across the interdisciplinary teams, and a “testing takes too much time” attitude. We had to spend significant time and effort as the “glue” – establishing an overall framework structure, integrating all the disparate pieces at the end, and forcing the development of at least minimal test cases.

If we were to do it again, it would be interesting to structure it as a year-long course. That would help with the learning curve, and give us another semester to make the system robust and conduct a comprehensive evaluation. We would also plan on building more of the framework ourselves in order to reduce dependencies between students. It is an interesting pedagogical challenge to figure out how to do this while still giving students the experience of building something large and complex.

When might that happen? Not this Fall, for sure. How about next Spring or later? Only time will tell… :)

Doing good with Spark and Scala


At the Scala By the Bay meetup, they showed a video featuring Matt Massie, Frank Nothaft, Matei Zaharia, and me talking about helping people in general, and those with cancer in particular, by developing open-source tools for genomic data processing that rely on Spark and Scala, such as SNAP and ADAM. Here is the NY Times op-ed they refer to in the video.

Announcing Splash: Efficient Stochastic Learning on Clusters


We are happy to announce Splash: a general framework for parallelizing stochastic learning algorithms on multi-node clusters. If you are not familiar with stochastic learning algorithms, don’t worry! Here is a brief introduction to their central role in machine learning and big data analytics.

What is a Stochastic Learning Algorithm?

Stochastic learning algorithms are a broad family of algorithms that process a large dataset by sequentially processing random subsamples of it. Since their per-iteration computation cost is independent of the overall size of the dataset, stochastic algorithms can be very efficient in the analysis of large-scale data. Canonical examples include stochastic gradient descent (SGD) and its many variants, as well as sampling-based methods such as Gibbs sampling.

Stochastic learning algorithms are generally defined as sequential procedures, and as such they can be difficult to parallelize. To develop useful distributed implementations of these algorithms, there are three questions to ask:

  • How to speed up a sequential stochastic algorithm via parallel updates?
  • How to minimize the communication cost between nodes?
  • How to design a user-friendly programming interface for Spark users?

Splash addresses all three of these issues.

What is Splash?

Splash consists of a programming interface and an execution engine. You can develop any stochastic algorithm using the programming interface without considering the underlying distributed computing environment. The only requirement is that the base algorithm must be capable of processing weighted samples. The parallelization is taken care of by the execution engine and is communication-efficient. Splash is built on Apache Spark, so you can employ it to process Resilient Distributed Datasets (RDDs).
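To make the weighted-sample requirement concrete, here is a minimal, self-contained Scala sketch of a stochastic gradient descent step that processes one weighted sample at a time. The names and structure are illustrative only, not the actual Splash API; in a Splash-style framework the execution engine chooses the sample order and the weights, and the algorithm author supplies only the per-sample update rule.

// Illustrative only: a weighted-sample SGD update for logistic regression.
// In a Splash-style framework, the execution engine picks the samples and
// their weights; the algorithm author writes just this update rule.
object WeightedSgdSketch {
  case class Sample(features: Array[Double], label: Double) // label in {0, 1}

  // Process a single sample with a given weight, updating the model in place.
  def processSample(w: Array[Double], s: Sample, weight: Double, stepSize: Double): Unit = {
    val margin = w.indices.map(i => w(i) * s.features(i)).sum
    val pred = 1.0 / (1.0 + math.exp(-margin))
    val grad = pred - s.label
    for (i <- w.indices) {
      // The weight scales the update, as if the sample occurred `weight` times.
      w(i) -= stepSize * weight * grad * s.features(i)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Sample(Array(1.0, 2.0), 1.0), Sample(Array(1.0, -1.5), 0.0))
    val w = Array.fill(2)(0.0)
    for (_ <- 1 to 100; s <- data) processSample(w, s, weight = 1.0, stepSize = 0.1)
    println(w.mkString("w = [", ", ", "]"))
  }
}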

On large-scale datasets, Splash can be substantially faster than existing data analytics packages built on Apache Spark. For example, to fit a 10-class logistic regression model on the mnist8m dataset, stochastic gradient descent (SGD) implemented with Splash is 25x faster than MLlib’s L-BFGS and 75x faster than MLlib’s mini-batch SGD for achieving the same accuracy. All algorithms run on a 64-core cluster.

To learn more about Splash, visit the Splash website or read our paper. You may also be interested in the Machine Learning Packages implemented on top of Splash. We appreciate any and all feedback from early users!


Announcing SparkR: R on Spark

I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the runtime is single-threaded and can only process data sets that fit in a single machine’s memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark and, by using Spark’s distributed computation engine, allows us to run large-scale data analysis from the R shell.

Project History

The SparkR project was initially started in the AMPLab as an effort to explore different techniques to integrate the usability of R with the scalability of Spark. Based on these efforts, an initial developer preview of SparkR was first open sourced in January 2014. The project was then developed in the AMPLab for the next year and we made many performance and usability improvements through open source contributions to SparkR. SparkR was recently merged into the Apache Spark project and will be released as an alpha component of Apache Spark in the 1.4 release.

SparkR DataFrames

The central component in the SparkR 1.4 release is the SparkR DataFrame, a distributed data frame implemented on top of Spark. Data frames are a fundamental data structure used for data processing in R, and the concept has been extended to other languages with libraries like Pandas. Projects like dplyr have further simplified expressing complex data manipulation tasks on data frames. SparkR DataFrames present an API similar to dplyr and local R data frames, but can scale to large data sets using Spark’s support for distributed computation.

The following example shows some of the aspects of the DataFrame API in SparkR. (You can see the full example at https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85)

# flights is a SparkR DataFrame. We can first print the column
# names and types.
flights
# DataFrame[year:string, month:string, day:string, dep_time:string, dep_delay:string,
#   arr_time:string, arr_delay:string, ...]

# Print the first few rows using `head`
head(flights)

# Filter all the flights leaving from JFK
jfk_flights <- filter(flights, flights$origin == "JFK")

# Collect the DataFrame into a local R data frame (for plotting etc.)
local_df <- collect(jfk_flights)

For a more comprehensive introduction to DataFrames you can see the SparkR programming guide at http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html

Benefits of Spark integration

In addition to having an easy to use API, SparkR inherits many benefits from being tightly integrated with Spark. These include:

Data Sources API: By tying into Spark SQL’s data sources API, SparkR can read in data from a variety of sources, including Hive tables, JSON files and Parquet files.

Data Frame Optimizations: SparkR DataFrames also inherit all of the optimizations made to the computation engine, such as code generation and memory management. For example, the following chart compares the runtime performance of running group-by aggregation on 10 million integer pairs on a single machine in R, Python and Scala (using the same dataset as https://goo.gl/iMLXnh). From the chart we can see that the optimizations in the computation engine make SparkR’s performance similar to that of Scala / Python.

Chart: runtime of group-by aggregation on 10 million integer pairs in R (SparkR), Python and Scala
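For reference, the benchmarked query is a plain DataFrame group-by aggregation. A rough Scala sketch of the equivalent operation is below; it uses the modern SparkSession entry point for brevity (Spark 1.4-era code would go through a SQLContext instead), and the tiny inline dataset stands in for the 10 million integer pairs used in the benchmark.

// Sketch of the benchmarked operation: group-by aggregation on integer pairs.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("groupby-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the 10M integer pairs used in the benchmark.
    val pairs = Seq((1, 10), (1, 20), (2, 5)).toDF("a", "b")

    // The same logical plan backs the R, Python and Scala versions of this
    // query, which is why their performance ends up being similar.
    pairs.groupBy("a").agg(sum("b").as("sum_b")).show()

    spark.stop()
  }
}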

Scalability to many cores and machines: Operations executed on SparkR DataFrames get automatically distributed across all the cores and machines available on the Spark cluster. As a result SparkR DataFrames can be used on terabytes of data and run on clusters with thousands of machines.

Looking forward

We have many other features planned for SparkR in upcoming releases: these include support for high-level machine learning algorithms and making SparkR DataFrames a stable component of Spark.

The SparkR package represents the work of many contributors from various organizations including AMPLab, Databricks, Alteryx and Intel. We’d like to thank our contributors and users who tried out early versions of SparkR and provided feedback.  If you are interested in SparkR, do check out our talks at the upcoming Spark Summit 2015.

 

Announcing KeystoneML


We’ve written about machine learning pipelines in this space in the past. At the AMPLab Retreat this week, we released (live, on stage!) KeystoneML, a software framework designed to simplify the construction of large scale, end-to-end, machine learning pipelines in Apache Spark. KeystoneML is alpha software, but we’re releasing it now to get feedback from users and to collect more use cases.

Included in the package is a type-safe API for building robust pipelines and example operators used to construct them in the domains of natural language processing, computer vision, and speech. Additionally, we’ve included and linked to several scalable and robust statistical operators and machine learning algorithms which can be reused by many workflows.
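To give a flavor of what a type-safe pipeline API buys you, here is a small, self-contained Scala sketch in the same spirit (the trait and stage names are ours, not KeystoneML's): each stage declares its input and output types, so chaining mismatched stages fails at compile time rather than at runtime.

// Illustrative sketch of type-safe pipeline composition (not KeystoneML's API).
trait Node[A, B] { self =>
  def apply(in: A): B
  // Chaining is only legal when this node's output type matches the next
  // node's input type; mismatches are rejected by the compiler.
  def andThen[C](next: Node[B, C]): Node[A, C] = new Node[A, C] {
    def apply(in: A): C = next(self(in))
  }
}

object PipelineSketch {
  val tokenize = new Node[String, Seq[String]] {
    def apply(doc: String): Seq[String] = doc.toLowerCase.split("\\s+").toSeq
  }
  val countTokens = new Node[Seq[String], Map[String, Int]] {
    def apply(tokens: Seq[String]): Map[String, Int] =
      tokens.groupBy(identity).mapValues(_.size).toMap
  }

  // A tiny text-featurization pipeline: String => Map[String, Int].
  val featurize: Node[String, Map[String, Int]] = tokenize andThen countTokens

  def main(args: Array[String]): Unit =
    println(featurize("the quick brown fox the fox"))
}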

Also included in the code are several example pipelines that demonstrate how to use the software to reproduce recent academic results in computer vision, natural language processing, and speech processing. Here’s an example:

Sample SIFT and Fisher Vector based Image Classification pipeline, included in KeystoneML.

This image classification pipeline reproduces a recent academic result from Chatfield et al. when applied to the “VOC 2007” dataset, and is automatically distributed to run on a cluster.

Users familiar with the new spark.ml package may recognize several similarities in the API concepts and interfaces between these two projects. This is no coincidence, since we contributed to the design and initial implementation of spark.ml. However, KeystoneML both provides a richer set of built-in operators – for example, image featurizers and large-scale linear solvers – and modifies the spark.ml interface to provide type safety and further robustness.

To try out KeystoneML, head over to the project home page, check out the quick start and programming guides, and browse the code on GitHub. We’d love to hear feedback from early users; please feel free to join the mailing list.

Intel Shanghai presents: AMPCamp China!


Our friends at Intel Shanghai are helping to spread the word about Spark!

AmpCamp@China 2015 will be held on May 23, 2015 at Intel’s Zizhu Campus in Shanghai.

UC Berkeley’s AMPLab is not only a world-leading research lab in big data and analytics; it has also produced BDAS (the Berkeley Data Analytics Stack), an open source, next-generation data analytics stack with wide adoption in both industry and research. Intel and the AMPLab have a long and close collaboration, especially on the open source development of many projects in BDAS (such as Spark, Shark, Tachyon, SparkR, GraphX, MLlib, etc.).

This is the first time an AMP Camp will take place abroad! Between 2010 and 2014, the Berkeley AMPLab successfully held five AMP Camp events in the States; the last event (AMP Camp 5) had over 200 people on-site and over 1800 views of the live stream from over 40 countries. This year, Intel Shanghai is collaborating with the AMPLab to bring this hugely successful event to the big data industry, developer community and academia in China.

The collaboration with the UC Berkeley AMPLab is part of Intel’s larger open source efforts in big data and analytics. For years, Intel has been playing a key role in many open source big data projects (e.g., Hadoop, Spark, HBase, Tachyon, etc.), collaborating with academia to apply advanced big data research to real-world problems, and partnering with industry and the community to build web-scale data analytics systems.

Come and join AmpCamp@China 2015 to hear about the cutting-edge components of BDAS, meet world-class big data engineers (including Apache project committers and PMC members) at Intel’s Zizhu Campus in Shanghai, and get a sneak peek at the latest research progress from AMPLab researchers.

For more information, please visit the Intel Shanghai AMPCamp page.

The exercises and course work from AMPCamp 5 will be used at this event, and are found here.

Big data meets… crapcan racing?


They say all work and no play makes Jack a dull boy…  And here at the AMPLab, we do occasionally take a break from our day jobs and have some fun.  For instance, when I’m not obsessing over our build system, I obsess over cars.  Not only do I have a massive toolbox on wheels with a wrench for every occasion, I also provide racetrack instruction and am building a car that will soon be competing in NASA Time Trial events.

Over the last several years, as described in this article, racing cars has become a popular pastime among the Silicon Valley tech crowd.  I discovered track events and racing almost seven years ago, but instead of the high-powered, boardroom-meeting style of track events, I specialize in my own special kind of grassroots obsession: one that somehow combines cars, racing, data and open source.


Our “corporate” sponsor, meaning part of my paycheck goes to my share of team expenses.

It’s endurance racing…  And not just any kind of endurance racing, but crapcan racing for $500 cars in The 24 Hours of Lemons.  The idea behind Lemons is to keep costs down and make endurance racing an attainable and (mostly) affordable hobby for anyone with the desire to do it.  Started in 2007, the series has grown from one race in California to a national series with over 40 races per year.

It’s important to note that the $500 limit applies only to the car, and you can sell off parts like the interior to bring the total spent down to $500…  Got some nice OEM seats?  A decent interior?  Strip it out and sell it!  Safety equipment, on the other hand, like the racing seat and cage, doesn’t count towards the total.  Neither do brakes, wheels, tires and “driver comfort” items (fancy steering wheel, driver cooling and hydration systems, etc.).  All in all, a decent build will cost roughly $5000-$8000 when all is said and done.  To put this in perspective, that is less than a THIRD of the cost of a competitive, race-prepped Spec Miata, and the figure skyrockets if you want to race something like a Porsche.

Some action on the track. Source: Car & Driver Magazine

The fields are huge (180+ cars racing at once) and the variety of vehicles is staggering (including a 1961 Rambler American, lots of BMW E30s, various 1980s econoboxes, and the occasional Geo Metro powered by a rear-mounted Honda CBR motorcycle engine).  Theme is important, and some of these cars would make excellent Burning Man art cars.  You can read more about some of the crazier builds here…  It’s amazing what people put on the track, and how well some of them do.  Watching the race itself is a spectacle.

With cars like this, built by teams in their backyards or even in fully stocked professional race shops, things are bound to go south.  In fact, breakdowns are so common, with teams scrambling to help each other rebuild and swap engines overnight in hopes of getting back into the race, that there’s even an award for it:  The Heroic Fix.  Other awards include the “Index of Effluency”, “Organizers Choice”, and “We Don’t Hate You, You Just Suck”, as well as the (physically) smallest trophies for the overall winners in the three classes (A, B and C, extremely loose categories based on the descending chance of either winning overall or just finishing the race).  It’s all a bit tongue-in-cheek, and meant to keep things from getting too serious.  All of us out there want to have fun without being overly competitive.  Cheating is expected, and bribes for the judges (usually booze) are plentiful.  The grassroots community is amazing, and it’s the reason Lemons is so successful.

But, even though it’s friendly, the races are extraordinarily competitive.  The top teams use every trick possible to get an edge.  We, like many other teams, use data.

 


Your couch won’t fit, but it sure does move.

My team, captained by Chris More, runs a 1991 Mazda RX-7.  Our car is themed as a U-Haul truck, and the twist echoes the common sentiment many people have when renting a nearly broken-down box truck or sketchy trailer for a move:  FU-Haul.

Unlike some of the junkers on the track, we’ve spent a lot of time making the car fast, reliable and competitive.  That being said, over the course of a race (usually 16 hours, split over two days), things that one wouldn’t expect to fail do, and sometimes in spectacular fashion.  In our most recent race (Sears Pointless, March 20-21, 2015 at Sonoma Raceway), our transmission decided to spray its vital juices up through the gear shift gate and cover the entire interior of the car (including the driver) with a coating of smelly transmission fluid.  Our front brake rotors were extremely warped as well, and under heavy braking (the best kind), the steering wheel would jerk from side to side and almost be ripped from our grasp.  Not to mention the severe fuel cutoff issues when we were below half a tank of gas…  Thankfully, we were able to hold it together and finish the race.  We placed 11th out of 181 entries, nearly cracking the top 10!

This is some seriously cool open source hardware!

But what does this have to do with big data and open source?

The four team members are all involved in high tech, counting Mozilla, Level 3 and Google alums among us.  We are all about open source, and love data…  And we collect in-car telemetry data with an open source hardware and software product from Autosport Labs called Race Capture Pro.

While we don’t use Spark for data processing (go go Google Sheets!), the data we collect is invaluable for helping us keep track of how the car and drivers perform during the race, as well as for post-race bragging rights over who turned the fastest average lap (sadly, it wasn’t me).  With this data we were able to analyze things like our average pit stop time (~5 minutes, 30 seconds over 6 stops), each driver’s average lap time in traffic (Chris is the best), and each driver’s average lap time on an open track (all four of us were within .6 seconds of each other on average over the entire weekend when turning fast laps, which is kind of amazing).

These metrics show that everyone except Chris needs to improve their race traffic management skills, and that we need to bring our pit stop times down to 5 minutes or less to contend for a top-5 finish.  Our lap times are consistent and competitive, proven with data, and we know that the drivers and car are capable of more.
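For the curious, the analysis itself is simple aggregation. A toy Scala sketch over a hypothetical (driver, lap time) log, with made-up lap times, might look like this; we actually did it in a spreadsheet:

// Toy sketch: per-driver average lap time from a hypothetical telemetry log.
object LapStatsSketch {
  case class Lap(driver: String, seconds: Double)

  def main(args: Array[String]): Unit = {
    val laps = Seq( // made-up numbers, for illustration only
      Lap("Chris", 152.3), Lap("Chris", 151.9),
      Lap("DriverB", 153.1), Lap("DriverB", 152.6)
    )
    val avgByDriver = laps.groupBy(_.driver).map { case (d, ls) =>
      d -> ls.map(_.seconds).sum / ls.size
    }
    avgByDriver.toSeq.sortBy(_._2).foreach { case (d, avg) =>
      println(f"$d%-8s average lap: $avg%.1f s")
    }
  }
}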

For a taste of how cool this data and product is, check out our statistics from the race.

For some fun reading, here’s a preview of the most recent race, and then coverage of the results.

And finally, here is some on-track action from my two-hour stint on the first day of racing…  Enjoy the wonderful sound of a rotary engine spinning up to nearly 9000rpm!

When Data Cleaning Meets Crowdsourcing


The vision of AMPLab is to integrate Algorithms (Machine Learning), Machines (Cloud Computing), and People (Crowdsourcing) together to make sense of Big Data. In the past several years, AMPLab has developed a variety of open source software components to fulfill this vision. For example, to integrate Algorithms and Machines, AMPLab is developing MLbase, a distributed machine learning system that aims to provide a declarative way to specify Machine Learning tasks.

One area where we see great potential for adding People to the mix is Data Cleaning. Real-world data is often dirty (inconsistent, inaccurate, missing, etc.). Data analysts can spend over half of their time cleaning data before doing any actual analysis. On the other hand, without data cleaning it is hard to obtain high-quality answers from dirty data. Crowdsourced data-cleaning systems could help analysts clean data more efficiently and more cheaply, which would significantly reduce the cost of the entire data analysis pipeline.

Table 1: An Example of Dirty Data (with format error, missing values, and duplicate values)


In this post, I will highlight two AMPLab research projects aimed in this direction: (1) CrowdER, a hybrid human-machine entity resolution system; (2) SampleClean, a sample-and-clean framework for fast and accurate query processing on dirty data.

CrowdER
Entity resolution (ER) in database systems is the task of finding different records that refer to the same entity. ER is particularly important when integrating data from multiple sources. In such scenarios, it is not uncommon for records that are not exactly identical to refer to the same real-world entity. For example, consider the dirty data shown in Table 1. Records r1 and r3 in the table have different text in the City Name field, but refer to the same city.

A simple method to find such duplicate records is to ask the crowd to check all possible pairs and decide whether the records in each pair refer to the same entity or not. If a data set has n records, this human-only approach requires the crowd to examine O(n^2) pairs, which is infeasible for data sets of even moderate size. Therefore, in CrowdER we propose a hybrid human-machine approach. The intuition is that among the O(n^2) pairs of records, the vast majority will be very dissimilar. Such pairs can be easily pruned using a machine-based technique, and the crowd can then be brought in to examine the remaining pairs.

Of course, in practice there are many challenges that need to be addressed. For example: (i) how can we develop fast machine-based techniques for pruning dissimilar pairs, and (ii) how can we reduce the crowd cost required to examine the remaining pairs? For the first challenge, we devise efficient similarity-join algorithms, which can prune the dissimilar pairs out of a trillion candidate pairs within a few minutes. For the second challenge, we identify the importance of exploiting transitivity to reduce the crowd cost and present a new framework for implementing this technique.
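To illustrate the machine-based pruning step, here is a toy, self-contained Scala sketch that keeps only record pairs whose token-level Jaccard similarity clears a threshold, so that only those candidates go to the crowd. CrowdER's actual similarity-join algorithms avoid the quadratic pair enumeration used here; the sketch shows only the filtering idea.

// Toy sketch of similarity-based pruning. CrowdER's real similarity joins
// avoid enumerating all O(n^2) pairs as this sketch does, and CrowdER also
// exploits transitivity (if A=B and B=C, the crowd need not check A=C).
object PruningSketch {
  def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else a.intersect(b).size.toDouble / a.union(b).size

  // Keep pairs similar enough to possibly be duplicates; only these
  // survivors are sent to the crowd for verification.
  def candidatePairs(records: Vector[String], threshold: Double): Seq[(String, String)] =
    for {
      i <- records.indices
      j <- (i + 1) until records.size
      if jaccard(tokens(records(i)), tokens(records(j))) >= threshold
    } yield (records(i), records(j))

  def main(args: Array[String]): Unit = {
    val recs = Vector("San Francisco CA", "san francisco, CA", "New York NY")
    candidatePairs(recs, threshold = 0.5).foreach(println)
  }
}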

We evaluated CrowdER on several real-world datasets that are hard for machine-based ER techniques. Experimental results showed that CrowdER achieved more than 50% higher quality than those machine-based techniques while, at the same time, being several orders of magnitude cheaper and faster than human-only approaches.

SampleClean
While crowdsourcing can make data cleaning more tractable, it is still highly inefficient for large datasets. To overcome this limitation, we started the SampleClean project, which aims to explore how to obtain accurate query results from dirty data by cleaning only a small sample of it. The following figure illustrates why SampleClean can achieve this goal.

Figure: the error in query results returned by AllDirty, AllClean and SampleClean

In the figure, we compare the error in the query results returned by three query processing methods.

  • AllDirty does not clean any data, and simply runs a query over the entire original data set.
  • AllClean first cleans the entire data set and then runs a query over the cleaned data.
  • SampleClean is our new query processing framework that requires cleaning only a sample of the data.

We can see that SampleClean returns a more accurate query result than AllDirty by cleaning a relatively small subset of the data. This is because SampleClean leverages the cleaned sample to reduce the impact of data error on its query result, while AllDirty has no such mechanism. We can also see that SampleClean is much faster than AllClean, since it only needs to clean a small sample of the data while AllClean has to clean the entire data set.
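The simplest version of the idea, for a mean query, fits in a few lines of self-contained Scala: clean a small random sample, estimate the answer from the cleaned sample, and report a confidence interval that shrinks as the sample grows. SampleClean's actual estimators are more sophisticated (for example, they can also use the cleaned sample to correct the answer computed over the full dirty data), and everything below is illustrative.

// Minimal sketch of sample-based query estimation over dirty data.
object SampleCleanSketch {
  // Stand-in for a human or rule-based cleaner that repairs one record.
  def clean(value: Double): Double =
    if (value < 0) 0.0 else value // e.g., repair impossible negative values

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    // Synthetic "dirty" data: true mean near 50, with 5% corrupted rows.
    val dirtyData = Vector.fill(100000)(rng.nextGaussian() * 10 + 50)
      .map(v => if (rng.nextDouble() < 0.05) -999.0 else v)

    // Clean only a small random sample instead of all 100,000 rows.
    val sample = rng.shuffle(dirtyData).take(1000).map(clean)

    val mean = sample.sum / sample.size
    val variance = sample.map(v => (v - mean) * (v - mean)).sum / (sample.size - 1)
    val ci95 = 1.96 * math.sqrt(variance / sample.size)

    println(f"estimated mean = $mean%.2f +/- $ci95%.2f (95%% confidence)")
  }
}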

An initial version of the SampleClean system was demonstrated recently at AMP Camp 5 [slides] [video]. We envision that the SampleClean system can add data cleaning and crowdsourcing capabilities to BDAS (the Berkeley Data Analytics Stack), enabling BDAS to become an even more powerful software stack for making sense of Big Data.

Summary
Crowdsourcing is a promising way to scale up the inefficient process of cleaning data in the data analysis pipeline. However, crowdsourcing brings with it a number of significant design challenges. In this post, I introduced CrowdER and SampleClean, two AMPLab research projects aimed at addressing these challenges. Of course, there is a wide range of other open challenges to be researched in this area. We have collected a list of recently published papers on related topics by groups around the world; interested readers can find them at this link.

AMP Camp 5

AMP Camp 5 was a huge success!  Over 200 people participated in this sold-out event, we had over 1800 views of our live stream from over 40 countries, and we received overwhelmingly positive feedback.  In addition to learning about the Berkeley Data Analytics Stack (BDAS), participants were particularly interested in interacting with many of the lead developers of the BDAS software projects, who gave talks about their work and also served as teaching assistants during the hands-on exercises.

 

This 2-day event provided participants with hands-on experience using BDAS, the set of open-source projects including Spark, Spark SQL, GraphX, and MLlib/MLbase. For the fifth installment of AMP Camp, we expanded the curriculum to include the newest open-source BDAS projects, including Tachyon, SparkR, ML Pipelines and ADAM, as well as a variety of research and use-case talks.


Details about AMP Camp 5, including slides and videos from the talks as well as all of the training material for the hands-on exercises, are available on the AMP Camp website.
