Highlights From the AMPLab Winter 2012 Retreat

Error: Unable to create directory uploads/2024/04. Is its parent directory writable by the server?


The 2nd AMPLab research retreat was held Jan 11-13, 2012 at a mostly snowless Lake Tahoe.   120 people from 21 companies, several other schools and labs, and of course UC Berkeley spent 2.5 days getting an update on the current state of research in the lab, discussing trends and challenges in Big Data analytics, and sharing ideas, opinions and advice.   Unlike our first retreat, held last May, which was long on vision and inspiring guest speakers,  the focus of this retreat was current research results and progress.   Other than a few short overview/intro talks by faculty, virtually all of the talks (16 out of 17) were presented by students from the lab.   Some of these talks discussed research that had been recently published, but most of them discussed work that was currently underway, or in some cases, just getting started.

The first set of talks was focused on Applications.   Tim Hunter described how he and others used Spark to improve the scalability of the core traffic estimation algorithm used in the Mobile Millennium system, giving them the ability to run models faster than real-time and to scale to larger road networks.   Alex Kantchelian presented some very cool results for algorithmically detecting spam in tweet streams.    Matei Zaharia described a new algorithmic approach to Sequence Alignment called SNAP.   SNAP rethinks sequence alignment to exploit the longer reads that are being produced by modern sequencing machines and shows 10x to 100x speed ups over the state of the art, as well as improvements in accuracy.

The second technical session was about the Data Management portion of the BDAS (Berkeley Data Analytics System) stack that we are building in the AMPLab.    Newly-converted database professor Ion Stoica gave an overview of the components.   And then there were short talks on SHARK (an implementation of the Hive SQL processor on Spark),  Quicksilver – an approximate query answering system that is aimed at massive data,  scale-independent view maintenance in PIQL (the Performance Insightful Query Language), and a streaming (i.e., very low-latency) implementation of Spark.  These were presented by Reynold Xin, Sameer Agarwal, Michael Armbrust and Matei Zaharia, respectively.   Undergrad Ankur Dave wrapped up the session by wowing the crowd with a live demo of the Spark Debugger that he built – showing how the system can be used to isolate faults in some pretty gnarly, parallel data flows.

The Algorithms and People parts of the AMP agenda were represented in the 3rd technical session.  John Ducci presented his results on speeding up stochastic optimization for a host of applications.  He developed a parallelized method for introducing random noise into the process that leads to faster convergence.    Fabian Wauthier reprised his recent NIPS talk on detecting and correcting for Bias in crowdsourced input.   Beth Trushkowsky talked about “Getting it all from the Crowd”, and showed how we must think differently about the meaning of queries in a hybrid machine/human database system such as CrowdDB.

A session on Machine-focused topics included talks by Ali Ghodsi on the PacMan caching approach for map-reduce style workloads,  Patrick Wendell on early work on low-latency scheduling of parallel jobs, Mosharaf Chowdhury on fair sharing of network resources in large clusters, and Gene Pang on a new programming model and consistency protocol for applications that span multiple data centers.

The technical talks were rounded out by two presentations from students who worked with partner companies to get access to real workloads, logs and systems traces.  Yanpei Chen talked about an analysis of the characteristics of various MapReduce loads from a number of sources.   Ari Rabkin presented an analysis of trouble tickets from Cloudera.

As always, we got a lot of feedback from our Industrial attendees.   A vigorous debate broke out about the extent to which the lab should work on producing  a complete, industrial-strength analytics stack.   Some felt we should aim to match the success of earlier high-impact projects coming out of Berkeley, such as BSD and Ingres.  Others insisted that we focus on high-risk, further out research and leave the systems building to them.   There were also discussions about challenge applications (such as the Genomics X  Prize competition) and how to ensure that we achieve the high degree of integration among the Algorithms, Machines and People components of the work, which is the hallmark of our research agenda.   Another topic of great interest to the Industrial attendees was around how to better facilitate interactions and internships with the always amazing and increasingly in-demand students in the lab.

From a logistical point of view, we tried a few new things.   The biggest change was  with the poster session(s).   As always, the cost of admission for students was to present a poster of their current research.   This year, however, we also invited visitors to submit posters describing relevant work at their companies in general, and projects for which they were looking to hire interns in particular.   We then partitioned the posters into two separate poster sessions (one each night), thereby giving everyone a chance to spend more time discussing the projects that they were most interested in while still getting a chance to survey the wide scope of work being presented.   Feedback on both of these changes was overwhelmingly positive.  So we’ll likely stick to this new format for future retreats.

Kattt Atchley, Jon Kuroda and Sean McMahon did a flawless job of organizing the retreat.   Thanks to them and all the presenters and attendees for making it a very successful event.

Traffic jams, cell phones and big data

Timothy Hunter

(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly)

It is well known that big data processing is becoming increasingly important in many scientific fields including astronomybiomedicine and climatology.  In addition, newly created hybrid disciplines like biostatisics are an even stronger indicators of this overall trend. Other fields like civil engineering, and in particular transportation, are no exception to the rule and the AMP Lab is actively collaborating with the department of Intelligent Transportation Systems at Berkeley to explore this new frontier.

It comes as no surprise to residents of California that congestion on the streets is a major challenge that affects everyone. While it is well studied for highways, it remains an open question for urban streets (also called the arterial road network). So far, the most promising source of data is the GPS of cellphones. However, a large volume of this very noisy data is required in order to maintain a good level of accuracy. The rapid adoption of smartphones, all equipped with GPS, is changing the game. I introduce in this post some ongoing efforts to combine Mobile Millennium, a state-of-the-art transportation framework, with the AMPLab software stack.

What does this GPS data look like? Here is an example in the San Francisco Bay area: a few hundred taxicabs relay their position every minute in real time to our servers.

The precise trajectories of the vehicles are unobserved and need to be reconstructed using a sophisticated map matching pipeline implemented in Mobile Millennium. The results of this process are some timestamped trajectory segments. These segments are the basic observations to predict traffic.


Our traffic estimation algorithms work by guessing a probability distribution of the travel time on each link of the road network. This process is iterated to improve the quality of estimates. This overall algorithm is intensive both in terms of computations and memory. Fortunately, it also fits into the category of “embarrassingly parallel” algorithms and is a perfect candidate for distributed computing.
Implementing a high-performance algorithm as a distributed system is not an easy task. Instead of implementing this by hand, our implementation relies on Spark, a programming framework in Scala. Thanks to the Spark framework, we were able to port our single machine implementation to the EC2 cloud within a few weeks to achieve nearly linear scaling. In a future post, I will discuss some practical considerations we faced to when  integrating the AMPLab stack with the Mobile Millennium system.

 

Trip Report from the NIPS Big Learning Workshop

Error: Unable to create directory uploads/2024/04. Is its parent directory writable by the server?

A few weeks ago, I went to the Big Learning workshop at NIPS, held in Spain. The workshop brought together researchers in large-scale machine learning, an area near and dear to the AMP Lab’s goal of integrating Algorithms, Machines, and People to tame big data, and contained a lot of interesting work. There were about ten invited talks and ten paper presentations. I myself gave an invited talk on Spark, our framework for large-scale parallel computing, which won a runner-up best presentation award.

The topics presented ranged from FPGAs to accelerate vision algorithms in embedded devices, to GPU programming, to cloud computing on commodity clusters. For me, some highlights included the discussion on training the Kinect pose recognition algorithm using DryadLINQ, which ran on several thousand cores and had to overcome substantial fault mitigation and I/O challenges; and the GraphLab presentation from CMU, which discussed many interesting applications implemented using their asynchronous programming model. Daniel Whiteson from UC Irvine also gave an extremely entertaining talk on the role of machine learning in the search for new subatomic particles.

One of the groups I was happy to see represented was the Scala programming language team from EPFL. Scala features prominently as a high-level language for parallel computing. We use it in the Spark programming framework in our lab, as well as the SCADS scalable key-value store. It’s also used heavily in the Pervasive Parallelism Lab at a certain school across the bay. It was good to hear that the Scala team is working on new features that will make the language easier to use as a DSL for parallel computing, making it simpler to build highly expressive programming tools in Scala such as Spark.

The AMP Lab was also represented by John Duchi, who presented a new algorithm for stochastic gradient descent in non-smooth problems that is the first parallelizable approach for these problems, and Ariel Kleiner and Ameet Talwalkar, who presented the Bag of Little Bootstraps, a scalable bootstrap algorithm based on subsampling. It’s certainly neat to see two successes in parallelizing very disparate statistical algorithms one year into the AMP Lab.

In summary, the workshop showcased very diverse ideas and showed that big learning is a hot field. It was the biggest workshop at NIPS this year. In the future, as users gain experience with the various programming models and the best algorithms for each problem type are found, we expect to see some consolidation of these ideas into unified
stacks of composable tools. Designing and building such a stack is one of the main goals of our lab.