Traffic jams, cell phones and big data

(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly)

It is well known that big data processing is becoming increasingly important in many scientific fields, including astronomy, biomedicine and climatology. Newly created hybrid disciplines like biostatistics are an even stronger indicator of this trend. Other fields like civil engineering, and transportation in particular, are no exception, and the AMPLab is actively collaborating with the department of Intelligent Transportation Systems at Berkeley to explore this new frontier.

It comes as no surprise to residents of California that congestion on the streets is a major challenge that affects everyone. While congestion is well studied for highways, it remains an open question for urban streets (also called the arterial road network). So far, the most promising source of data is the GPS of cellphones. However, this data is very noisy, and a large volume of it is required to maintain a good level of accuracy. The rapid adoption of smartphones, all equipped with GPS, is changing the game. In this post, I introduce some ongoing efforts to combine Mobile Millennium, a state-of-the-art transportation framework, with the AMPLab software stack.

What does this GPS data look like? Here is an example in the San Francisco Bay Area: a few hundred taxicabs relay their position every minute in real time to our servers.

The precise trajectories of the vehicles are unobserved and need to be reconstructed using a sophisticated map matching pipeline implemented in Mobile Millennium. The result of this process is a set of timestamped trajectory segments, which are the basic observations used to predict traffic.
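
To make this concrete, here is a minimal sketch of what one such segment could look like as a data structure. The class name, field names and sample values are assumptions made for illustration; they are not the actual Mobile Millennium schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass(frozen=True)
class TrajectorySegment:
    """One map-matched piece of a vehicle's path (hypothetical schema)."""

    vehicle_id: str
    link_ids: Tuple[str, ...]  # road links traversed, in driving order
    start_time: datetime       # timestamp of the first GPS fix
    end_time: datetime         # timestamp of the following GPS fix

    @property
    def travel_time(self) -> timedelta:
        """Time spent covering all links of the segment."""
        return self.end_time - self.start_time


# Made-up example: a taxi reporting one minute apart, matched to three links.
segment = TrajectorySegment(
    vehicle_id="taxi-17",
    link_ids=("link-a", "link-b", "link-c"),
    start_time=datetime(2012, 1, 1, 8, 0, 0),
    end_time=datetime(2012, 1, 1, 8, 1, 0),
)
```

Note that in this sketch a segment only records the total time over several links. How that time is divided between the individual links is not observed directly.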

Our traffic estimation algorithms work by estimating a probability distribution of the travel time on each link of the road network. This process is iterated to improve the quality of the estimates. The overall algorithm is both computation- and memory-intensive. Fortunately, it also fits into the category of “embarrassingly parallel” algorithms and is a perfect candidate for distributed computing.
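
The sketch below shows one plausible shape for such an iterated, parallel estimation loop. It is a deliberately simplified stand-in, not the Mobile Millennium algorithm: it assumes each observation is a total travel time over a short sequence of links, splits that time across the links in proportion to the current per-link means, and then re-estimates a mean and variance for each link. The names `allocate` and `estimate`, the starting guess of 30 seconds and the sample observations are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Dict, List, Sequence, Tuple

# Hypothetical observation: (links traversed in order, total travel time in seconds).
Observation = Tuple[Tuple[str, ...], float]


def allocate(means: Dict[str, float], obs: Observation) -> List[Tuple[str, float]]:
    """Split one observed travel time across its links, in proportion to the current means."""
    links, total = obs
    weight = sum(means[link] for link in links)
    return [(link, total * means[link] / weight) for link in links]


def estimate(
    observations: Sequence[Observation],
    iterations: int = 20,
    initial_mean: float = 30.0,  # assumed starting guess, in seconds
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return per-link mean and variance of travel time after a fixed number of passes."""
    means = {link: initial_mean for links, _ in observations for link in links}
    variances = {link: 0.0 for link in means}
    # Threads stand in for cluster workers here.
    with ThreadPoolExecutor() as pool:
        for _ in range(iterations):
            # Each observation is processed independently of the others;
            # this is the step a cluster framework would distribute.
            allocated = list(pool.map(partial(allocate, means), observations))

            # Gather the per-link shares (the "reduce by link" step).
            samples: Dict[str, List[float]] = defaultdict(list)
            for shares in allocated:
                for link, seconds in shares:
                    samples[link].append(seconds)

            # Re-estimate each link's distribution from its shares.
            for link, values in samples.items():
                mean = sum(values) / len(values)
                means[link] = mean
                variances[link] = sum((v - mean) ** 2 for v in values) / len(values)
    return means, variances


# Made-up observations: three trips over two links "a" and "b".
observations = [(("a",), 20.0), (("a", "b"), 50.0), (("b",), 30.0)]
means, variances = estimate(observations)
```

With these three trips, the estimated means approach 20 seconds for link `a` and 30 seconds for link `b` as the passes accumulate. The allocation step touches one observation at a time, which is what makes this kind of loop easy to spread over many machines.
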

Implementing a high-performance algorithm as a distributed system is not an easy task. Instead of doing this by hand, our implementation relies on Spark, a distributed computing framework written in Scala. Thanks to Spark, we were able to port our single-machine implementation to the EC2 cloud within a few weeks and achieve nearly linear scaling. In a future post, I will discuss some practical considerations we faced when integrating the AMPLab stack with the Mobile Millennium system.