An NLP library for Matlab


Matlab is a great language for prototyping ideas. It comes with many libraries, especially for machine learning and statistics. But when it comes to processing natural language, Matlab is extremely slow. As a result, many researchers use other languages to pre-process the text and convert it to numerical data, and then bring the resulting data into Matlab for further analysis.

I used to use Java for this. I would tokenize the text in Java, save the resulting matrices to disk, and read them into Matlab. After a while this procedure became cumbersome: I had to go back and forth between Java and Matlab, the process was prone to human error, and the codebase just looked ugly.

Recently, Jason Chen and I started putting together an NLP toolbox for Matlab. It is still a work in progress, but you can download the latest version from our github repository [link]. There is also an installation guide that helps you properly install it on your machine. I have also built a simple map-reduce tool that allows many of the functions to utilize all of the cores on your CPU.

So far the toolbox has modules for text tokenization (Bernoulli, multinomial, tf-idf, and n-gram tools), text preprocessing (stop-word removal, text cleaning, stemming), and some learning algorithms (linear regression, decision trees, support vector machines, and a Naïve Bayes classifier). We have also implemented evaluation metrics (precision, recall, F1 score, and MSE). The support vector machine tool is a wrapper around the famous LibSVM, and we are working on another wrapper for SVM-light. A part-of-speech tagger is coming very soon too.

I have been focusing on getting the different parts to run efficiently. For example, the tokenizer uses Matlab’s native hashmap data structure (containers.Map) to make a single, efficient pass over the corpus and tokenize it.
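
To sketch the idea outside of Matlab (a minimal Python analogue, not the toolbox’s actual code), a single hashmap pass that assigns integer ids to tokens and accumulates per-document counts looks roughly like this:

```python
from collections import defaultdict

def tokenize_corpus(documents):
    """Single pass over the corpus: map each new token to an integer id
    and accumulate per-document term counts (a sparse bag-of-words)."""
    vocab = {}                              # token -> integer id (the hashmap)
    doc_counts = []
    for text in documents:
        counts = defaultdict(int)
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)   # assign the next free id
            counts[vocab[token]] += 1
        doc_counts.append(dict(counts))
    return vocab, doc_counts

vocab, counts = tokenize_corpus(["the cat sat", "the dog sat down"])
print(len(vocab))   # 5 distinct tokens
```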

We are also adding examples and demos for the toolbox. The first example is a sentiment analysis tool that uses the library to predict whether a movie review is positive or negative. The code reaches an F1 score of 0.83, misclassifying only 26 out of 200 movie reviews.
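
For reference, the metrics mentioned above are easy to compute from the raw predictions. Here is a minimal sketch (again in Python, not the toolbox’s Matlab implementation):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```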

Please try the toolbox, and note that it is still a work in progress: some functions are still slow, and we are working to improve them. I would love to hear what you think. If you want us to implement something that would be useful to you, just let us know.

Connecting Big Data around the World


The internet enables millions of users world-wide to create, modify, and share data through platforms like Twitter, Facebook, Gmail, and many other services, generating gigantic data sets. This world-wide data access requires replicating the data across multiple data centers, not only to bring the data closer to the user for shorter latencies, but also to increase availability in the event of a data center failure.
However, keeping replicas synchronized and consistent, so that a user’s data is never lost and always up-to-date, is expensive. Inter-data center network delays are in the hundreds of milliseconds and vary significantly. Synchronous wide-area replication has therefore been assumed infeasible with strong consistency for interactive applications. Current solutions either settle for asynchronous replication, which risks losing data in the event of failures, or for relaxed consistency, which can, for example, cause updates to appear and disappear from the application in an unpredictable fashion.

With MDCC (Multi-Data Center Consistency), we describe the first synchronous replication protocol that requires neither a master nor static partitioning and that provides strong consistency at a cost similar to eventually consistent protocols, using only a single round trip across data centers to apply an update in the normal operational case. Not only do users get faster response times because the data is located close to them; they also always experience the same consistency and application behavior, regardless of the presence of major failures. We further propose a new programming model that empowers the application developer to handle the longer and less predictable latencies caused by inter-data center communication more effectively. Our evaluation uses the TPC-W benchmark (which simulates a web shop like Amazon’s) with MDCC deployed across 5 geographically diverse data centers. It shows that MDCC achieves transaction throughput and latency similar to eventually consistent quorum protocols, and that MDCC can sustain a data center outage without a significant impact on response times, all while guaranteeing strong consistency.
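
The latency argument is easy to see with a toy model: a protocol that commits after one parallel round of messages pays only the quorum-th fastest round trip, while anything needing two rounds pays roughly double. The sketch below uses made-up RTTs and ignores everything that makes MDCC’s actual Paxos-based protocol work; it only illustrates the arithmetic:

```python
import random

# Hypothetical round-trip times (ms) from a coordinator to 5 data centers.
RTT_MS = [1, 80, 100, 150, 170]

def commit_latency(quorum_size, jitter=0.1):
    """One parallel round of messages to all replicas; the update commits as
    soon as `quorum_size` acknowledgments arrive, so the latency is the
    quorum_size-th fastest round trip."""
    samples = sorted(rtt * random.uniform(1 - jitter, 1 + jitter) for rtt in RTT_MS)
    return samples[quorum_size - 1]

print(commit_latency(3))      # one wide-area round trip: ~100 ms
print(2 * commit_latency(3))  # a protocol needing two round trips pays double
```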

For more information, please visit our MDCC website.

MDCC was developed by Tim Kraska, Gene Pang, Mike Franklin and Samuel Madden.

Sweet Storage SLOs with Frosting


A typical page load requires sourcing and combining many pieces of data. For example, a frontend application like a newsfeed requires making many storage requests to fetch your name, your photo, your friends and their photos, and your friends’ most recent posts. Since loading a page requires making many of these storage requests, controlling storage request latency is crucial to reducing overall page load times.

Storage systems are thus provisioned and tuned to meet these latency requirements. However, this means provisioning for peak, not average, load, so the hardware is often underutilized. MapReduce batch analytics jobs are perfect for soaking up this slack capacity. However, traditional storage systems are unable to support both a MapReduce and a frontend workload without adversely affecting frontend latency. The problem is compounded by the dynamic, time-varying nature of a frontend workload, which makes it difficult to tune the storage system for a single set of conditions.

This is what motivated our work on Frosting. Frosting is a request scheduling layer on top of HBase, a distributed column-store, which dynamically tunes its internal scheduling to meet the requirements of the current workload. Application programmers directly specify high-level performance requirements to Frosting in the form of service-level objectives (SLOs), which are throughput or latency requirements on operations to HBase. Frosting then carefully admits requests to HBase such that these SLOs are met.

In the case of combining a high-priority, latency-sensitive frontend workload and a low-priority, throughput-oriented MapReduce workload, Frosting will continually monitor the frontend’s empirical latency and only admit requests from MapReduce when the frontend’s SLO is satisfied. For instance, if the frontend is easily meeting its latency target, Frosting might choose to admit more MapReduce requests since there is slack capacity in the system. If the frontend latency increases above its SLO due to increased load, Frosting will accordingly admit fewer MapReduce requests.
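
As a rough sketch of this feedback loop (not Frosting’s actual implementation; the class, thresholds, and adaptation rule here are illustrative):

```python
import random

class SloAdmissionController:
    """Admit low-priority (MapReduce) requests only while the high-priority
    (frontend) workload is meeting its latency SLO."""

    def __init__(self, slo_ms, step=0.05):
        self.slo_ms = slo_ms        # frontend latency target
        self.step = step            # how aggressively to adapt
        self.admit_prob = 0.5       # fraction of MapReduce requests admitted

    def observe(self, frontend_p99_ms):
        if frontend_p99_ms < self.slo_ms:
            # SLO satisfied: there is slack, admit more batch work.
            self.admit_prob = min(1.0, self.admit_prob + self.step)
        else:
            # SLO violated: throttle batch work to protect the frontend.
            self.admit_prob = max(0.0, self.admit_prob - self.step)

    def admit_mapreduce(self):
        return random.random() < self.admit_prob
```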

This is ongoing work with Shivaram Venkataraman (shivaram@eecs) and Sara Alspaugh (alspaugh@eecs). No paper is yet available publicly. However, we’d love to talk to you if you’re interested in Frosting, especially if you use HBase in production, or have workload traces that we could get access to.

Energy Debugging with Carat Enters Beta


Carat is a new research project in the AMP Lab that aims to detect energy bugs—app behavior that is consuming energy unnecessarily—using data collected from a community of mobile devices. Carat provides users with actions they can take to improve battery life (and the expected improvements).

Carat collects usage data on devices (we care about privacy), aggregates these data in the cloud, performs a statistical analysis using Spark, and reports the results back to users. In addition to the Action List shown in the figure, the app empowers users to dive into the data, answering questions like “How does my energy use compare to similar devices?” and “What specific information is being sent to the server?”

The key insight of our approach is that we can acquire implicit statistical specifications of what constitutes “normal” energy use under different circumstances. This idea of statistical debugging has been applied to correctness and performance bugs, but this is the first application to energy bugs. The project faces a number of interesting (and sometimes distinguishing) technical challenges, such as accounting for sampling bias, reasoning with noisy and incomplete information, and providing users with an experience that rewards them for participating.
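
A cartoon version of this idea (our actual analysis is more careful about sampling bias and uncertainty, and all names here are illustrative): compare the distribution of battery drain rates while an app is running against the baseline distribution, and flag the app only when the difference is statistically credible.

```python
from math import sqrt
from statistics import mean, stdev

def looks_like_energy_bug(drain_with_app, drain_without_app, z=1.96):
    """Flag an app when the mean battery drain rate (%/hour) while it runs
    exceeds the baseline, with non-overlapping ~95% confidence intervals.
    Both arguments are lists of drain-rate samples (at least two each)."""
    def ci(samples):
        m = mean(samples)
        half = z * stdev(samples) / sqrt(len(samples))
        return m - half, m + half

    lower_with, _ = ci(drain_with_app)
    _, upper_without = ci(drain_without_app)
    return lower_with > upper_without   # whole CI above the baseline CI
```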

We need your help testing our iOS implementation and gathering some initial data! If you have an iPhone or iPad with iOS 5.0 or later and are willing to give us a few minutes of your time, please click here.

Getting It All from the Crowd


What does a query result mean when the data comes from the crowd? This is one of the fundamental questions raised by CrowdDB, a hybrid human/machine database system developed here in the AMP lab. For example, consider what could be thought of as the simplest query: SELECT * FROM table. If tuples are being provided by the crowd, how do you know when the query is complete? Can you really get them all?

In traditional database systems, query processing is based on the closed-world assumption: all data relevant to a query is assumed to reside in the database. When the data is crowdsourced, we find that this assumption no longer applies; existing information can always be extended further by asking the crowd. However, in our current work we show that it is possible to understand query results in the “open world” by reasoning about query completeness and the cost-benefit tradeoff of acquiring more data.

Consider asking workers on a crowdsourcing platform (e.g., Amazon Mechanical Turk) to provide items in a set (one at a time). As you can imagine, answers arriving from the crowd follow a pattern of diminishing returns: initially there is a high rate of arrival for previously unseen answers, but as the query progresses the arrival rate of new answers begins to taper off. The figure below shows an example of this curve when we asked workers to provide names of the US States.

[Figure: average species accumulation curve from the US States experiment – number of unique answers seen vs. total number of answers.]

This behavior is well known in fields such as biology and statistics, where this type of figure is known as the Species Accumulation Curve (SAC). The analysis is part of the species estimation problem, where the goal is to estimate the number of distinct species from observations of species in the locale of interest. We apply these techniques in the new context of crowdsourced queries by drawing an analogy between observed species and worker answers from the crowd. It turns out that the estimation algorithms sometimes fail due to crowd-specific behaviors, such as some workers providing many more answers than others (“streakers vs. samplers”). We address this by designing a heuristic that reduces the impact of overzealous workers. We also demonstrate a heuristic to detect when workers are consulting the same list on the web, which is helpful if the system wants to switch to another data-gathering regime, like webpage scraping.

Species estimation techniques provide a way to reason about query results despite being in the open world. For queries with a bounded result size, we can form a progress estimate as answers arrive by predicting the cardinality of the result set. Of course, some sets are very large or contain items that few workers would think of or find (the long tail), so it does not make sense to try to predict set size. For these cases, we propose a pay-as-you-go approach to directly consider the benefit of asking the crowd for more answers.
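
As a concrete example of such an estimator, here is the classic Chao1 estimate, which predicts the number of unseen items from the counts of answers seen exactly once and exactly twice. This is a simplified sketch; the estimators and streaker-resistant heuristics we actually use are described in the paper.

```python
from collections import Counter

def chao1_estimate(answers):
    """Estimate the number of distinct items in the underlying set from a
    stream of (possibly repeated) crowd answers, using the Chao1 estimator:
    observed + f1^2 / (2 * f2), where f1/f2 count answers seen once/twice."""
    freq = Counter(a.strip().lower() for a in answers)
    observed = len(freq)
    f1 = sum(1 for c in freq.values() if c == 1)   # singletons
    f2 = sum(1 for c in freq.values() if c == 2)   # doubletons
    if f2 == 0:
        return observed + f1 * (f1 - 1) / 2.0      # bias-corrected fallback
    return observed + f1 ** 2 / (2.0 * f2)

answers = ["California", "Texas", "texas", "Ohio", "Iowa", "Ohio", "Maine"]
print(len(set(a.lower() for a in answers)), "seen so far")       # 5
print(round(chao1_estimate(answers), 1), "estimated in total")   # 7.2
```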

For more details, please check out the paper.

Highlights From the AMPLab Winter 2012 Retreat

The 2nd AMPLab research retreat was held Jan 11-13, 2012 at a mostly snowless Lake Tahoe. 120 people from 21 companies, several other schools and labs, and of course UC Berkeley spent 2.5 days getting an update on the current state of research in the lab, discussing trends and challenges in Big Data analytics, and sharing ideas, opinions and advice. Unlike our first retreat, held last May, which was long on vision and inspiring guest speakers, the focus of this retreat was current research results and progress. Other than a few short overview/intro talks by faculty, virtually all of the talks (16 out of 17) were presented by students from the lab. Some of these talks discussed research that had been recently published, but most of them discussed work that was currently underway, or in some cases, just getting started.

The first set of talks focused on Applications. Tim Hunter described how he and others used Spark to improve the scalability of the core traffic estimation algorithm used in the Mobile Millennium system, giving them the ability to run models faster than real time and to scale to larger road networks. Alex Kantchelian presented some very cool results on algorithmically detecting spam in tweet streams. Matei Zaharia described a new algorithmic approach to sequence alignment called SNAP. SNAP rethinks sequence alignment to exploit the longer reads produced by modern sequencing machines, and shows 10x to 100x speedups over the state of the art, as well as improvements in accuracy.

The second technical session was about the Data Management portion of the BDAS (Berkeley Data Analytics System) stack that we are building in the AMPLab. Newly-converted database professor Ion Stoica gave an overview of the components. Then there were short talks on Shark (an implementation of the Hive SQL processor on Spark), Quicksilver (an approximate query answering system aimed at massive data), scale-independent view maintenance in PIQL (the Performance Insightful Query Language), and a streaming (i.e., very low-latency) implementation of Spark, presented by Reynold Xin, Sameer Agarwal, Michael Armbrust, and Matei Zaharia, respectively. Undergrad Ankur Dave wrapped up the session by wowing the crowd with a live demo of the Spark Debugger he built, showing how the system can be used to isolate faults in some pretty gnarly parallel data flows.

The Algorithms and People parts of the AMP agenda were represented in the third technical session. John Duchi presented his results on speeding up stochastic optimization for a host of applications; he developed a parallelized method for introducing random noise into the process that leads to faster convergence. Fabian Wauthier reprised his recent NIPS talk on detecting and correcting for bias in crowdsourced input. Beth Trushkowsky talked about “Getting It All from the Crowd”, showing how we must think differently about the meaning of queries in a hybrid machine/human database system such as CrowdDB.

A session on Machine-focused topics included talks by Ali Ghodsi on the PACMan caching approach for map-reduce style workloads, Patrick Wendell on early work on low-latency scheduling of parallel jobs, Mosharaf Chowdhury on fair sharing of network resources in large clusters, and Gene Pang on a new programming model and consistency protocol for applications that span multiple data centers.

The technical talks were rounded out by two presentations from students who worked with partner companies to get access to real workloads, logs, and system traces. Yanpei Chen talked about an analysis of the characteristics of various MapReduce workloads from a number of sources. Ari Rabkin presented an analysis of trouble tickets from Cloudera.

As always, we got a lot of feedback from our industrial attendees. A vigorous debate broke out about the extent to which the lab should work on producing a complete, industrial-strength analytics stack. Some felt we should aim to match the success of earlier high-impact projects coming out of Berkeley, such as BSD and Ingres. Others insisted that we focus on high-risk, farther-out research and leave the systems building to them. There were also discussions about challenge applications (such as the Genomics X Prize competition) and how to ensure that we achieve the high degree of integration among the Algorithms, Machines, and People components of the work that is the hallmark of our research agenda. Another topic of great interest to the industrial attendees was how to better facilitate interactions and internships with the always amazing and increasingly in-demand students in the lab.

From a logistical point of view, we tried a few new things. The biggest change was with the poster sessions. As always, the cost of admission for students was to present a poster of their current research. This year, however, we also invited visitors to submit posters describing relevant work at their companies in general, and projects for which they were looking to hire interns in particular. We then partitioned the posters into two separate poster sessions (one each night), giving everyone a chance to spend more time discussing the projects they were most interested in while still getting to survey the wide scope of work being presented. Feedback on both of these changes was overwhelmingly positive, so we’ll likely stick to this new format for future retreats.

Kattt Atchley, Jon Kuroda and Sean McMahon did a flawless job of organizing the retreat. Thanks to them and all the presenters and attendees for making it a very successful event.

Trip Report from the NIPS Big Learning Workshop


A few weeks ago, I went to the Big Learning workshop at NIPS, held in Spain. The workshop brought together researchers in large-scale machine learning, an area near and dear to the AMP Lab’s goal of integrating Algorithms, Machines, and People to tame big data, and it featured a lot of interesting work. There were about ten invited talks and ten paper presentations. I gave an invited talk on Spark, our framework for large-scale parallel computing, which won a runner-up award for best presentation.

The topics presented ranged from FPGAs to accelerate vision algorithms in embedded devices, to GPU programming, to cloud computing on commodity clusters. For me, some highlights included the discussion on training the Kinect pose recognition algorithm using DryadLINQ, which ran on several thousand cores and had to overcome substantial fault mitigation and I/O challenges; and the GraphLab presentation from CMU, which discussed many interesting applications implemented using their asynchronous programming model. Daniel Whiteson from UC Irvine also gave an extremely entertaining talk on the role of machine learning in the search for new subatomic particles.

One of the groups I was happy to see represented was the Scala programming language team from EPFL. Scala features prominently as a high-level language for parallel computing. We use it in the Spark programming framework in our lab, as well as the SCADS scalable key-value store. It’s also used heavily in the Pervasive Parallelism Lab at a certain school across the bay. It was good to hear that the Scala team is working on new features that will make the language easier to use as a DSL for parallel computing, making it simpler to build highly expressive programming tools in Scala such as Spark.

The AMP Lab was also represented by John Duchi, who presented a new algorithm for stochastic gradient descent in non-smooth problems that is the first parallelizable approach for these problems, and Ariel Kleiner and Ameet Talwalkar, who presented the Bag of Little Bootstraps, a scalable bootstrap algorithm based on subsampling. It’s certainly neat to see two successes in parallelizing very disparate statistical algorithms one year into the AMP Lab.

In summary, the workshop showcased very diverse ideas and showed that big learning is a hot field. It was the biggest workshop at NIPS this year. In the future, as users gain experience with the various programming models and the best algorithms for each problem type are found, we expect to see some consolidation of these ideas into unified stacks of composable tools. Designing and building such a stack is one of the main goals of our lab.

An AMP Blab about some recent system conferences – Part 3: Hadoop World 2011


I recently had the pleasure of visiting Portugal for SOSP/SOCC, and New York for Hadoop World. Below are some bits that I found interesting. This is the personal opinion of an AMP Lab grad student – in no way does it represent any official or unanimous AMP Lab position.

Part 3: Hadoop World 2011

Not exactly a research conference, Hadoop World is a multi-track industry convention hosted by Cloudera, an enterprise Hadoop vendor, and draws various companies with some stake in the Hadoop community. This year’s Hadoop World saw some 1500 attendees, including Hadoop vendors, Hadoop users, executives from various companies, vendors building on top of Hadoop, people looking to learn more about Hadoop, and of course, a small contingent of researchers. I believe Hadoop World is a good place for researchers to get a state-of-the-industry view of the big data, big systems space.

One theme is that Hadoop has really become “mainstream”, and moved much beyond its initial use cases in supporting e-commerce type services. The convention agenda included talks from household names beyond typical high-tech industries. The talks also had audiences in ripped jeans and flip flops sitting next to others in pressed three piece suits, indicating the present diversity of the community, and perhaps pointing to opportunities for multi-disciplinary collaboration in the near future.

Accel Partners announced a $100M “Big data fund” to accelerate innovation in all layers of the “big data stack”. This should be of interest to entrepreneurial-minded students within the Lab.

Another theme is that Hadoop is still waiting for a “killer app”. One keynote speaker dubbed 2012 to be “the year of apps”. In other words, the Hadoop infrastructure is sufficient to be “enterprise ready”; therefore innovation should now focus on using Hadoop to derive business value.

Also, the “data scientist” role is gaining prominence. Jeff Hammerbacher pioneered this role at Facebook. Companies across many industries are looking for similarly skilled people to make sense of the data deluge that’s happening everywhere. The role requires some combination of expertise in computer science, statistics, social science, natural science, business, and other skills. The AMP Lab is rooted in computer science and statistics and, depending on individual students’ interests, also reasonably literate in the social science/natural science/business areas. I certainly found it motivational to see the countless ways that the Lab’s expertise can be applied to create business value, help improve the quality of life, and even discover new knowledge.

NetApp and Cloudera announced a partnership to provide the NetApp Open Solution for Hadoop running on Cloudera’s Distribution including Apache Hadoop. It’s great to see increased collaboration between our industry partners beyond knowledge sharing through the Lab.

I gave a joint talk on “Hadoop and Performance” with Todd Lipcon, our colleague from Cloudera. The talk was well received, and folks are looking forward to our imminent release of the “Cloudera Hadoop workload suite”. One could say that typical enterprises focus either on profit (monetary and societal) or on arguing that “my performance is better”. Thus, it remains the academic community’s responsibility, and opportunity, to develop scientific design and performance evaluation methodologies.

No travel notes this time.

An AMP Blab about some recent system conferences – Part 2: SOCC 2011


I recently had the pleasure of visiting Portugal for SOSP/SOCC, and New York for Hadoop World. Below are some bits that I found interesting. This is the personal opinion of an AMP Lab grad student – in no way does it represent any official or unanimous AMP Lab position.

Part 2: Symposium on Cloud Computing (SOCC) 2011

This year was the second iteration of the conference, and SOCC has certainly established itself as a noteworthy venue that brings together diverse computer systems specialties. The proceedings are available through the ACM. Perhaps SOCC will become a stand-alone venue next year, instead of being co-located with SIGMOD (last year) or SOSP (this year).

The AMP Lab is fortunate to have inherited many members from its predecessor, the RAD Lab, which made some contributions in highlighting cloud computing as an important technology trend and emerging research area. The numerous SOCC papers on MapReduce optimizations and key-value stores continue the research paths that the RAD Lab helped identify regarding MapReduce schedulers and scale-independent storage.

The program committee awarded three “papers of distinction”: 1. Pesto: Online Storage Performance Management in Virtualized Datacenters; 2. Opportunistic Flooding to Improve TCP Transmit Performance in Virtualized Clouds; 3. PrIter: A Distributed Framework for Prioritized Iterative Computations. I especially liked the TCP paper – the authors actually modified the kernel’s TCP stack, a painful task per my own past experience.

Our AMP Lab colleagues presented two talks – Improving Per-Node Efficiency in the Datacenter with New OS Abstractions (Barret Rhoden, Kevin Klues, David Zhu, and Eric Brewer), and Scaling the Mobile Millennium System in the Cloud (Timothy Hunter, Teodor Moldovan, Matei Zaharia, Justin Ma, Samy Merzgui, Michael Franklin, Pieter Abbeel, and Alexandre Bayen). Both went very well.

One train of thought appeared several times: how do system improvements demonstrated on artificial benchmarks translate to real-life situations? Folks from different organizations raised this point during the Q&A for several papers, with the response being the familiar lament about the shortage of large-scale system traces available to academia. This prompted our friend John Wilkes from Google to give a 1-slide impromptu presentation highlighting the imminent public release of some large-scale Google cluster traces and inviting researchers to work with Google. I felt it helpful to give a 1-slide impromptu follow-up presentation highlighting that the AMP Lab has access to large-scale system traces from several different organizations, inviting researchers to work with the AMP Lab and our industrial partners, and, of course, thanking our Google colleagues John Wilkes, Joseph L. Hellerstein, and others for their guidance on our early efforts to understand large-scale system workloads.

Portugal travel note 2: Consider taking in the stunning sunset at Castelo de Sao Jorge, set against the 25 de Abril Bridge across the River Tejo, with the Cristo Rei Statue lit by bright light on the opposite side of the River. Walking about the medieval Castle in semi-darkness is a unique and almost haunting experience, provided you can muster the courage and the night-vision. Or just head to the Bairro Alto historical neighborhood and stuff yourself on fantastic local food.

Latencies Gone Wild!


Cloud services are becoming popular for large-scale computing and data management. Amazon EC2 is a cloud service commonly used by individuals and companies, with clusters in 5 different regions: US East, US West, EU, Asia (Singapore), and Asia (Tokyo). However, failures can happen, even to entire clusters and regions. Amazon suffered a failure lasting several days in the US East region in April 2011. Eventually most services were restored, but 0.4% of database data could not be restored and was lost. Therefore, distributed systems that must be highly available have to be replicated across data centers. Moreover, avoiding data loss while providing higher levels of consistency can only be achieved through synchronous replication.

Since spanning multiple regions is important for reliable distributed systems, network message latencies are affected by the long distances involved. When two machines across the country or the globe need to communicate, the speed of light imposes a lower bound on network latency. For example, with roughly 4,000 kilometers between California and Virginia, the speed of light dictates that any round trip message takes at least about 26 milliseconds. RPCs within a single data center usually take less than 1 millisecond to complete, but RPCs between distant regions can be expected to take around 100 milliseconds or more. We ran a few experiments on EC2 to measure cross data center message delays and get a better idea of how different regions affect latencies.
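
The arithmetic behind that bound, as a quick sanity check:

```python
C_KM_PER_MS = 299_792.458 / 1000   # speed of light in vacuum, km per ms

def min_round_trip_ms(distance_km):
    """Hard physical floor on RTT: light must travel the path twice.
    Real networks are slower still (fiber is ~2/3 c, routes are not straight)."""
    return 2 * distance_km / C_KM_PER_MS

print(min_round_trip_ms(4000))   # ~26.7 ms for California <-> Virginia
```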

For the first experiment, we measured simulated 2048-byte echo RPCs between two machines in 3 different scenarios: both machines in the same data center; both machines in the same region but in different data centers; and the machines in different regions.

Data center labels:
west1 – data center 1 in the US West region
west2 – data center 2 in the US West region
east1 – data center 1 in the US East region

                   west1–west1   west1–west2   west1–east1
average                0.68 ms       1.68 ms      83.11 ms
99th percentile        0.88 ms       1.90 ms      83.68 ms

From the numbers, it is obvious that network latencies between the west and east coast of the US are about 2 orders of magnitude longer than latencies within a single data center.

Our second experiment measured latencies between some of the other regions for longer periods of time.  We collected latency measurements for about a week for RPCs between different regions.

The 4 cross-region pairs tested:
East (US) – EU (Ireland)
East (US) – Tokyo
West (US) – EU (Ireland)
West (US) – Tokyo

The measurements show that latencies between distant regions can vary wildly. There were spikes where RPCs took longer than a minute, and there were periods when latencies were consistently almost a second long.

These experiments show that message delays have a lot of variation, and that spikes of high latency are to be expected for cross data center network traffic. Globally reliable systems will need to expect longer network message delays and deal with them. Common techniques either risk data loss or fail to handle the longer latencies while achieving good performance. New solutions will have to be developed to provide fault-tolerant, reliable, globally distributed systems with usable performance. Stay tuned for details on our new project addressing this issue.