<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AMP Lab - UC Berkeley</title>
	<atom:link href="http://amplab.cs.berkeley.edu/feed/" rel="self" type="application/rss+xml" />
	<link>http://amplab.cs.berkeley.edu</link>
	<description>Algorithms, Machines and People Lab</description>
	<lastBuildDate>Wed, 16 May 2012 09:30:07 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>An NLP library for Matlab</title>
		<link>http://amplab.cs.berkeley.edu/2012/05/05/an-nlp-library-for-matlab/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/05/05/an-nlp-library-for-matlab/#comments</comments>
		<pubDate>Sat, 05 May 2012 09:23:17 +0000</pubDate>
		<dc:creator>faridani</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1959</guid>
		<description><![CDATA[Matlab is a great language for prototyping ideas. It comes with many libraries, especially for machine learning and statistics. But when it comes to processing natural language, Matlab is extremely slow. Because of this, many researchers use other languages to pre-process the text, &#8230; <a href="http://amplab.cs.berkeley.edu/2012/05/05/an-nlp-library-for-matlab/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-1960 alignright colorbox-1959" title="matlab nlp" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/05/matlab-nlp.png" alt="" width="180" height="136" /></p>
<p>Matlab is a great language for prototyping ideas. It comes with many libraries, especially for machine learning and statistics. But when it comes to processing natural language, Matlab is extremely slow. Because of this, many researchers use other languages to pre-process the text, convert it to numerical data, and then bring the resulting data into Matlab for further analysis.</p>
<p>I used to use Java for this. I would usually tokenize the text with Java, then save the resulting matrices to disk and read them in Matlab. After a while this procedure became cumbersome: I had to go back and forth between Java and Matlab, the process was prone to human error, and the codebase just looked ugly.</p>
<p>Recently, Jason Chen and I started putting together an NLP toolbox for Matlab. It is still a work in progress, but you can download the latest version from our GitHub repository [<a href="https://github.com/faridani/MatlabNLP">link</a>]. There is also an installation guide that helps you install it properly on your machine. I have built a simple map-reduce tool that allows you to use all of the cores on the CPU for many functions.</p>
<p>So far the toolbox has modules for text tokenization (Bernoulli, Multinomial, tf-idf, n-gram tools), text preprocessing (stop word removal, text cleaning, stemming) and some learning algorithms (linear regression, decision trees, support vector machines and a Naïve Bayes classifier). We have also implemented evaluation metrics (precision, recall, F1-score and MSE). The support vector machine tool is a wrapper around the famous LibSVM, and we are working on another wrapper for SVM-light. A part-of-speech tagger is coming very soon too.</p>
<p>I have been focusing on getting different parts running efficiently. For example, the tokenizer uses Matlab’s native hashmap data structure (containers.Map) to pass over the corpus and tokenize it efficiently.</p>
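<p>To illustrate the idea of a hash-map-backed tokenizer, here is a minimal sketch in Python rather than Matlab. It is not the toolbox's code: the function names <code>tokenize</code> and <code>build_term_counts</code> and the simple regular-expression tokenizer are assumptions made for this example. A dictionary plays the role of the hash map, so each token is assigned a column index in a single pass over the corpus.</p>
<pre><code>import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_term_counts(corpus):
    """Make one pass over the corpus, mapping each token to a column index
    and counting how often each token occurs in each document."""
    vocabulary = {}   # token -> column index (the hash map)
    rows = []         # one Counter of column index -> count per document
    for document in corpus:
        counts = Counter()
        for token in tokenize(document):
            # setdefault assigns the next free index the first time a token is seen
            index = vocabulary.setdefault(token, len(vocabulary))
            counts[index] += 1
        rows.append(counts)
    return vocabulary, rows

# Two made-up documents, purely for illustration.
corpus = ["The movie was great", "The plot was thin, the acting great"]
vocabulary, rows = build_term_counts(corpus)
print(vocabulary)
print([dict(row) for row in rows])
</code></pre>
<p>Because dictionary lookups take constant time on average, the cost of the pass grows with the number of tokens rather than with the size of the vocabulary.</p>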
<p>We are also adding examples and demos for this toolbox. The first example is a sentiment analysis tool that uses this library to predict whether a movie review is positive or negative. The code reaches an F1 score of 0.83; out of 200 movie reviews, it misclassified only 26.</p>
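<p>For readers unfamiliar with the metric, the sketch below shows how precision, recall and F1 are computed from predicted and actual labels. It is written in Python for illustration and is not the toolbox's implementation; the function name <code>precision_recall_f1</code> and the six labels are made up for this example.</p>
<pre><code>def precision_recall_f1(actual, predicted, positive="pos"):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for six reviews, purely for illustration.
actual = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg", "neg"]
print(precision_recall_f1(actual, predicted))
</code></pre>
<p>F1 is the harmonic mean of precision and recall for one class, so it is a different quantity from accuracy, which is simply the fraction of reviews classified correctly.</p>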
<p>Please try the toolbox, and note that it is still a work in progress: some functions are still slow and we are working to improve them. I would love to hear what you think. If you want us to implement something that might be useful to you, just let us know.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/05/05/an-nlp-library-for-matlab/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Connecting Big Data around the World</title>
		<link>http://amplab.cs.berkeley.edu/2012/03/29/connecting-big-data-around-the-world/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/03/29/connecting-big-data-around-the-world/#comments</comments>
		<pubDate>Thu, 29 Mar 2012 08:00:37 +0000</pubDate>
		<dc:creator>Tim Kraska</dc:creator>
				<category><![CDATA[SCADS]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1659</guid>
		<description><![CDATA[The internet enables millions of users world-wide to create, modify and share data through platforms like Twitter, Facebook, GMail and many other services generating gigantic data sets. This world-wide data access requires replicating the data across multiple data centers, not &#8230; <a href="http://amplab.cs.berkeley.edu/2012/03/29/connecting-big-data-around-the-world/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The internet enables millions of users world-wide to create, modify and share data through platforms like Twitter, Facebook, GMail and many other services, generating gigantic data sets. This world-wide data access requires replicating the data across multiple data centers, not only to bring the data closer to the user for shorter latencies, but also to increase availability in case of a data center failure.<br />
However, keeping replicas synchronized and consistent, so that a user&#8217;s data is never lost and always up to date, is expensive. Inter-data center network delays are in the <a title="Latencies Gone Wild!" href="http://amplab.cs.berkeley.edu/2011/10/20/latencies-gone-wild/">hundreds of milliseconds and vary significantly</a>. Therefore, synchronous wide-area replication with strong consistency has been assumed infeasible for interactive applications. Current solutions settle either for asynchronous replication, which risks losing data in the event of failures, or for relaxed consistency, which can, for example, cause updates to appear and disappear from the application in an unpredictable fashion.</p>
<p><a href="http://mdcc.cs.berkeley.edu/"><img class="alignright colorbox-1659" title="mdcc_logo_small" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mdcc_logo_small1.png" alt="" width="162" height="128" /></a>With <a title="MDCC Website" href="http://mdcc.cs.berkeley.edu/"><strong>MDCC </strong>(Multi-Data Center Consistency)</a>, we describe the first synchronous replication protocol that does not require a master or static partitioning and is strongly consistent at a cost similar to eventually consistent protocols: in the normal operational case, it needs only a single round-trip across data centers to apply an update. That is, not only do users get faster response times because the data is located close to them, but they also always experience the same consistency and application behavior, even in the presence of major failures. We further propose a new programming model which empowers the application developer to handle the longer and unpredictable latencies caused by inter-data center communication more effectively. Our evaluation uses the TPC-W benchmark, which simulates a web-shop like Amazon, with MDCC deployed across 5 geographically diverse data centers. It shows that MDCC achieves transaction throughput and latency similar to eventually consistent quorum protocols and sustains a data-center outage without a significant impact on response times, all while guaranteeing strong consistency.</p>
<p>For more information please visit our <a title="MDCC web-site" href="http://mdcc.cs.berkeley.edu/">MDCC web-site</a>.</p>
<p>MDCC was developed by <a title="Tim Kraska" href="http://www.cs.berkeley.edu/~kraska/">Tim Kraska</a>, <a title="Gene Pang" href="http://www.cs.berkeley.edu/~gpang/">Gene Pang</a>, <a title="Michael J. Franklin" href="http://www.cs.berkeley.edu/~franklin/">Mike Franklin</a> and <a title="Samuel Madden" href="http://db.lcs.mit.edu/madden/">Samuel Madden</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/03/29/connecting-big-data-around-the-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>News from the First Two Spark User Meetups</title>
		<link>http://amplab.cs.berkeley.edu/2012/03/28/news-from-the-first-two-spark-user-meetups/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/03/28/news-from-the-first-two-spark-user-meetups/#comments</comments>
		<pubDate>Wed, 28 Mar 2012 21:04:07 +0000</pubDate>
		<dc:creator>Matei Zaharia</dc:creator>
				<category><![CDATA[Spark]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1677</guid>
		<description><![CDATA[One of the neat things about doing research in big data has always been the strong open source culture in this field &#8212; many of the widely used software projects are open source, and if you release a new algorithm &#8230; <a href="http://amplab.cs.berkeley.edu/2012/03/28/news-from-the-first-two-spark-user-meetups/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>One of the neat things about doing research in big data has always been the strong open source culture in this field &#8212; many of the widely used software projects are open source, and if you release a new algorithm or tool, there&#8217;s a chance that someone will use it. About a year ago, we started making open source releases of <a href="http://www.spark-project.org">Spark</a>, one of the first parallel data processing frameworks in the AMP Lab stack. Spark promises 10-20x faster performance than existing tools thanks to its ability to perform computations in memory, as well as an easy-to-use programming interface in the Scala language.</p>
<p>Ten months later, we&#8217;re excited to see how far the Spark community has grown. In particular, it&#8217;s passed that threshold where we knew each user personally, to reach a point where we primarily get questions, code contributions, and feature requests from users that we <em>don&#8217;t</em> know. Most awesome of those are the ones that start with &#8220;I tried Spark and it rocked&#8221; before asking a question. This is great not just because it improves the software, but because it&#8217;s often let us discover new use cases that we didn&#8217;t anticipate, and led to new research. For example, the <a href="http://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Managing-Data-Transfers-in-Computer-Clusters-with-Orchestra.pdf">Orchestra</a> work at SIGCOMM 2011 was partly motivated by users running machine learning algorithms that required large broadcasts.</p>
<p>To keep in touch with the community, we&#8217;ve started hosting a regular <a href="http://www.meetup.com/spark-users/">Spark User Meetup</a>, which is a mix of tutorials, presentations of upcoming features, presentations from users, and Q&amp;A. The first two meetups were held at <a href="http://klout.com/">Klout</a> and <a href="http://www.conviva.com/">Conviva</a>. I wanted to give a quick summary of the topics presented for those who missed them:</p>
<p><strong>January Meetup</strong></p>
<ul>
<li>Matei Zaharia from Berkeley talked about the goals of the Spark project and gave a tutorial on how to set it up locally or on Amazon EC2. The goal was to show people where the various new features we are developing fit in and where to find help on how to use what&#8217;s already there. The <a href="http://files.meetup.com/3138542/Introduction%20and%20Tutorial.pptx">slides</a> are available online (PPTX).</li>
<li>Karthik Thiyagarajan from <a href="http://quantifind.com/">Quantifind</a> explained how they are starting to use Spark for realtime exploration of time series data. Quantifind is a startup that offers predictive analytics &#8212; identifying trends from time series to make decisions. They use Spark as an almost realtime database service, where new data is ingested periodically and can be queried interactively from a web interface. Check out Karthik&#8217;s <a href="http://files.meetup.com/3138542/Quantifind%20Spark%20User%20Group%20Talk.pdf">slides</a> for more details. This is one of the use cases we&#8217;re really interested in exploring further.</li>
</ul>
<p><strong>February Meetup</strong></p>
<ul>
<li>This was the first unveiling of <a href="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf">Shark</a>, a port of Apache Hive onto Spark that we are developing. Hive is a popular large-scale data warehouse that provides a SQL interface for running queries on Hadoop MapReduce. With Shark, we can run the same queries over cached in-memory data in Spark, leading to up to 10x better performance for interactive data mining. One neat thing about the port is that it&#8217;s backwards-compatible with Hive, using the same language, user-defined functions, and metadata store, so it can run seamlessly on existing Hive data. Cliff Engle from the Shark team gave a <a href="http://files.meetup.com/3138542/Shark%20-%20Spark%20Meetup%202-23-12.pptx">talk</a> (PPTX) on Shark&#8217;s design and some initial results.</li>
<li>Dilip Joseph from <a href="http://www.conviva.com/">Conviva</a> talked about their use of Spark for a variety of reporting and analytics applications. Conviva provides video streaming optimization and management systems that need to deliver high-quality live video to thousands of concurrent viewers. They use Spark in combination with Hadoop and Hive to analyze the large sets of resulting logs, compute statistics over the data, and identify problems or optimization opportunities. They were some of the first production users of Spark, and today, they run 30% of their reports in Spark. Check out Dilip&#8217;s <a href="http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva">blog post</a> on the Conviva engineering blog for more details.</li>
</ul>
<p>Both meetups had full rooms with over 40 people attending, which is great. We&#8217;re hoping to hold the next meetup in the first or second week of April. Please sign up for the <a href="http://www.meetup.com/spark-users/">meetup.com group</a> if you&#8217;re interested.</p>
<p>In other Spark news, a <a href="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf">demo paper</a> on Shark, the Hive-on-Spark port we are developing, was accepted at the SIGMOD conference, while a <a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">paper</a> on Spark itself will appear at NSDI. Additionally, the <a href="http://incubator.apache.org/mesos/">Mesos</a> cluster manager developed in our lab, which Spark runs over, called its first Apache release vote today to release version 0.9, a major milestone that contains new usability, fault tolerance, and stability features developed at Twitter. Finally, Spark and Shark talks were accepted at <a href="http://days2012.scala-lang.org/">Scala Days 2012</a> (in London, England) and the <a href="http://hadoopsummit.org/">Hadoop Summit</a> (in Sunnyvale, CA). If you&#8217;re at those conferences and want to learn more about Spark, please drop by!</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/03/28/news-from-the-first-two-spark-user-meetups/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sweet Storage SLOs with Frosting</title>
		<link>http://amplab.cs.berkeley.edu/2012/03/19/sweet-storage-slos-with-frosting/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/03/19/sweet-storage-slos-with-frosting/#comments</comments>
		<pubDate>Tue, 20 Mar 2012 06:28:06 +0000</pubDate>
		<dc:creator>awang</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[SLO]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1628</guid>
		<description><![CDATA[A typical page load requires sourcing and combining many pieces of data. For example, a frontend application like a newsfeed requires making many storage requests to fetch your name, your photo, your friends and their photos, and your friends&#8217; most &#8230; <a href="http://amplab.cs.berkeley.edu/2012/03/19/sweet-storage-slos-with-frosting/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A typical page load requires sourcing and combining many pieces of data. For example, a frontend application like a newsfeed requires making many storage requests to fetch your name, your photo, your friends and their photos, and your friends&#8217; most recent posts. Since loading a page requires making many of these storage requests, controlling storage request latency is crucial to reducing overall page load times.</p>
<p>Storage systems are thus provisioned and tuned to meet these latency requirements. However, this requires provisioning for peak, not average, load. This means that the hardware is often underutilized. MapReduce batch analytics jobs are perfect for taking up this excess <em>slack</em> capacity in the system. However, traditional storage systems are unable to support both a MapReduce and frontend workload without adversely affecting frontend latency. This is compounded by the dynamic, time-variant nature of a frontend workload, which makes it difficult to tune the storage system for a single set of conditions.</p>
<p>This is what motivated our work on Frosting. Frosting is a request scheduling layer on top of <a title="HBase" href="http://hbase.apache.org/" target="_blank">HBase</a>, a distributed column-store, which dynamically tunes its internal scheduling to meet the requirements of the current workload. Application programmers directly specify high-level performance requirements to Frosting in the form of <em>service-level objectives (SLOs)</em>, which are throughput or latency requirements on operations to HBase. Frosting then carefully admits requests to HBase such that these SLOs are met.</p>
<p>In the case of combining a high-priority, latency-sensitive frontend workload and a low-priority, throughput-oriented MapReduce workload, Frosting will continually monitor the frontend&#8217;s empirical latency and only admit requests from MapReduce when the frontend&#8217;s SLO is satisfied. For instance, if the frontend is easily meeting its latency target, Frosting might choose to admit more MapReduce requests since there is slack capacity in the system. If the frontend latency increases above its SLO due to increased load, Frosting will accordingly admit fewer MapReduce requests.</p>
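<p>The feedback loop described above can be sketched in a few lines of Python. This is only an illustration, not Frosting's actual scheduler: the <code>AdmissionController</code> class, its parameters and the additive-increase, multiplicative-decrease policy are all assumptions chosen for the example.</p>
<pre><code>class AdmissionController:
    """Toy feedback loop: adjust how many low-priority (MapReduce) requests
    are admitted per interval based on the observed frontend latency."""

    def __init__(self, slo_ms, max_batch_rate=100, step=5):
        self.slo_ms = slo_ms                  # frontend latency target
        self.max_batch_rate = max_batch_rate  # upper bound on admitted batch requests
        self.step = step                      # additive increase per interval
        self.batch_rate = 0                   # batch requests admitted per interval

    def update(self, observed_latency_ms):
        """Call once per monitoring interval with the measured frontend latency."""
        if observed_latency_ms > self.slo_ms:
            # SLO violated: back off sharply by halving the batch admission rate.
            self.batch_rate = self.batch_rate // 2
        else:
            # SLO met: there is slack, so admit a few more batch requests.
            self.batch_rate = min(self.batch_rate + self.step, self.max_batch_rate)
        return self.batch_rate

# Made-up latency measurements against a hypothetical 100 ms SLO.
controller = AdmissionController(slo_ms=100)
for latency in [40, 45, 60, 120, 80]:
    print(latency, controller.update(latency))
</code></pre>
<p>Backing off faster than it ramps up is one simple way to favor the latency-sensitive workload; a real scheduler would also need to smooth noisy latency measurements.</p>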
<p>This is ongoing work with Shivaram Venkataraman (shivaram@eecs) and Sara Alspaugh (alspaugh@eecs). No paper is yet available publicly. However, we&#8217;d love to talk to you if you&#8217;re interested in Frosting, especially if you use HBase in production, or have workload traces that we could get access to.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/03/19/sweet-storage-slos-with-frosting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Energy Debugging with Carat Enters Beta</title>
		<link>http://amplab.cs.berkeley.edu/2012/02/15/energy-debugging-with-carat-enters-beta/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/02/15/energy-debugging-with-carat-enters-beta/#comments</comments>
		<pubDate>Thu, 16 Feb 2012 01:25:39 +0000</pubDate>
		<dc:creator>Adam Oliner</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1499</guid>
		<description><![CDATA[Carat is a new research project in the AMP Lab that aims to detect energy bugs&#8212;app behavior that is consuming energy unnecessarily&#8212;using data collected from a community of mobile devices. Carat provides users with actions they can take to improve battery life (and the &#8230; <a href="http://amplab.cs.berkeley.edu/2012/02/15/energy-debugging-with-carat-enters-beta/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;"><a href="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/02/carat-icon.png"><img class="size-full wp-image-1510 alignright colorbox-1499" title="carat-icon" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/02/carat-icon.png" alt="" width="80" height="80" /></a><a title="Carat" href="http://carat.cs.berkeley.edu/">Carat</a> is a new research project in the AMP Lab that aims to detect energy bugs&#8212;app behavior that is consuming energy unnecessarily&#8212;using data collected from a community of mobile devices. Carat provides users with actions they can take to improve battery life (and the expected improvements).</p>
<p><img class="alignright size-full wp-image-1502 alignnone colorbox-1499" title="carat-actions" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/02/carat-actions.png" alt="" width="320" height="480" /></p>
<p style="text-align: left;">Carat collects usage data on devices (we care about <a title="Carat Privacy Policy" href="http://carat.cs.berkeley.edu/#privacy">privacy</a>), aggregates these data in the cloud, performs a statistical analysis using <a title="Spark" href="http://www.spark-project.org/">Spark</a>, and reports the results back to users. In addition to the Action List shown in the figure, the app empowers users to dive into the data, answering questions like &#8220;How does my energy use compare to similar devices?&#8221; and &#8220;What specific information is being sent to the server?&#8221;</p>
<p style="text-align: left;">The key insight of our approach is that we can acquire implicit statistical specifications of what constitutes &#8220;normal&#8221; energy use under different circumstances. This idea of statistical debugging has been applied to correctness and performance bugs, but this is the first application to energy bugs. The project faces a number of interesting (and sometimes distinguishing) technical challenges, such as accounting for sampling bias, reasoning with noisy and incomplete information, and providing users with an experience that rewards them for participating.</p>
<p>We need your help testing our iOS implementation and gathering some initial data! If you have an iPhone or iPad with iOS 5.0 or later and are willing to give us a few minutes of your time, <a title="Carat Beta Test" href="http://bit.ly/zbKT5u">please click here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/02/15/energy-debugging-with-carat-enters-beta/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting It All from the Crowd</title>
		<link>http://amplab.cs.berkeley.edu/2012/02/14/getting-it-all-from-the-crowd/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/02/14/getting-it-all-from-the-crowd/#comments</comments>
		<pubDate>Wed, 15 Feb 2012 04:52:59 +0000</pubDate>
		<dc:creator>Beth Trushkowsky</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1479</guid>
		<description><![CDATA[What does a query result mean when the data comes from the crowd? This is one of the fundamental questions raised by CrowdDB, a hybrid human/machine database system developed here in the AMP lab. For example, consider what could be thought of &#8230; <a href="http://amplab.cs.berkeley.edu/2012/02/14/getting-it-all-from-the-crowd/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What does a query result mean when the data comes from the crowd? This is one of the fundamental questions raised by <a href="http://amplab.cs.berkeley.edu/projects/crowddb-answering-queries-with-crowdsourcing/" target="_blank">CrowdDB</a>, a hybrid human/machine database system developed here in the AMP lab. For example, consider what could be thought of as the simplest query: SELECT * FROM <em>table</em>. If tuples are being provided by the crowd, how do you know when the query is complete? Can you really get them all?</p>
<p>In traditional database systems, query processing is based on the closed-world assumption: all data relevant to a query is assumed to reside in the database. When the data is crowdsourced, we find that this assumption no longer applies; existing information could be extended further by asking the crowd. However, in our <a href="http://arxiv.org/abs/1202.2335" target="_blank">current work</a> we show that it is possible to understand query results in the &#8220;open world&#8221; by reasoning about query completeness and the cost-benefit tradeoff of acquiring more data.</p>
<p>Consider asking workers on a crowdsourcing platform (e.g., Amazon Mechanical Turk) to provide items in a set (one at a time). As you can imagine, answers arriving from the crowd follow a pattern of diminishing returns: initially there is a high rate of arrival for previously unseen answers, but as the query progresses the arrival rate of new answers begins to taper off. The figure below shows an example of this curve when we asked workers to provide names of the US States.</p>
<div class="wp-caption aligncenter" style="width: 310px"><img class="  colorbox-1479" title="US States" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/02/sac_states_avg.jpg" alt="average SAC from States experiment" width="300" height="256" /><p class="wp-caption-text">Number of unique answers seen vs. total number of answers in the US States experiment (average)</p></div>
<p>This behavior is well-known in fields such as biology and statistics, where this type of figure is known as the Species Accumulation Curve (SAC). This analysis is part of the species estimation problem; the goal is to estimate the number of distinct species using observations of species in the locale of interest. We apply these techniques in the new context of crowdsourced queries by drawing an analogy between observed species and worker answers from the crowd. It turned out that the estimation algorithms sometimes fail due to crowd-specific behaviors like some workers providing many more answers than others (&#8220;streakers vs. samplers&#8221;). We address this by designing a heuristic that reduces the impact of overzealous workers. We also demonstrate a heuristic to detect when workers are consulting the same list on the web, helpful if the system wants to switch to another data gathering regime like webpage scraping.</p>
<p>Species estimation techniques provide a way to reason about query results, despite being in the open world. For queries with a bounded result size, we can form a progress estimate as answers arrive by predicting the cardinality of the result set. Of course, some sets are very large or contain items that few workers would think of or find (the long tail), so it does not make sense to try to predict set size. For these cases, we propose a pay-as-you-go approach to directly consider the benefit of asking the crowd for more answers.</p>
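<p>As a rough illustration of these ideas, the Python sketch below computes an accumulation curve from a stream of answers and applies the bias-corrected Chao1 estimator, a classical estimator from the species estimation literature. It is not necessarily the estimator used in the paper, it omits the heuristics for overzealous workers, and the function names and the sample answers are made up for this example.</p>
<pre><code>from collections import Counter

def accumulation_curve(answers):
    """Number of distinct answers seen after each answer arrives."""
    seen, curve = set(), []
    for answer in answers:
        seen.add(answer)
        curve.append(len(seen))
    return curve

def chao1_estimate(answers):
    """Bias-corrected Chao1 estimate of the total number of distinct items."""
    frequencies = Counter(answers)
    observed = len(frequencies)
    singletons = sum(1 for n in frequencies.values() if n == 1)  # seen exactly once
    doubletons = sum(1 for n in frequencies.values() if n == 2)  # seen exactly twice
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Made-up stream of worker answers, purely for illustration.
answers = ["Texas", "Ohio", "Texas", "Utah", "Ohio", "Iowa", "Texas", "Maine"]
print(accumulation_curve(answers))
print(chao1_estimate(answers))
</code></pre>
<p>Dividing the number of distinct answers observed so far by such an estimate gives a simple progress indicator for a query with a bounded result size.</p>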
<p>For more details, please check out the <a href="http://arxiv.org/abs/1202.2335" target="_blank">paper</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/02/14/getting-it-all-from-the-crowd/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Highlights From the AMPLab Winter 2012 Retreat</title>
		<link>http://amplab.cs.berkeley.edu/2012/01/20/highlights-from-the-amplab-winter-2012-retreat/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/01/20/highlights-from-the-amplab-winter-2012-retreat/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 17:50:13 +0000</pubDate>
		<dc:creator>franklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1389</guid>
		<description><![CDATA[The 2nd AMPLab research retreat was held Jan 11-13, 2012 at a mostly snowless Lake Tahoe.   120 people from 21 companies, several other schools and labs, and of course UC Berkeley spent 2.5 days getting an update on the &#8230; <a href="http://amplab.cs.berkeley.edu/2012/01/20/highlights-from-the-amplab-winter-2012-retreat/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The 2nd AMPLab research retreat was held Jan 11-13, 2012 at a mostly snowless Lake Tahoe. 120 people from 21 companies, several other schools and labs, and of course UC Berkeley spent 2.5 days getting an update on the current state of research in the lab, discussing trends and challenges in Big Data analytics, and sharing ideas, opinions and advice. Unlike our first retreat, held last May, which was long on vision and inspiring guest speakers, the focus of this retreat was current research results and progress. Other than a few short overview/intro talks by faculty, virtually all of the talks (16 out of 17) were presented by students from the lab. Some of these talks discussed research that had been recently published, but most of them discussed work that was currently underway, or in some cases, just getting started.</p>
<p style="text-align: center;"><a href="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/AMPall.jpg"><img class="aligncenter  wp-image-1407 colorbox-1389" title="AMPall" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/AMPall-1024x282.jpg" alt="" width="642" height="176" /></a></p>
<p>The first set of talks was focused on Applications.   Tim Hunter described how he and others used Spark to improve the scalability of the core traffic estimation algorithm used in the Mobile Millennium system, giving them the ability to run models faster than real-time and to scale to larger road networks.   Alex Kantchelian presented some very cool results for algorithmically detecting spam in tweet streams.    Matei Zaharia described a new algorithmic approach to Sequence Alignment called SNAP.   SNAP rethinks sequence alignment to exploit the longer reads that are being produced by modern sequencing machines and shows 10x to 100x speed ups over the state of the art, as well as improvements in accuracy.</p>
<p><img class="alignright colorbox-1389" title="AMPSki" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/AMPSki1.jpg" alt="" width="382" height="254" /></p>
<p>The second technical session was about the Data Management portion of the BDAS (Berkeley Data Analytics System) stack that we are building in the AMPLab.    Newly-converted database professor Ion Stoica gave an overview of the components.   And then there were short talks on SHARK (an implementation of the Hive SQL processor on Spark),  Quicksilver &#8211; an approximate query answering system that is aimed at massive data,  scale-independent view maintenance in PIQL (the Performance Insightful Query Language), and a streaming (i.e., very low-latency) implementation of Spark.  These were presented by Reynold Xin, Sameer Agarwal, Michael Armbrust and Matei Zaharia, respectively.   Undergrad Ankur Dave wrapped up the session by wowing the crowd with a live demo of the Spark Debugger that he built &#8211; showing how the system can be used to isolate faults in some pretty gnarly, parallel data flows.</p>
<p>The Algorithms and People parts of the AMP agenda were represented in the 3rd technical session. John Duchi presented his results on speeding up stochastic optimization for a host of applications. He developed a parallelized method for introducing random noise into the process that leads to faster convergence. Fabian Wauthier reprised his recent NIPS talk on detecting and correcting for bias in crowdsourced input. Beth Trushkowsky talked about &#8220;Getting it all from the Crowd&#8221;, and showed how we must think differently about the meaning of queries in a hybrid machine/human database system such as CrowdDB.</p>
<p><img class="alignright colorbox-1389" title="AMPAli" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/AMPAli-1024x682.jpg" alt="" width="420" height="280" />A session on Machine-focused topics included talks by Ali Ghodsi on the PacMan caching approach for map-reduce style workloads,  Patrick Wendell on early work on low-latency scheduling of parallel jobs, Mosharaf Chowdhury on fair sharing of network resources in large clusters, and Gene Pang on a new programming model and consistency protocol for applications that span multiple data centers.</p>
<p>The technical talks were rounded out by two presentations from students who worked with partner companies to get access to real workloads, logs and systems traces.  Yanpei Chen talked about an analysis of the characteristics of various MapReduce loads from a number of sources.   Ari Rabkin presented an analysis of trouble tickets from Cloudera.</p>
<p>As always, we got a lot of feedback from our Industrial attendees. A vigorous debate broke out about the extent to which the lab should work on producing a complete, industrial-strength analytics stack. Some felt we should aim to match the success of earlier high-impact projects coming out of Berkeley, such as BSD and Ingres. Others insisted that we focus on high-risk, further-out research and leave the systems building to them. There were also discussions about challenge applications (such as the Genomics X Prize competition) and how to ensure that we achieve the high degree of integration among the Algorithms, Machines and People components of the work, which is the hallmark of our research agenda. Another topic of great interest to the Industrial attendees was how to better facilitate interactions and internships with the always amazing and increasingly in-demand students in the lab.</p>
<p><a href="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/AMPPoster.jpg"> <img class="alignright colorbox-1389" title="AMPPoster" src="http://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/AMPPoster-1024x682.jpg" alt="" width="420" height="279" /></a>From a logistical point of view, we tried a few new things. The biggest change was with the poster session(s). As always, the cost of admission for students was to present a poster of their current research. This year, however, we also invited visitors to submit posters describing relevant work at their companies in general, and projects for which they were looking to hire interns in particular. We then partitioned the posters into two separate poster sessions (one each night), giving everyone a chance to spend more time discussing the projects that they were most interested in while still getting a chance to survey the wide scope of work being presented. Feedback on both of these changes was overwhelmingly positive, so we&#8217;ll likely stick to this new format for future retreats.</p>
<p>Kattt Atchley, Jon Kuroda and Sean McMahon did a flawless job of organizing the retreat.   Thanks to them and all the presenters and attendees for making it a very successful event.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/01/20/highlights-from-the-amplab-winter-2012-retreat/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Traffic jams, cell phones and big data</title>
		<link>http://amplab.cs.berkeley.edu/2012/01/18/traffic-jams-cell-phones-and-big-data/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/01/18/traffic-jams-cell-phones-and-big-data/#comments</comments>
		<pubDate>Wed, 18 Jan 2012 05:41:00 +0000</pubDate>
		<dc:creator>Timothy Hunter</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[mobilemillennium]]></category>
		<category><![CDATA[spark]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1350</guid>
		<description><![CDATA[(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly) It is well known that big data processing is becoming increasingly important in many scientific fields including astronomy, biomedicine and climatology.  In addition, newly created hybrid disciplines like biostatistics are even stronger indicators of this &#8230; <a href="http://amplab.cs.berkeley.edu/2012/01/18/traffic-jams-cell-phones-and-big-data/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><em>(With contributions from Michael Armbrust, Leah Anderson and Jack Reilly)</em></p>
<p>It is well known that big data processing is becoming increasingly important in many scientific fields including <a href="http://www.ncsa.illinois.edu/News/Stories/big_data/" target="_blank">astronomy</a>, <a href="http://commonfund.nih.gov/InnovationBrainstorm/post/Meeting-the-Challenge-of-Big-Data-in-Biomedical-and-Translational-Science-(see-e2809cCross-Cutting-Issues-in-Computation-and-Informaticse2809d-in-Innovation-Brainstorm-ideas).aspx" target="_blank">biomedicine</a> and <a href="http://strataconf.com/strata2012/public/schedule/detail/22511" target="_blank">climatology</a>.  In addition, newly created hybrid disciplines like biostatistics are even stronger indicators of this overall trend. Other fields like civil engineering, and in particular transportation, are no exception to the rule, and the AMP Lab is actively collaborating with the department of <a href="http://its.berkeley.edu/" target="_blank">Intelligent Transportation Systems at Berkeley</a> to explore this new frontier.</p>
<p>It comes as no surprise to residents of California that congestion on the streets is a major challenge that affects everyone. While it is well studied for highways, it remains an open question for urban streets (also called the arterial road network). So far, the most promising source of data is the GPS of cellphones. However, a large volume of this very noisy data is required in order to maintain a good level of accuracy. The rapid adoption of smartphones, all equipped with GPS, is changing the game. In this post, I introduce some ongoing efforts to combine <a title="The Mobile Millennium project" href="http://traffic.berkeley.edu/">Mobile Millennium</a>, a state-of-the-art transportation framework, with the AMPLab software stack.</p>
<p>What does this GPS data look like? Here is an example in the San Francisco Bay area: a few hundred taxicabs relay their position every minute in real time to our servers.<div class="myvideotag" style="width: 640px;"><iframe width="640" height="390" src="http://www.youtube.com/embed/OxCPL4KsDfI" frameborder="0" allowfullscreen></iframe></div></p>
<p>The precise trajectories of the vehicles are unobserved and need to be reconstructed using a sophisticated map matching pipeline implemented in Mobile Millennium. The result of this process is a set of timestamped trajectory segments, which are the basic observations used to predict traffic.<div class="myvideotag" style="width: 640px;"><iframe width="640" height="390" src="http://www.youtube.com/embed/tj53gGCCNgs" frameborder="0" allowfullscreen></iframe></div><br />
Our traffic estimation algorithms work by guessing a probability distribution of the travel time on each link of the road network. This process is iterated to improve the quality of the estimates. The overall algorithm is intensive in both computation and memory. Fortunately, it also fits into the category of &#8220;embarrassingly parallel&#8221; algorithms and is a perfect candidate for distributed computing.<br />
Implementing a high-performance algorithm as a distributed system is not an easy task. Rather than building the distribution layer by hand, our implementation relies on <a href="http://spark-project.org">Spark</a>, a programming framework in Scala. Thanks to the Spark framework, we were able to port our single-machine implementation to the EC2 cloud within a few weeks and achieve nearly linear scaling. In a future post, I will discuss some practical considerations we faced when integrating the AMPLab stack with the Mobile Millennium system.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/01/18/traffic-jams-cell-phones-and-big-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Trip Report from the NIPS Big Learning Workshop</title>
		<link>http://amplab.cs.berkeley.edu/2012/01/16/trip-report-from-the-nips-big-learning-workshop/</link>
		<comments>http://amplab.cs.berkeley.edu/2012/01/16/trip-report-from-the-nips-big-learning-workshop/#comments</comments>
		<pubDate>Mon, 16 Jan 2012 23:22:43 +0000</pubDate>
		<dc:creator>Matei Zaharia</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1360</guid>
		<description><![CDATA[A few weeks ago, I went to the Big Learning workshop at NIPS, held in Spain. The workshop brought together researchers in large-scale machine learning, an area near and dear to the AMP Lab&#8217;s goal of integrating Algorithms, Machines, and People to tame big &#8230; <a href="http://amplab.cs.berkeley.edu/2012/01/16/trip-report-from-the-nips-big-learning-workshop/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A few weeks ago, I went to the <a href="http://biglearn.org">Big Learning workshop</a> at NIPS, held in Spain. The workshop brought together researchers in large-scale machine learning, an area near and dear to the AMP Lab&#8217;s goal of integrating Algorithms, Machines, and People to tame big data, and contained a lot of interesting work. There were about ten invited talks and ten paper presentations. I myself gave an <a href="http://www.cs.berkeley.edu/~matei/talks/2011/biglearn_spark_novideo.pptx">invited talk</a> on <a href="http://www.spark-project.org">Spark</a>, our framework for large-scale parallel computing, which won a runner-up best presentation award.</p>
<p>The topics presented ranged from FPGAs to accelerate vision algorithms in embedded devices, to GPU programming, to cloud computing on commodity clusters. For me, some highlights included the discussion on <a href="http://biglearn.org/files/slides/contributed/murray.pdf">training the Kinect pose recognition algorithm using DryadLINQ</a>, which ran on several thousand cores and had to overcome substantial fault mitigation and I/O challenges; and the <a href="http://biglearn.org/files/slides/invited/guestrin.pptx">GraphLab presentation</a> from CMU, which discussed many interesting applications implemented using their asynchronous programming model. Daniel Whiteson from UC Irvine also gave an extremely entertaining talk on <a href="http://biglearn.org/files/slides/invited/whiteson.pdf">the role of machine learning in the search for new subatomic particles</a>.</p>
<p>One of the groups I was happy to see represented was the <a href="http://www.scala-lang.org">Scala</a> programming language team from EPFL. Scala features prominently as a high-level language for parallel computing. We use it in the Spark programming framework in our lab, as well as the <a href="https://database.cs.wisc.edu/cidr/cidr2009/Paper_86.pdf">SCADS</a> scalable key-value store. It&#8217;s also used heavily in the Pervasive Parallelism Lab at a certain school across the bay. It was good to hear that the Scala team is working on new features that will make the language easier to use as a DSL for parallel computing, making it simpler to build highly expressive programming tools in Scala such as Spark.</p>
<p>The AMP Lab was also represented by John Duchi, who presented a new algorithm for <a href="http://biglearn.org/files/slides/contributed/duchi.pdf">stochastic gradient descent in non-smooth problems</a> that is the first parallelizable approach for these problems, and Ariel Kleiner and Ameet Talwalkar, who presented the <a href="http://biglearn.org/files/slides/contributed/kleiner.pdf">Bag of Little Bootstraps</a>, a scalable bootstrap algorithm based on subsampling. It&#8217;s certainly neat to see two successes in parallelizing very disparate statistical algorithms one year into the AMP Lab.</p>
<p>In summary, the workshop showcased very diverse ideas and showed that big learning is a hot field. It was the biggest workshop at NIPS this year. In the future, as users gain experience with the various programming models and the best algorithms for each problem type are found, we expect to see some consolidation of these ideas into unified stacks of composable tools. Designing and building such a stack is one of the main goals of our lab.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2012/01/16/trip-report-from-the-nips-big-learning-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An AMP Blab about some recent system conferences &#8211; Part 3: Hadoop World 2011</title>
		<link>http://amplab.cs.berkeley.edu/2011/11/29/an-amp-blab-about-some-recent-system-conferences-part-3-hadoop-world-2011/</link>
		<comments>http://amplab.cs.berkeley.edu/2011/11/29/an-amp-blab-about-some-recent-system-conferences-part-3-hadoop-world-2011/#comments</comments>
		<pubDate>Tue, 29 Nov 2011 21:34:17 +0000</pubDate>
		<dc:creator>ychen</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://amplab.cs.berkeley.edu/?p=1157</guid>
		<description><![CDATA[I recently had the pleasure of visiting Portugal for SOSP/SOCC, and New York for Hadoop World. Below are some bits that I found interesting. This is the personal opinion of an AMP Lab grad student &#8211; in no way does &#8230; <a href="http://amplab.cs.berkeley.edu/2011/11/29/an-amp-blab-about-some-recent-system-conferences-part-3-hadoop-world-2011/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I recently had the pleasure of visiting Portugal for SOSP/SOCC, and New York for Hadoop World. Below are some bits that I found interesting. This is the personal opinion of an AMP Lab grad student &#8211; in no way does it represent any official or unanimous AMP Lab position.</p>
<p><strong>Part 3: Hadoop World 2011</strong></p>
<p>Not exactly a research conference, Hadoop World is a multi-track industry convention hosted by Cloudera, an enterprise Hadoop vendor, and draws various companies with some stake in the Hadoop community. This year&#8217;s Hadoop World saw some 1500 attendees, including Hadoop vendors, Hadoop users, executives from various companies, vendors building on top of Hadoop, people looking to learn more about Hadoop, and of course, a small contingent of researchers. I believe Hadoop World is a good place for researchers to get a state-of-the-industry view of the big data, big systems space.</p>
<p>One theme is that Hadoop has really become &#8220;mainstream&#8221;, and moved much beyond its initial use cases in supporting e-commerce type services. The convention <a href="http://www.hadoopworld.com/agenda/" target="_blank">agenda</a> included talks from household names beyond typical high-tech industries. The talks also had audiences in ripped jeans and flip flops sitting next to others in pressed three piece suits, indicating the present diversity of the community, and perhaps pointing to opportunities for multi-disciplinary collaboration in the near future.</p>
<p>Accel Partners announced a $100M &#8220;Big data fund&#8221; to accelerate innovation in all layers of the &#8220;big data stack&#8221;. This should be of interest to entrepreneurial-minded students within the Lab.</p>
<p>Another theme is that Hadoop is still waiting for a &#8220;killer app&#8221;. One keynote speaker dubbed 2012 &#8220;the year of apps&#8221;. In other words, the Hadoop infrastructure is now mature enough to be &#8220;enterprise ready&#8221;; therefore innovation should focus on using Hadoop to derive business value.</p>
<p>Also, the &#8220;data scientist&#8221; role is gaining prominence. Jeff Hammerbacher pioneered this role at Facebook. Companies across many industries are looking for similarly skilled people to make sense of the data deluge that&#8217;s happening everywhere. This role requires some combination of expertise in computer science, statistics, social science, natural science, business, and other areas. AMP Lab is rooted in computer science and statistics, and depending on individual students&#8217; interests, also reasonably literate in social science/natural science/business areas. I certainly found it motivational to see the countless ways that the Lab&#8217;s expertise can be applied to create business value, help improve the quality of life, and even discover new knowledge.</p>
<p>NetApp and Cloudera announced a partnership in providing the NetApp Open Solution for Hadoop running on Cloudera Distribution including Apache Hadoop. It&#8217;s great to see increased collaboration between our industry partners beyond knowledge sharing through the Lab.</p>
<p>I gave a joint talk on &#8220;Hadoop and Performance&#8221; with Todd Lipcon, our colleague from Cloudera. The talk was well received, and folks are looking forward to our imminent release of the &#8220;Cloudera Hadoop workload suite&#8221;. One could say that the focus of typical enterprises should be either profit (monetary and societal), or arguing that &#8220;my-performance-is-better&#8221;. Thus, it remains the academic community&#8217;s responsibility and opportunity to develop scientific design and performance evaluation methodologies.</p>
<p>No travel notes this time.</p>
]]></content:encoded>
			<wfw:commentRss>http://amplab.cs.berkeley.edu/2011/11/29/an-amp-blab-about-some-recent-system-conferences-part-3-hadoop-world-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

