Aggressive Data Skipping for Querying Big Data

As data volumes continue to expand, analytics approaches that require exhaustively scanning data sets become untenable. For this reason, we have been developing data organization techniques that make it possible to avoid looking at large volumes of irrelevant data. Our work in this area, which we call “Aggressive Data Skipping,” was recently picked up by O’Reilly Radar: “Data Today: Behind on Predictive Analytics, Aggressive Data Skipping + More.” In this post, I give a brief overview of the approach and provide references to more detailed publications.

Data skipping is an increasingly popular technique used in modern analytics databases, including IBM DB2 BLU, Vertica, Spark and many others. The idea is very simple: big data files are partitioned into fairly small blocks of, say, 10,000 rows. For each such block we store some metadata, e.g., the min and max of each column. Before scanning a block, a query can first check the metadata and decide whether the block could possibly contain records relevant to the query. If the metadata indicates that no such records are contained in the block, then the block does not need to be read, i.e., it can be skipped altogether.
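
To make the idea concrete, here is a minimal sketch of min/max skipping in Python. The names (BlockMetadata, might_contain) are ours, purely for illustration; real systems store and check this metadata in their own internal formats.

```python
# A minimal sketch of min/max data skipping (illustrative only).

from dataclasses import dataclass

@dataclass
class BlockMetadata:
    min_val: int   # minimum value of the column within the block
    max_val: int   # maximum value of the column within the block

def might_contain(meta: BlockMetadata, lo: int, hi: int) -> bool:
    """Return False only when the block certainly has no row whose
    column value lies in [lo, hi], so the block can be skipped."""
    return not (meta.max_val < lo or meta.min_val > hi)

# Example: the query filter is 50 <= col <= 60; blocks whose [min, max]
# range does not overlap [50, 60] are skipped without being read.
blocks = [BlockMetadata(0, 40), BlockMetadata(45, 70), BlockMetadata(80, 99)]
to_scan = [i for i, m in enumerate(blocks) if might_contain(m, 50, 60)]
print(to_scan)  # -> [1]; blocks 0 and 2 are skipped
```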

In our work we focus on maximizing the amount of data that can be skipped (hence the name “Aggressive Data Skipping”). The key to our approach is Workload Analysis. That is, we observe the queries that are presented to the system over time, and then make partitioning decisions based on what is learned from those observations. Our workload-driven, fine-grained partitioning framework re-orders the rows at data loading time.

In order to maximize the chance of data skipping, our research answers the following questions:

  • What partitioning method is appropriate for generating fine-grained blocks?
  • What kind of (concise) metadata can we store to support arbitrary filters (e.g., string matching or UDF filters)?

As shown in the figure below, our approach uses the following “W-A-R-P” steps (a toy code sketch follows the list):

The Partitioning Framework

  • Workload analysis: We extract the frequently recurring filter patterns, which we call the features, from the workload. The workload can be a log of past ad-hoc queries or a collection of query templates from which daily reporting queries are generated.
  • Augmentation: For each row, we compute a bit vector based on the features and augment the row with this vector.
  • Reduce: We group together the rows with the same bit vectors, since the partitioning decision will be solely based on the bit vectors rather than the actual data rows.
  • Partition: We run a clustering algorithm on the bit vectors and generate a partitioning scheme. The rows will be routed to their destination partitions guided by this partitioning scheme.
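
The toy sketch below walks through these four steps on a handful of rows. All names here are hypothetical, and the final step is deliberately naive: the actual framework clusters the bit vectors using the objective described in the paper, rather than giving each distinct vector its own partition.

```python
# A toy sketch of the W-A-R-P pipeline (illustrative, not the real system).

from collections import Counter

# Workload analysis (W): suppose these frequently recurring filter
# predicates ("features") were extracted from the query log.
features = [
    lambda row: row["country"] == "US",        # feature 0
    lambda row: row["revenue"] > 1000,         # feature 1
    lambda row: row["category"] == "books",    # feature 2
]

def augment(row):
    """Augmentation (A): compute the bit vector recording which
    features the row satisfies."""
    return tuple(int(f(row)) for f in features)

rows = [
    {"country": "US", "revenue": 1500, "category": "books"},
    {"country": "DE", "revenue": 200,  "category": "toys"},
    {"country": "US", "revenue": 50,   "category": "books"},
]

# Reduce (R): partitioning decisions depend only on the distinct bit
# vectors and their counts, not on the full data rows.
vector_counts = Counter(augment(r) for r in rows)

# Partition (P): naively give each distinct vector its own partition;
# the real framework clusters vectors to trade off partition size
# against skipping opportunity.
scheme = {vec: pid for pid, vec in enumerate(vector_counts)}
partitions = {}
for r in rows:
    partitions.setdefault(scheme[augment(r)], []).append(r)
# partitions now maps partition id -> rows with identical feature vectors
```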

After we have partitioned the data, we store a feature bit vector for each partition as metadata. The following figure illustrates how data skipping works during query execution.

Data skipping during query execution

When a query arrives, our system first checks which features are applicable for data skipping. With this information, the query processor then goes through the partition-level metadata (i.e., the bit vectors) and decides which partitions can be skipped. This process can work in conjunction with existing min/max-based data skipping.
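
As a rough illustration of this query-time check, the sketch below stores, for each partition, the OR of its rows’ feature bit vectors, and skips a partition whenever the query’s filter implies a feature that no row in the partition satisfies. Again, the names and the metadata layout are ours, for illustration only.

```python
# A minimal sketch of feature-based partition skipping at query time.

features = [
    lambda r: r["country"] == "US",     # feature 0
    lambda r: r["revenue"] > 1000,      # feature 1
]

def partition_metadata(partition_rows):
    """Per-partition metadata: bit i is set iff at least one row in
    the partition satisfies feature i."""
    return tuple(int(any(f(r) for r in partition_rows)) for f in features)

def can_skip(meta_vec, implied_feature_ids):
    """If the query's filter implies feature i (every matching row must
    satisfy it) and no row in the partition does (bit i is 0), the
    partition holds no matching rows and can be skipped unread."""
    return any(meta_vec[i] == 0 for i in implied_feature_ids)

p0 = [{"country": "US", "revenue": 1500}]   # satisfies features 0 and 1
p1 = [{"country": "DE", "revenue": 200}]    # satisfies neither
metadata = [partition_metadata(p) for p in (p0, p1)]

# The query filter "country = 'US' AND revenue > 2000" implies both
# features, so implied_feature_ids = [0, 1]; p1 is skipped unread.
print([can_skip(m, [0, 1]) for m in metadata])  # -> [False, True]
```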

We prototyped this framework on Shark, and our experiments with TPC-H and a real-world dataset show speedups of 2x to 7x. An example result from the TPC-H benchmark (measuring average query response time over 80 TPC-H queries) is shown below.

Query Response Time

For more technical details and results, please read our SIGMOD ’14 paper, or, if you hate formalism and equations, see the demo we gave at VLDB ’14. Feel free to send an email to liwen@cs.berkeley.edu with any questions or comments on this project.

Big Data, Hype, the Media and Other Provocative Words to Put in a Title

I’ve found myself engaged with the Media recently, first in the context of an “Ask Me Anything” (AMA) on reddit.com http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan/ (a fun and engaging way to spend a morning), and then for an interview that has been published in the IEEE Spectrum.

The latter process was disillusioning. Well, perhaps a better way to say it is that I didn’t harbor many illusions about science and technology journalism going in, and the process left me with even fewer.

The interview is here:  http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts

Read the title and the first paragraph and attempt to infer what’s in the body of the interview. Now go read the interview and see what you think about the choice of title.

Here’s what I think.

The title contains the phrase “The Delusions of Big Data and Other Huge Engineering Efforts”. It took me a moment to realize that this was the title that had been placed (without my knowledge) on the interview I did a couple of weeks ago. Anyone who knows me, or who’s attended any of my recent talks, knows that I don’t feel that Big Data is a delusion at all; rather, it’s a transformative topic, one that is changing academia (e.g., for the first time in my 25-year career, a topic has emerged that almost everyone in academia feels is on the critical path for their sub-discipline), and is changing society (most notably, the micro-economies made possible by learning about individual preferences and then connecting suppliers and consumers directly are transformative). But most of all, from my point of view, it’s a *major engineering and mathematical challenge*, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases and computer systems.

I.e., the whole point of my shtick for the past decade is that Big Data is a Huge Engineering Effort and that that’s no Delusion. Imagine my dismay at a title that said exactly the opposite.

The next phrase in the title is “Big Data Boondoggles”. Not my phrase, nor my thought. I don’t talk that way. Moreover, I really don’t see anything wrong with anyone gathering lots of data and trying things out, including trying out business models; quite to the contrary. It’s the only way we’ll learn. (Indeed, my bridge analogy from later in the article didn’t come out quite right: I was trying to say that historically it was crucial for humans to start to build bridges, and trains, etc., etc., before they had serious engineering principles in place; the empirical engineering effort had immediate positive effects on humans, and it eventually led to the engineering principles. My point was just that it’s high time we realize that with respect to Big Data we’re now at the “what are the principles?” point in time. We need to recognize that poorly thought-out approaches to large-scale data analysis can be just as costly as bridges falling down. E.g., think of individual medical decision-making, where false positives can lead, and already are leading, to unnecessary surgeries and deaths.)

Next, the first paragraph implies that I think neural-based chips are “likely to prove a fool’s errand”. Not my phrase, nor my thought. I think that it’s perfectly reasonable to explore such chip-building; it’s even exciting. As I mentioned in the interview, I do think that a problem with that line of research is that it puts architecture before algorithms and understanding, and that’s not the way I’d personally do things, but others can beg to differ, and by all means I think that they should follow their instincts.

The interview then proceeds along, with the interviewer continually trying to get me to express black-and-white opinions about issues where the only reasonable response is “gray”, and with my overall message that Big Data is Real but that It’s a Huge Engineering Challenge Requiring Lots of New Ideas and a Few Decades of Hard Work repeatedly getting lost, though I (valiantly, I hope) resisted. When we got to the Singularity and quantum computing, though—areas where no one in their right mind would imagine that I’m an expert—I despaired that the real issues I was trying to discuss were not really the point of the interview, and I was glad when the hour was over.

Well, at least the core of the article was actually me in my own words, and I’m sure that anyone who actually read it realized that the title was misleading (at best).

But why should an entity such as the IEEE Spectrum allow an article to be published where the title is a flat-out contradiction to what’s actually in the article?

I can tell you why: It’s because this title and this lead-in attracted an audience.

And it was precisely this issue that I alluded to in my response to the first question—i.e., that the media, even the technology media that should know better, has become a hype-creator and a hype-amplifier. (Not exactly an original thought; I know…). The interviewer bristled, saying that the problem is that academics put out press releases that are full of hype and the poor media types don’t know how to distinguish the hype from the truth. I relented a bit. And, sure, he’s right, there does seem to be a growing tendency among academics and industrial researchers to trumpet their results rather than just report them.

But I didn’t expect to become a case in point. Then I saw the title and I realized that I had indeed become a case in point. I.e., here we have a great example of exactly what I was talking about—the media willfully added some distortion and hype to a story to increase the readership. Having the title be “Michael Jordan Says Some Reasonable, But Somewhat Dry, Academic, Things About Big Data” wouldn’t have attracted any attention.

(Well “Michael Jordan” and “Big Data” would have attracted at least some attention, I’m afraid, but you get my point.)

(As for “Maestro”, usually drummers aren’t referred to as “Maestros”, so as far as that bit of hyperbole goes I’m not going to complain… :-) )

Anyway, folks, let’s do our research, try to make society better, enjoy our lives and forgo the attempts to become media darlings. As for members of the media, perhaps the next time you consider adding that extra dollop of spin or hype… Please. Don’t.

Mike Jordan