How to Build A Bad Research Center


The AMPLab is part of a Berkeley tradition of creating 5-year multidisciplinary projects that build prototypes to demonstrate the project vision, rely on biannual retreats for feedback, and use open, shared space to inspire collaboration.

After being involved in a dozen centers over nearly 40 years, I decided to capture my advice on building and running research centers. Following the precedent of my past efforts “How to Give a Bad Talk” and “How to Have a Bad Career,” I just finished a short technical paper entitled “How to Build a Bad Research Center.”

As a teaser, below are my Eight Commandments to follow to build a bad research center:

  1. Thou shalt not mix disciplines in a center. It is difficult for people from different disciplines to talk to each other, as they don’t share a common culture or vocabulary.  Thus, multiple disciplines waste time, and therefore precious research funding. Instead, remain pure.
  2. Thou shalt expand centers. Expansion is measured geographically, not intellectually. For example, in the US the ideal is having investigators from 50 institutions in all 50 states, as this would make a funding agency look good to the US Senate.
  3. Thou shalt not limit the duration of a center. To demonstrate your faith in the mission of the center, you should be willing to promise to work on it for decades. (Or at least until the funding runs out.)
  4. Thou shalt not build a graven prototype. Integrating results in a center-wide prototype takes time away from researchers’ own, more important, private research.
  5. Thou shalt not disturb thy neighbors. Good walls make good researchers; isolation reduces the chances of being distracted from your work.
  6. Thou shalt not talk to strangers. Do not waste time convening meetings to present research to outsiders; following the 8th commandment, reviews of your many papers supply sufficient feedback.
  7. Thou shalt make decisions as a consensus of equals. The US Congress is a sterling example of making progress via consensus.
  8. Thou shalt honor thy paper publishers. Thus, to ensure center success, you must write, write, write and cite, cite, cite. If the conference acceptance rate is 1/X, then obviously you should submit at least X papers, for otherwise chances are that your center will not have a paper at every conference, which is a catastrophe.

Comparing Large Scale Query Engines


A major goal of the AMPLab is to provide high performance for data-intensive algorithms and systems at large scale. Currently, AMP has projects focusing on cluster-scale machine learning, stream processing, declarative analytics, and SQL query processing, all with an eye towards improving performance relative to earlier approaches.

Whenever performance is a stated goal, measuring progress becomes a critical component of innovation. After all, how can one hope to improve that which is not measured [1]?

To meet this need, today we’re releasing a hosted benchmark that compares the performance of several large-scale query engines. Our initial benchmark is based on the well-known 2009 Hadoop/RDBMS benchmark by Pavlo et al. It compares the performance of an analytic RDBMS (Redshift) to Hadoop-based SQL query engines (Shark, Hive, and Impala) on a set of relational queries.
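To give a concrete flavor of the workload, below is a minimal, self-contained Python sketch of the kind of high-selectivity scan the benchmark times. It uses SQLite and an invented toy `rankings` table purely as a stand-in; the actual benchmark issues comparable SQL against Redshift, Shark, Hive, and Impala clusters, and the exact schemas and cutoffs are described in the write-up.

```python
import sqlite3
import time

# Illustrative only: a tiny in-memory table standing in for the benchmark's
# web-rankings dataset. Column names and the rank cutoff are assumptions for
# the sketch, not the benchmark's exact definitions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [(f"url_{i}", i % 1000) for i in range(100_000)],
)

# High-selectivity scan: return only the small fraction of rows whose
# pageRank exceeds a cutoff, and measure the response time.
scan_query = "SELECT pageURL, pageRank FROM rankings WHERE pageRank > ?"

start = time.perf_counter()
rows = conn.execute(scan_query, (990,)).fetchall()
elapsed = time.perf_counter() - start
print(f"{len(rows)} rows returned in {elapsed:.4f}s")
```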

The write-up includes performance numbers for a recent version of each framework. An example result, which compares response time of a high-selectivity scan, is given below:

[Figure: Query 1A — response time of a high-selectivity scan across the benchmarked frameworks]

Because these frameworks are all undergoing rapid development, we hope this will provide a continually updated yardstick with which to track improvements and optimizations.

To make the benchmark easy to run frequently, we have hosted it entirely in the cloud: We’ve made all datasets publicly available in Amazon’s S3 storage service and included turn-key support for loading the data into hosted clusters. This allows individuals to verify the results or even run their own queries. It is entirely open source, and we invite people to fork it, improve it, and contribute new frameworks and updates. Over time, we hope to extend the workload to include advanced analytics like machine learning and graph processing.
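As a rough illustration of pulling down the public datasets, here is a hedged Python sketch using boto3. The bucket and prefix names below are placeholders, not the benchmark’s actual S3 locations; see the benchmark write-up for those.

```python
import boto3

# Hypothetical bucket and prefix -- substitute the actual locations listed in
# the benchmark write-up. This assumes standard AWS credentials are configured.
BUCKET = "example-benchmark-bucket"   # placeholder, not the real bucket
PREFIX = "rankings/"                  # placeholder prefix for one dataset

s3 = boto3.client("s3")

# List the objects under the prefix and download each one locally.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    local_name = key.replace("/", "_")
    print(f"downloading s3://{BUCKET}/{key} -> {local_name}")
    s3.download_file(BUCKET, key, local_name)
```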

One thing to keep in mind: performance isn’t everything. Capabilities, maturity, and stability are also very important; the best performing system may not be the most useful. Unfortunately, these other characteristics are also harder to measure!

We hope you check out the benchmark; it is hosted on the AMP website and will be updated regularly.