AMP BLAB: The AMPLab Blog

SNAP Helps Save A Life

Patterson

When we got started in genomics 3 years ago, I wrote an essay in the New York Times arguing that computer scientists have a lot to bring to the fight against deadly diseases like cancer. (This hypothesis was not universally embraced in all fields.)

The good news is that we have already had a success: using SNAP for infectious disease diagnosis has helped save a life.

There are a number of patients in hospitals with undiagnosed inflammatory reactions and infectious symptoms whose source cannot be identified by existing tests. A generic test to identify all organisms using RNA sequencing has been developed and is being piloted by Dr. Charles Chiu at UCSF. SNAP is critical to the implementation of this test: it rapidly filters all human sequence from the large resulting datasets (up to 100 million sequences), enabling the identification of pathogenic sequence quickly enough to treat patients effectively. In the US, 20,000 people, mostly children, are hospitalized each year with brain-swelling encephalitis, and 60% of the time doctors never find the cause, which can lead to serious consequences.

Joshua Osborn, now 15, outside his home in Cottage Grove, Wis.

This tool was recently used to successfully diagnose and treat Joshua Osborn, a teenager with severe combined immunodeficiency who lives in Wisconsin. He went to the hospital repeatedly and was eventually hospitalized for five weeks without a successful diagnosis. He developed brain seizures and was placed in a medically induced coma. In desperation, his doctors asked his parents for permission to try one more experimental test. His parents agreed, so doctors sampled his spinal fluid and sent it to UCSF for sequencing and analysis.

The speed and accuracy of SNAP helped UCSF quickly filter out the human reads. In just two days they identified a rare infectious bacterium, whose sequences made up only 0.02% of the original 3 million reads. The boy was then treated with antibiotics for 10 days; he awoke and was discharged from the hospital four weeks later. Although our software is only part of this process, without SNAP it would not be practical to perform a general infectious disease test like this, one that does not require first guessing the causative agent. That is why such tests are not yet more broadly available.

Quoting from the UCSF press release, referring indirectly in part to the speed and accuracy of SNAP:

“This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete,” Chiu said.

The abstract and last paragraph of the NEJM article tell the story with more medical accuracy and brevity:

A 14-year-old boy with severe combined immunodeficiency presented three times to a medical facility over a period of 4 months with fever and headache that progressed to hydrocephalus and status epilepticus necessitating a medically induced coma. Diagnostic workup including brain biopsy was unrevealing. Unbiased next-generation sequencing of the cerebrospinal fluid identified 475 of 3,063,784 sequence reads (0.016%) corresponding to leptospira infection. Clinical assays for leptospirosis were negative. Targeted antimicrobial agents were administered, and the patient was discharged home 32 days later with a status close to his premorbid condition. Polymerase-chain-reaction (PCR) and serologic testing at the Centers for Disease Control and Prevention (CDC) subsequently confirmed evidence of Leptospira santarosai infection.

… In summary, unbiased next-generation sequencing coupled with a rapid bioinformatics pipeline provided a clinically actionable diagnosis of a specific infectious disease from an uncommon pathogen that eluded conventional testing for months after the initial presentation. This approach thus facilitated the use of targeted and efficacious antimicrobial therapy.

In a separate communication, Chiu said that in the United States there are about 15,000 cases a year of brain-swelling encephalitis, with 2,000 deaths, and that more than 70% of the deaths go undiagnosed. Assuming doctors are able to obtain actionable diagnoses from the information, SNAP plus the software developed at UCSF to identify the non-human reads (SURPI) could potentially save the lives of hundreds of encephalitis patients annually in the US alone. Worldwide, encephalitis is a huge problem; there are probably about 70,000 diagnosed cases a year with 25,000 deaths. Even 25,000 is certainly a gross underestimate, because most cases worldwide are in rural areas and do not receive hospital care.

Here are links to

  • the New York Times article,
  • the press release from UCSF,
  • the New England Journal of Medicine article that describes the boy’s treatment, and
  • the technical paper in Genome Research that describes the UCSF work that discovered the disease, which discusses the use of SNAP and SURPI and cites our paper.

New job opportunity for a Solutions Architect

massie

The AMPLab participates in the Defense Advanced Research Projects Agency (DARPA) XDATA program, which supports the development of open-source systems for processing, analyzing and visualizing large, imperfect and incomplete data. Many of the XDATA teams rely on the Berkeley Data Analytics Stack (BDAS) as the foundation of their XDATA applications.

The AMPLab is looking for a Solutions Architect to work side-by-side with DARPA teams in application design and deployment on DARPA infrastructure. This is a great opportunity to work with a diverse group of researchers and software engineers in government, academia and industry to solve important real-world problems using open-source software.

The AMPLab has a culture of “casual intensity” which will be familiar to anyone from a tech startup. We work hard and play hard: flexible work hours, work-from-home days, team lunches and retreats twice a year in Tahoe and Santa Cruz. Retreats are attended by our sponsors, who give us feedback on the tough data problems they’d like us to solve. This close connection to government and industry has allowed us to create systems, like Spark, Tachyon, Shark, GraphX and MLbase, with outstanding performance and features. All BDAS components are released under Apache, BSD or similar open-source licenses.

The AMPLab also collaborates with other data science teams across the Berkeley campus, comprising experts in a broad range of specialties including computer science, statistics, data visualization and the social sciences. The Berkeley Institute for Data Science (BIDS) is a new initiative focused on developing innovative partnerships to advance technologies that support advanced data management and data analytic techniques.

This position requires occasional travel to Washington, D.C. for the DARPA XDATA summer workshops. These workshops bring together teams from across the U.S. to build end-to-end solutions to specific data challenges. The AMPLab will, of course, cover all travel expenses. These onsite visits are an important part of creating systems that integrate well. DARPA also brings in guest speakers from other agencies, such as the Pentagon, the DEA and others, to talk about the data challenges they face.

To apply, visit the UC Berkeley Jobs website and do a keyword search for 17573, the Job ID for this position.

Benefiting from Science is a Human Right!

Patterson

I was just at the Global Alliance for Genomics and Health (GA4GH) meeting last week in London, and a person from the regulatory and ethics group quoted the United Nations on the right of people to benefit from scientific progress:

Article 27.1 of The Universal Declaration of Human Rights states:
“Everyone has the right freely to participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits.”

As almost all nations have signed this treaty, it has the power of law.

This declaration is a big deal in genomics because, under the argument of protecting their privacy, some experts would prevent people from benefiting from scientific progress by preventing them from sharing their genomes with scientists. These experts argue that people don’t understand what they are doing, and hence must be protected from themselves. Article 27.1 is a counterargument: participating in and benefiting from scientific progress is a human right, and one must be careful not to trample human rights.

PLANET: Making Progress with Commits in Unpredictable Environments

gpang

Recent trends in databases, such as multi-tenancy and cloud databases, can increase the delays and the variation in transaction response times. For geo-replicated databases, the issue is even worse, since the network delays between data centers can be very unpredictable. Developers have two options when transactions are delayed unpredictably: they may choose to wait indefinitely for the transaction to complete, or they may time out early and be uncertain of the transaction outcome. Neither outcome is desirable, so we propose a new transaction programming model to help with this situation.

We propose PLANET (Predictive Latency-Aware NEtworked Transactions), a new transaction programming model that helps developers write flexible transactions. Using PLANET, developers can write transactions that better deal with the variation in latencies possible in modern database environments. PLANET gives the developer more information about the progress and stages of a transaction, so the developer can define how the transaction should behave in different situations. By exposing these details, PLANET enables developers to express user-defined commits for their transactions. PLANET also exposes a commit-likelihood calculation for transactions, which can be used for user-defined commits as well as admission control. Using these features, systems can provide responsive transactions, as well as better system utilization for cloud databases.
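
To make user-defined commits concrete, here is a minimal hypothetical sketch in Scala. Every name in it (Txn, TxnStatus, submit) is invented for illustration and is not PLANET's actual interface; it only shows the shape of a commit handler that reacts to a transaction's stage and predicted commit likelihood:

    // Hypothetical sketch of a PLANET-style user-defined commit.
    // Every name below is invented for illustration, not PLANET's real API.
    case class TxnStatus(accepted: Boolean, commitLikelihood: Double)

    trait Txn {
      def debit(account: String, amount: Int): Unit
      def credit(account: String, amount: Int): Unit
      // Submit the transaction; once the timeout expires, hand the current
      // stage and predicted commit likelihood to the user-defined handler.
      def submit(timeoutMs: Int)(handler: TxnStatus => Unit): Unit
    }

    def transfer(txn: Txn, from: String, to: String, amount: Int): Unit = {
      txn.debit(from, amount)
      txn.credit(to, amount)
      txn.submit(timeoutMs = 500) {
        case TxnStatus(true, p) if p > 0.95 =>
          println("Very likely to commit: acknowledge the user immediately")
        case TxnStatus(true, _) =>
          println("Accepted but uncertain: keep waiting for the final outcome")
        case TxnStatus(false, _) =>
          println("Not yet accepted: report an unknown outcome instead of hanging")
      }
    }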

More information about PLANET can be found at our PLANET website. There, we have a simple interactive visualization to demonstrate how some of the features of PLANET work. We also have examples of several use cases of the transaction programming model.

[Figure: the stages of a PLANET transaction]

New BDAS Features

jegonzal

In this post we briefly describe four of the newest features in the Berkeley Data Analytics Stack (BDAS):

1. GraphX: large-scale interactive graph analytics
2. BlinkDB: real-time analytics through bounded approximations
3. MLbase: scalable machine learning library
4. Tachyon: reliable data sharing at memory speed across cluster frameworks

GraphX

GraphX is a distributed interactive graph computation system integrated with Spark. GraphX exposes a new API that treats tables and graphs as composable objects, enabling users to easily construct graphs from tables, join graphs and tables, and apply iterative graph algorithms (e.g., PageRank and community detection) using Pregel-like operators. The GraphX system combines recent advances in graph processing systems with distributed join optimizations and lineage to efficiently execute distributed graph computation in the context of a fully fault-tolerant data-parallel platform. On top of GraphX we have built a collection of useful tools for graph analytics.
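
For a flavor of the API, here is a minimal sketch of running PageRank with GraphX from Scala; the input path is a placeholder, and exact package and method names may differ across GraphX versions:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader

    val sc = new SparkContext("local[4]", "graphx-pagerank")
    // Build a graph from an edge-list file ("srcId dstId" per line; path assumed)
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    // Iterate PageRank until the per-vertex change drops below the tolerance
    val ranks = graph.pageRank(0.001).vertices
    // Show the five highest-ranked vertices
    ranks.top(5)(Ordering.by(_._2)).foreach(println)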

BlinkDB

BlinkDB is a large-scale data warehouse system built on Shark and Spark that aims to achieve real-time (i.e., sub-second) query response times for a variety of SQL-based aggregation queries (augmented by a time and/or error bound) on massive amounts of data. This is enabled by not looking at all the data, but rather operating on statistical samples of the underlying datasets. More precisely, BlinkDB gives the user the ability to trade the accuracy of the results against the time it takes to compute queries. The challenge is to ensure that query results are still meaningful, even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that tell us the quality of the sampled results.
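
As an illustration, the sketch below issues two bounded queries in roughly the syntax shown in the BlinkDB paper; the table and column names are made up, and using a Shark-style context with sql2rdd as the entry point is an assumption:

    // BlinkDB-style bounded queries (table/column names are made up, and
    // sharkContext/sql2rdd is an assumed Shark-style entry point).
    // Bound the relative error at a given confidence level:
    val approxAvg = sharkContext.sql2rdd(
      "SELECT avg(sessionTime) FROM sessions " +
      "WHERE city = 'San Francisco' " +
      "ERROR 0.1 CONFIDENCE 95.0%")
    // ...or bound the response time and accept whatever accuracy that buys:
    val fastCount = sharkContext.sql2rdd(
      "SELECT count(*) FROM sessions WITHIN 2 SECONDS")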

MLbase

The MLbase project aims to provide fast, scalable, and easy-to-use machine learning on top of Spark, and is composed of three core components. MLlib, the first production-ready component of the MLbase project, is the standard library for machine learning on Spark. Released as part of Spark 0.8.0, MLlib includes fast, scalable algorithms for classification, regression, clustering, collaborative filtering, and convex optimization. The second component, MLI, is an API for distributed machine learning; by offering an API familiar to machine learning developers, MLI provides a DSL for developing new machine learning algorithms on top of Spark. The final component, the MLbase Optimizer, attempts to automate the process of model selection and ML pipeline construction, and is an area of active research within the AMPLab.
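
As a quick taste of MLlib, here is a minimal sketch of training a classifier against the Spark 0.8-era API, where LabeledPoint takes a plain Array[Double] of features; the input path and file format are assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext("local[4]", "mllib-example")
    // Parse "label f1 f2 ..." lines into labeled points (path/format assumed)
    val points = sc.textFile("hdfs:///data/points.txt").map { line =>
      val nums = line.split(' ').map(_.toDouble)
      LabeledPoint(nums.head, nums.tail)
    }.cache()
    // Train a logistic regression model with 100 iterations of gradient descent
    val model = LogisticRegressionWithSGD.train(points, 100)
    // Score a new point
    val prediction = model.predict(Array(1.0, 0.5, -0.2))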

Tachyon

Tachyon is a fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. Tachyon achieves memory speed and fault tolerance by using memory aggressively and leveraging lineage information. It caches working-set files in memory, enabling different jobs, queries and frameworks to access cached files at memory speed. Tachyon thus avoids going to disk to load datasets that are frequently read.
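
Since Tachyon appears to frameworks as a filesystem URI, using it from Spark amounts to pointing a job at a tachyon:// path. A minimal sketch, assuming the Tachyon client is on the classpath and a master at the host and port shown:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "tachyon-example")
    // Read input through Tachyon rather than straight from disk; repeated
    // reads are then served from cluster memory (host/port/paths assumed)
    val lines = sc.textFile("tachyon://master:19998/logs/input.txt")
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    // Write results back through Tachyon so other frameworks can share them
    counts.saveAsTextFile("tachyon://master:19998/logs/wordcounts")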

Learn more at Strata

At Strata 2014 we will be hosting AMP Camp 4, with hands-on exercises (available online as well) to help people get started using BDAS and the exciting new features we have been developing.

You can register here for AMP Camp 4 and the 2014 Strata Conference. Use the code AMP20 when registering to get 20% off your ticket price. The conference offers one day of tutorials (Feb 11) and two days of presentations (Feb 12-13). Please make sure to select the tutorials day if you wish to join us at the AMP Camp.

Large scale data analysis made easier with SparkR

shivaram

R is a widely used statistical programming language that supports a variety of data analysis tasks through extension packages. In fact, a recent survey of data scientists showed that R is the most frequently used tool other than SQL databases. However, data analysis in R is limited: the runtime is single-threaded and can only process data sets that fit on a single machine.

In an effort to enable large scale data analysis from R, we have recently released SparkR. SparkR is an R package that provides a lightweight frontend for using Spark from R. SparkR allows users to create and transform RDDs in R and to interactively run jobs from the R shell on a Spark cluster. You can try out SparkR today by installing it from our GitHub repo.

Some of the key features of SparkR include:

RDDs as Distributed Lists: SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD. In addition to lapply, SparkR also allows closures to be applied to every partition using lapplyPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.

Serializing closures: SparkR automatically serializes the variables needed to execute a function on the cluster. For example, if you use some global variables in a function passed to lapply, SparkR will automatically capture those variables and copy them to the cluster.

Using existing R packages: SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster.

Putting these features together in R can be very powerful. For example, the code to compute logistic regression using gradient descent is listed below. In this example, we read a file from HDFS in parallel using Spark and run a user-defined gradient function in parallel using lapplyPartition. Finally, the weights from the different machines are accumulated using reduce.

    pointsRDD <- readMatrix(sc, "hdfs://myfile")
    # Initialize the weights randomly; D is the number of features
    weights <- runif(n = D, min = -1, max = 1)
    # Logistic gradient over one partition: column 1 is the label Y,
    # the remaining columns are the features X
    gradient <- function(partition) {
        Y <- partition[, 1]; X <- partition[, -1]
        t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
    }
    for (i in 1:10) {
        # Compute per-partition gradients in parallel, then sum them with "+"
        weights <- weights - reduce(lapplyPartition(pointsRDD, gradient), "+")
    }

Right now, SparkR works well for parallelizable algorithms like gradient descent, but it requires users to decide which parts of the algorithm can be run in parallel. In the future, we hope to provide direct access to large scale machine learning algorithms by integrating with Spark’s MLlib. More examples and details about SparkR can be found at http://amplab-extras.github.io/SparkR-pkg.

Top 5 most influential works in data science?

Patterson

As part of the Data-Driven Discovery Investigator Competition, the Gordon and Betty Moore Foundation asks applicants for

five references to the most influential work in data science in the applicant’s view. This is distinct from the bio-sketch references and will not be a factor in the Foundation’s decision-making. This information will help the Foundation better understand the influential ideas related to data-driven discovery and data science.

After talking to others in the lab, below is my list, sorted by number of citations according to Google Scholar. I’d love to hear comments on these and/or suggestions of others I missed.

  1. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., … & Grafham, D. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921. (16,000 citations)

The Human Genome Project turned the secret of life into digital information. On January 14, 2014, Illumina announced a new sequencing machine that can do the wet-lab processing of a genome for $1,000. This price is widely believed to be a tipping point, and soon millions will have their genomes sequenced. At 25 to 250 gigabytes per genome, genetics is now Big Data.

  2. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. (9,200 citations)

A simple, easy-to-use programming model to process Big Data. It led to the NoSQL movement, Hadoop, many startup companies, and awards for its authors.

  3. Blei, D., Ng, A., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022. (7,300 citations)

LDA allows sets of observations to be explained by unobserved groups. It spawned an entire industry of data-driven discovery for text and image corpora.

  4. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., & Zaharia, M. (2010). A view of cloud computing. Communications of the ACM, 53(4), 50-58. (5,800 citations)

At a time when there was confusion about what cloud computing was, this paper defined cloud computing, explained why it emerged when it did, and listed its challenges and opportunities.

  5. Stoughton, C., Lupton, R. H., Bernardi, M., Blanton, M. R., Burles, S., Castander, F. J., … & Carey, L. (2002). Sloan digital sky survey: Early data release. The Astronomical Journal, 123(1), 485. (2,100 citations)

Aided by computer scientist Jim Gray, astronomers made raw astronomical data available to a much wider community. It led to the crowd-sourcing of astronomy through projects like Galaxy Zoo, so that now anyone can help with astronomy research.