Got a Minute? Spin up a Spark cluster on your laptop with Docker.


Apache Spark and Shark have made data analytics faster to write and faster to run on clusters. This post will teach you how to use Docker to quickly and automatically install, configure and deploy Spark and Shark as well. How fast? When we timed it, we found it took about 42 seconds to start up a pre-configured cluster with several workers on a laptop. You can use our Docker images to create a local development or test environment that’s very close to a distributed deployment.

Docker provides a simple way to create and run self-sufficient Linux containers that can be used to package arbitrary applications. Its main advantage is that the very same container that runs on your laptop can be executed in the same way on a production cluster. In fact, Apache Mesos recently added support for running Docker containers on compute nodes.

Docker runs on any standard 64-bit Linux distribution with a recent kernel but can also be installed on other systems, including Mac OS, by adding another layer of virtualization. First run the Docker Hello World! example to get started.

Running Spark in Docker

The next step is to clone the git repository that contains the startup scripts.

$ git clone -b blogpost git@github.com:amplab/docker-scripts.git

This repository contains deploy scripts and the sources for the Docker image files, which can be easily modified. (Contributions from the community are welcome, just send a pull request!)

Fast track: deploy a virtual cluster on your laptop

Start up a Spark 0.8.0 cluster and fall into the Spark shell by running

$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c

and get a Spark cluster with two worker nodes and HDFS pre-configured. During the first execution Docker will automatically fetch container images from the global repository, which are then cached locally.

Further details

Running the deploy script without arguments shows command line options.

$ sudo ./docker-scripts/deploy/deploy.sh
usage: ./docker-scripts/deploy/deploy.sh -i <image> [-w <#workers>] [-v <data_directory>] [-c]

  image:    spark or shark image from:
                 amplab/spark:0.7.3  amplab/spark:0.8.0
                 amplab/shark:0.7.0  amplab/shark:0.8.0

The script either starts a standalone Spark cluster or a standalone Shark cluster with a given number of worker nodes. Hadoop HDFS services are started as well. Since services depend on a properly configured DNS, one container will automatically be started with a DNS forwarder. All containers are also accessible via ssh using a pre-configured RSA key.

If you want to make a directory on the host accessible to the containers — say to import some data into Spark — just pass it with the -v option. This directory is then mounted on the master and worker containers under /data.
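
For example, assuming you deployed with -v /home/user/data (a hypothetical host path) and placed a file named mydata.txt there, the Spark shell can then read it as a local file under /data on every node:

scala> val local = sc.textFile("file:///data/mydata.txt")
scala> local.count()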

Both the Spark and Shark shells are started in separate containers. The shell container is started from the deploy script by passing the -c option, but it can also be attached later.

So let’s start up a Spark 0.8.0 cluster with two workers and connect to the Spark shell right away.

$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c

You should see something like this:

*** Starting Spark 0.8.0 ***
...
***********************************************************************
connect to spark via:       sudo docker run -i -t -dns 10.0.3.89 amplab/spark-shell:0.8.0 10.0.3.90

visit Spark WebUI at:       http://10.0.3.90:8080/
visit Hadoop Namenode at:   http://10.0.3.90:50070
***********************************************************************

Once the shell is up, let’s run a small example:

scala> val textFile = sc.textFile("hdfs://master:9000/user/hdfs/test.txt")
scala> textFile.count()
scala> textFile.map({line => line}).collect()
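
As a slightly richer sketch over the same file, a word count needs only standard RDD operations available in the Spark shell:

scala> val words = textFile.flatMap(line => line.split(" "))
scala> val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(10)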

After you are done you can terminate the cluster from the outside via

$ sudo docker-scripts/deploy/kill_all.sh spark
$ sudo docker-scripts/deploy/kill_all.sh nameserver

which will kill all Spark and nameserver containers.

Notes

If you are running Docker via Vagrant on Mac OS, make sure to increase the memory allocated to the virtual machine to at least 2 GB.

There is more to come

Besides offering lightweight, isolated execution of worker processes inside containers via LXC, Docker also provides a kind of combined Git and GitHub for container images. Watch the AMPLab Docker account for updates.

Transactions, High Availability, and Scalability: Friends or Foes?

Peter Bailis

Many of today’s data-intensive applications at scale require always-on operation and low-latency query execution, which has led to quite a shake-up in the data management space. While the battle between NoSQL, NewSQL, and “old SQL” involves many factors including market forces and data models, one factor is fundamental: some semantics are fundamentally at odds with the goals of always-on operation and low-latency execution. Since at least 1985, the database community has recognized that the gold standard of ACID database transactions—serializability, providing single-system programmability—is unachievable with these goals: a system facing network partitions between servers cannot both provide serializability and guarantee a response from every server (what we’ll call “high availability”). In the absence of partitions, this high availability corresponds to the ability to provide low-latency operation, especially across datacenters. Even the latest and greatest production-ready serializable databases still top out at roughly 20 writes/record/second due to fundamental coordination bottlenecks.

ACID in the Real World

In recent work in the AMPLab and at Berkeley, we asked the question: are all transactional models incompatible with the goal of high availability? With the help of Alan Fekete, a database isolation whiz visiting us from the University of Sydney, we looked at transactional models in the real world and were surprised to find that, in fact, many popular databases didn’t even offer serializability as an option to end-users at all! In most cases, databases offered “weaker” models like Read Committed and Snapshot Isolation by default (sometimes labeling these models as “serializable”!). There is good reason for this phenomenon: these levels offer increased concurrency and often fewer system-induced aborts due to deadlocks and contention. However, these isolation models also expose a range of concurrency anomalies to end-users. Take a look at the summary in the chart below:

[Chart: summary of the isolation levels offered by popular databases (HAT isolation survey)]

Enter Highly Available Transactions 

Given that many real-world transactions aren’t serializable, we set out to understand which are achievable with high availability (i.e., as Highly Available Transactions, or HATs). To answer this question, we had to distinguish between unavailable implementations and unavailable semantics: for example, a lock-based implementation of Read Committed is unavailable under partial failures, but the accepted definition of Read Committed is achievable if we design concurrency control with availability as a top priority (e.g., for Read Committed, a database can buffer client writes until commit). Surprisingly, we found that many useful models like ANSI Repeatable Read and atomic multi-item writes and reads are also achievable with high availability (see Figure 2 of our paper for more details). We also considered a modified but important availability model in which users maintain affinity with a set of servers; we prove that this “sticky availability” is required for common guarantees like “Read Your Writes” and the increasingly popular Causal Consistency.
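
To make the “buffer writes until commit” idea concrete, here is a minimal sketch (not the paper’s implementation) of a client that provides Read Committed without coordination: uncommitted writes stay in a local buffer and only reach the replicas at commit time, so no other transaction ever observes dirty data. The Replica trait is a hypothetical last-writer-wins key-value interface.

import scala.collection.mutable

// Hypothetical replica interface: a simple key-value store.
trait Replica {
  def read(key: String): Option[String]
  def write(key: String, value: String): Unit
}

class ReadCommittedTxn(replicas: Seq[Replica]) {
  private val buffer = mutable.Map.empty[String, String]  // uncommitted writes

  // Reads see this transaction's own buffered writes, otherwise a replica's
  // committed value; other transactions' uncommitted data is never visible.
  def read(key: String): Option[String] =
    buffer.get(key).orElse(replicas.head.read(key))

  // Writes stay local until commit, so other clients cannot see dirty data.
  def write(key: String, value: String): Unit = buffer(key) = value

  // Only committed values ever reach the replicas, and no locks are held,
  // so operations never block waiting on an unreachable server.
  def commit(): Unit = {
    for ((k, v) <- buffer; r <- replicas) r.write(k, v)
    buffer.clear()
  }
}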

To determine the benefit of achieving high availability, we both analyzed the fundamental costs of coordination and benchmarked a prototype HAT implementation. The fundamental cost of giving up availability is that each operation requires at least one round-trip time to a coordination service (e.g., a lock manager or optimistic concurrency control validation service) to safely complete. In our measurements on EC2, this equated to a 10-100x increase in latency compared to HAT datacenter-local traffic. While our goal was not to design optimal implementations of HAT semantics, we were able to measure the overhead of a simple HAT implementation of Read Committed and atomic reads and writes (“Monotonic Atomic View”). A HAT system is able to scale linearly because replicas can serve requests without coordination: in our deployments, the HAT system was able to utilize all available replicas instead of a single replica for each data item (e.g., with two replicas, the HAT system achieved 2x the throughput).
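
The scaling argument can be sketched in a few lines, reusing the hypothetical Replica trait from the sketch above (LockService is likewise illustrative, not from the paper): a coordinated operation pays a round trip to a coordination service before touching any data, and every operation on a key funnels through that service, while a HAT operation can go straight to any reachable replica, so adding replicas adds servable capacity.

trait LockService {
  def acquire(key: String): Unit
  def release(key: String): Unit
}

// Coordinated read: an extra round trip to the lock service on every operation.
def coordinatedRead(lock: LockService, replica: Replica, key: String): Option[String] = {
  lock.acquire(key)
  try replica.read(key) finally lock.release(key)
}

// HAT read: no coordination, so any replica can answer; with N replicas,
// read capacity for a key grows roughly N-fold.
def hatRead(replicas: Seq[Replica], key: String): Option[String] =
  replicas(scala.util.Random.nextInt(replicas.size)).read(key)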

A detailed report of our findings is available online, and we will present our results at VLDB 2014 in China next year.

Takeaways and a View of the Future 

Given that transactional functionality is often thought to be “too expensive” for scalable data stores, it’s perhaps surprising that so many of the properties offered by real-world transactional data stores are actually achievable with availability and therefore horizontal scalability. Indeed, unavailable implementations may surface isolation anomalies less frequently than their HAT counterparts (i.e., in expectation, they may provide more intuitive behavior), but the worst-case behavior is the same. An isolation model like Read Committed is difficult to program against and is, in many ways, no easier than, say, eventual consistency. However, many of the techniques for programming against Read Committed databases (e.g., “SELECT FOR UPDATE”) also apply to, or have parallels in, NoSQL stores.

What’s next for HAT databases? As we discuss in the paper, we think many applications can get away with HAT semantics, but, in some cases, coordination will be required. Understanding this trade-off will be key to the programmability of future “coordination-free” data stores; after all, users ultimately care about the consistency of their application, not necessarily the specific form of isolation or low-level read/write data consistency offered by their data stores. Based on early results, we are bullish that even traditional OLTP applications can benefit from the techniques we’ve developed, from cheaper secondary indexing and foreign key constraints to full-blown transactional applications that apply coordination only when it is (provably) required. Stay tuned.