Got a Minute? Spin up a Spark cluster on your laptop with Docker.


Apache Spark and Shark have made data analytics faster to write and faster to run on clusters. This post shows how to use Docker to install, configure, and deploy Spark and Shark just as quickly. How fast? When we timed it, a pre-configured cluster with several worker nodes started up on a laptop in about 42 seconds. You can use our Docker images to create a local development or test environment that closely matches a distributed deployment.

Docker provides a simple way to create and run self-sufficient Linux containers that can be used to package arbitrary applications. Its main advantage is that the very same container that runs on your laptop can be executed in the same way on a production cluster. In fact, Apache Mesos recently added support for running Docker containers on compute nodes.

Docker runs on any standard 64-bit Linux distribution with a recent kernel, but it can also be installed on other systems, including Mac OS, by adding another layer of virtualization. To get started, first run Docker's Hello World example.
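
For example, once Docker is installed, a one-line container run is enough to verify the setup; the image used here is only an illustration, and any small image will do:

# pulls the ubuntu image on first use, then echoes from inside a container
$ sudo docker run ubuntu /bin/echo 'Hello World'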

Running Spark in Docker

The next step is to clone the git repository that contains the startup scripts.

$ git clone -b blogpost git@github.com:amplab/docker-scripts.git

This repository contains deploy scripts and the sources for the Docker image files, which can be easily modified. (Contributions from the community are welcome, just send a pull request!)
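
As a quick sanity check, you can list the deploy directory; it should contain at least the deploy and teardown scripts used in the rest of this post (the exact contents may vary between versions of the repository):

# should show deploy.sh and kill_all.sh, among other files
$ ls docker-scripts/deploy/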

Fast track: deploy a virtual cluster on your laptop

Start up a Spark 0.8.0 cluster and fall into the Spark shell by running

$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c

and get a Spark cluster with two worker nodes and HDFS pre-configured. During the first execution, Docker will automatically fetch the container images from the global repository; they are then cached locally.
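
If you would rather fetch the images ahead of time, a recent Docker lets you pull them explicitly so that the first deploy does not pause for the download; the image names are the ones listed by the deploy script below:

# pre-fetch the Spark 0.8.0 image (the shell image can be pulled the same way)
$ sudo docker pull amplab/spark:0.8.0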

Further details

Running the deploy script without arguments shows command line options.

$ sudo ./docker-scripts/deploy/deploy.sh
usage: ./docker-scripts/deploy/deploy.sh -i <image> [-w <#workers>] [-v <data_directory>] [-c]

  image:    spark or shark image from:
                 amplab/spark:0.7.3  amplab/spark:0.8.0
                 amplab/shark:0.7.0  amplab/shark:0.8.0

The script either starts a standalone Spark cluster or a standalone Shark cluster with a given number of worker nodes. Hadoop HDFS services are started as well. Since services depend on a properly configured DNS, one container will automatically be started with a DNS forwarder. All containers are also accessible via ssh using a pre-configured RSA key.
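
For example, to log in to a worker you can point ssh at the container's address with the key shipped in the repository; the key path, user, and IP below are placeholders rather than literal values:

# <path-to-repo-key> is the pre-configured RSA key from the docker-scripts repository;
# 10.0.3.91 stands in for whatever address your worker container was assigned
$ ssh -i <path-to-repo-key> root@10.0.3.91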

If you want to make a directory on the host accessible to the containers — say to import some data into Spark — just pass it with the -v option. This directory is then mounted on the master and worker containers under /data.
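
For example, to make a local dataset directory visible inside the cluster, add it to the deploy command; the host path here is purely illustrative, and inside the containers it will appear as /data:

# /home/me/dataset is an example host directory, mounted as /data in every container
$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -v /home/me/dataset -c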

Both the Spark and Shark shells are started in separate containers. The shell container is started by the deploy script when the -c option is passed, but it can also be attached later.
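
If you did not pass -c, a shell container can be attached afterwards with a command of the following form; the actual nameserver and master addresses are printed by the deploy script (see the sample output below), so the values here are placeholders:

# <nameserver-ip> and <master-ip> are taken from the deploy script's output
$ sudo docker run -i -t -dns <nameserver-ip> amplab/spark-shell:0.8.0 <master-ip>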

So let’s start up a Spark 0.8.0 cluster with two workers and connect to the Spark shell right away.

$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c

You should see something like this:

*** Starting Spark 0.8.0 ***
...
***********************************************************************
connect to spark via:       sudo docker run -i -t -dns 10.0.3.89 amplab/spark-shell:0.8.0 10.0.3.90

visit Spark WebUI at:       http://10.0.3.90:8080/
visit Hadoop Namenode at:   http://10.0.3.90:50070
***********************************************************************

Once the shell is up, let's run a small example. (This assumes a small text file has already been placed at hdfs://master:9000/user/hdfs/test.txt, for instance by copying one in via the /data mount described above.)

scala> // count the lines of a text file stored in HDFS
scala> val textFile = sc.textFile("hdfs://master:9000/user/hdfs/test.txt")
scala> textFile.count()
scala> // identity map, then bring the lines back to the driver
scala> textFile.map(line => line).collect()

After you are done, you can terminate the cluster from the host via

$ sudo docker-scripts/deploy/kill_all.sh spark
$ sudo docker-scripts/deploy/kill_all.sh nameserver

which will kill all Spark and nameserver containers.

Notes

If you are running Docker via Vagrant on Mac OS, make sure to increase the memory allocated to the virtual machine to at least 2 GB.

There is more to come

Besides offering lightweight, isolated execution of worker processes inside containers via LXC, Docker also provides a kind of combined Git and GitHub for container images. Watch the AMPLab Docker account for updates.