Dave Hrycyszyn

2014-05-14 10:00:00 +0100

Setting up Spark in cluster mode

In the previous post, we set up the simplest possible Spark job and ran it in local mode.

This is about as easy as it gets, and it was a good intro experiment. To see the true benefits of Spark, though, you’ll want to run your jobs on a Spark cluster, so that they can be distributed across multiple networked machines. As a first step towards this goal, let’s see how our original example changes when we run it on a cluster. For the moment, we will run the application in cluster mode on a single machine in order to keep things simple.

At the time of writing, the latest stable Spark is 0.9.1. Let’s download that and set it up in cluster mode, again running from our local machine.

Go to the Spark downloads page and grab a Spark distribution. Spark sits on top of Hadoop’s infrastructure, and it comes in multiple flavours which match the various Hadoop distributions. You can either build Spark from source yourself and integrate it with an existing Hadoop cluster, or you can download a pre-built package matched to a particular Hadoop distribution and run it yourself. Each pre-built tarball contains Spark compiled against a specific version of Hadoop.

  • Hadoop1 [HDP1, CDH3] is built against Hadoop 1.x and works with HDP1 and the Cloudera Distribution for Apache Hadoop, version 3 (aka CDH3).
  • CDH4 is built against the Cloudera Distribution for Apache Hadoop, version 4.
  • Hadoop2 [HDP2, CDH5] is built against Hadoop 2.x and works with HDP2 and the Cloudera Distribution for Apache Hadoop, version 5 (aka CDH5).

I happen to have a project I’m working on which is on CDH3, so I’m going to use that.

Let’s define a place where Spark lives on your system. If you’re on a Mac, open up ~/.bash_profile, and if you’re on Linux, open ~/.bashrc, and add the following to it:

export SPARK_HOME="/path/to/spark-0.9.1-bin-hadoop1"

Change this path to wherever you typically keep downloaded software on your development machine. I’m just putting mine in a directory called software in my home folder; you might choose /usr/local/src. Make sure you source ~/.bash_profile or source ~/.bashrc after you add SPARK_HOME so that your current terminal picks up the change.
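A quick way to check that your current shell has picked up the variable:

echo $SPARK_HOME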

Next, download the Hadoop1 [HDP1, CDH3] package and extract it to the location you set as $SPARK_HOME, then change to that directory in your terminal:

cd $SPARK_HOME
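If you’d rather do the download and extraction entirely from the terminal, the whole sequence looks something like this (the archive URL is just one place the 0.9.1 release is mirrored; take the exact link from the downloads page):

cd ~/software   # or wherever you keep downloaded software
wget https://archive.apache.org/dist/spark/spark-0.9.1/spark-0.9.1-bin-hadoop1.tgz
tar xzf spark-0.9.1-bin-hadoop1.tgz
cd $SPARK_HOME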

It’s almost time to start our cluster, using the command sbin/start-all.sh. However, if we’re impatient and try it out, we’ll see that this won’t work just yet:

sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /Users/dave/software/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-dave-org.apache.spark.deploy.master.Master-1-ED209.local.out
localhost: ssh: connect to host localhost port 22: Connection refused

What’s up with the “Connection refused”?

In order to run in cluster mode, Spark needs SSH access to every machine in the cluster, so that it can distribute instructions from the master node to worker nodes, and also make your program jarfiles available to the different machines in the cluster.

Even though we’re only going to be running on a single machine for this example, we need to allow SSH access from our development machine to itself in order for the cluster setup to work.

This is pretty easy to do. If you’ve already got a public key set up on your machine at ~/.ssh/id_rsa.pub, give your machine the ability to log into itself, like this:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If you don’t happen to have an SSH key yet, or if your existing key has a passphrase on it, you can generate a new one like this, saving the resulting key without a passphrase:

ssh-keygen

Save the resulting keyfiles as ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. Make sure you have run the cat command from the example above so that the new key gets added into authorized_keys.
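If you’d rather script the whole thing, the sequence looks roughly like this. Only do this if you don’t already have a key you care about, since ssh-keygen will offer to overwrite ~/.ssh/id_rsa; the -N "" flag gives the new key an empty passphrase, which is what we want here:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys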

Next, make sure you have an SSH server running on your machine. If you’re on Debian or Ubuntu Linux, sudo apt-get install openssh-server will set one up for you. On a Mac, you can just go to the “Sharing” preferences and tick the “Remote Login” box.
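With the key in place and the server running, it’s worth confirming that password-less login actually works before going any further (the first connection will ask you to accept localhost’s host key):

ssh localhost echo "ssh to localhost works"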

With cluster access set up, let’s try starting our cluster again.

cd $SPARK_HOME
sbin/stop-all.sh # this will clean up our last, not-quite-successful startup attempt
sbin/start-all.sh

Output should look something like this:

$ sbin/start-all.sh
org.apache.spark.deploy.master.Master running as process 3273. Stop it first.
localhost: Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/dave/software/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-dave-org.apache.spark.deploy.worker.Worker-1-ED209.local.out

If your Spark master gives you a message that its port is already in use, make sure you’re not already running something else on port 8080.
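If you’re not sure what’s already sitting on that port, lsof will tell you (this works on both Mac and Linux):

lsof -i :8080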

We’ve now got two Spark-related processes running. The Spark Master should be running a web interface on port 8080, accessible at http://localhost:8080. Click on the single link to the Worker node and you’ll get an overview of which jobs that worker is currently running (nothing at present).
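You can also confirm from the terminal that both daemons are up. jps, which ships with the JDK, lists running JVMs by class name, so you should see one entry for the Master and one for the Worker alongside their process ids:

jps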

Ok, so now we’ve got a running cluster. Make a note of the cluster URL (by default, it’ll show up in the web UI as spark://yourhostname:7077). You’ll need it in a moment.
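As a preview of how that URL gets used: when we move the job from the previous post onto the cluster, the master URL takes the place of the “local” string we passed to the SparkContext. Roughly sketched, with placeholder application and jar names rather than the exact code from the previous post:

import org.apache.spark.SparkContext

// "spark://yourhostname:7077" is the cluster URL shown in the web UI.
// The jar path is a placeholder for wherever sbt packages your job.
val sc = new SparkContext(
  "spark://yourhostname:7077",                         // master URL instead of "local"
  "MySparkJob",                                        // application name
  System.getenv("SPARK_HOME"),                         // Spark installation directory
  Seq("target/scala-2.10/my-spark-job_2.10-1.0.jar"))  // jars to ship to the workers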

Let’s verify that the cluster is working by running some code in the interactive shell:

cd $SPARK_HOME
bin/spark-shell

All going well, you should see something like this:

$ bin/spark-shell
14/05/11 16:21:30 INFO HttpServer: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/05/11 16:21:30 INFO HttpServer: Starting HTTP Server
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65)
Type in expressions to have them evaluated.
Type :help for more information.

You can either run some of the Spark Examples, or type exit to get out of the shell.
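The shell gives you a ready-made SparkContext called sc, so a one-liner is enough to prove that things are wired up. If you want the shell to run against the standalone cluster rather than locally, pass it the master URL you noted down earlier when you start it:

MASTER=spark://yourhostname:7077 bin/spark-shell

scala> sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()

That should come back with a count of 500, and while the shell is connected you’ll see it listed as a running application in the web UI at http://localhost:8080.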