Installing Spark Standalone

The following instructions explain how to install Spark Standalone manually.

Prerequisites

NOTE: You can also install Spark Standalone using the MapR Installer.

The following procedure uses the operating system package managers to download and install Spark packages from the MapR Repository. For instructions on setting up the ecosystem repository (which includes Spark), see Prepare Packages and Repositories.

About this task

Spark is distributed as three separate packages:

Package	Description
mapr-spark	Install this package on Spark worker nodes. This package is dependent on the mapr-client package.
mapr-spark-master	Install this package on Spark master nodes. Spark master nodes must be able to communicate with Spark worker nodes over SSH without using passwords. This package is dependent on the mapr-spark package and the mapr-core package.
mapr-spark-historyserver	Install this optional package on Spark History Server nodes. This package is dependent on the mapr-spark package and mapr-core package.

Execute the following commands as root or using sudo.

Procedure

Create the /apps/spark directory on MapR-FS and set the correct permissions on the directory.
```
hadoop fs -mkdir /apps/spark
hadoop fs -chmod 777 /apps/spark
```
Use the appropriate commands for your operating system to install Spark.
On CentOS / RedHat
```
yum install mapr-spark mapr-spark-master mapr-spark-historyserver
```
On Ubuntu
```
apt-get install mapr-spark mapr-spark-master mapr-spark-historyserver
```
NOTE: The mapr-spark-historyserver package is optional.

Spark is installed into the /opt/mapr/spark directory.
On the nodes where you installed the Spark master and Spark History Server packages, run the following command. This command integrates the Spark master and Spark History Server services with the Warden daemon:
```
/opt/mapr/server/configure.sh -R
```
Copy /opt/mapr/spark/spark-<version>/conf/slaves.template to /opt/mapr/spark/spark-<version>/conf/slaves, then add the hostnames of the Spark worker nodes to the new file. Put one worker node hostname on each line. For example:
```
localhost
worker-node-1
worker-node-2
```
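The copy-and-edit step above can also be scripted. The following is a minimal sketch, not part of the product: the function name write_slaves_file and the hostnames in the example call are hypothetical, and the configuration directory must be supplied by you because it contains the installed Spark version.

```shell
#!/usr/bin/env bash
# Sketch: build conf/slaves from slaves.template and append worker hostnames.
# write_slaves_file is a hypothetical helper, not a MapR or Spark command.

write_slaves_file() {
  local conf_dir="$1"
  shift
  # Start from the shipped template so that its existing content is kept.
  cp "$conf_dir/slaves.template" "$conf_dir/slaves" || return 1
  # Append one worker hostname per line.
  local host
  for host in "$@"; do
    printf '%s\n' "$host" >> "$conf_dir/slaves"
  done
}

# Example call (replace <version> and the placeholder hostnames):
# write_slaves_file "/opt/mapr/spark/spark-<version>/conf" worker-node-1 worker-node-2
```

Because the helper copies the template first, any entry already present in slaves.template is carried over; review the resulting file before starting the workers.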
Set up passwordless SSH for the mapr user (see Preparing Each Node) so that the Spark master node can access all worker nodes defined in the conf/slaves file.
As the mapr user, start the worker nodes by running the following command on the master node. Because the Warden daemon manages the Master daemon, do not use the start-all.sh or stop-all.sh commands.
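One common way to give the master node password-free SSH access to the workers is to generate a key pair for the mapr user and copy the public key to each host listed in conf/slaves. The following is a sketch under assumptions: the helper names, the RSA key type, and the key path are illustrative choices, not requirements stated in this procedure. Follow your site's security policy where it differs.

```shell
#!/usr/bin/env bash
# Sketch: distribute the mapr user's public key to every host in conf/slaves.
# Intended to be run as the mapr user on the Spark master node.

# Print worker hostnames, skipping blank lines and '#' comment lines.
list_workers() {
  grep -v -E '^[[:space:]]*(#|$)' "$1"
}

# Create a key pair if none exists, then copy the public key to each worker.
# The key path and key type are assumptions.
setup_passwordless_ssh() {
  local slaves_file="$1"
  local key="$HOME/.ssh/id_rsa"
  [ -f "$key" ] || ssh-keygen -t rsa -N '' -f "$key"
  local host
  for host in $(list_workers "$slaves_file"); do
    # Prompts once for the mapr password on each worker.
    ssh-copy-id -i "$key.pub" "mapr@$host"
  done
}

# Example call (replace <version>):
# setup_passwordless_ssh "/opt/mapr/spark/spark-<version>/conf/slaves"
```

After the keys are copied, running ssh mapr@worker-node-1 from the master node should log in without a password prompt.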
```
/opt/mapr/spark/spark-<version>/sbin/start-slaves.sh
```
If the cluster is secure and YARN is not installed on the cluster, you must comment out the following entries in the yarn-site.xml file (/opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/yarn-site.xml):
- yarn.resourcemanager.ha.custom-ha-enabled
- yarn.client.failover-proxy-provider
- yarn.resourcemanager.recovery.enabled
Otherwise, the following errors may appear when you run a Spark job:
```
<DATE> <TIME> WARN ZKDataRetrieval: Can not get 
children of /services/resourcemanager/master with error: 
KeeperErrorCode = NoNode for /services/resourcemanager/master
```
```
<DATE> <TIME> ERROR MapRZKRMFinderUtils: Unable 
to determine ResourceManager service address from Zookeeper 
at <IP:port> java.lang.RuntimeException: Unable to determine 
ResoureManager service address from Zookeeper at <IP:port> 
```
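Before editing yarn-site.xml, it can help to locate the three entries. The following sketch prints the line number of every line that mentions one of them. The function name find_yarn_ha_entries is hypothetical, and the check only locates the property names; it does not tell you whether an entry is already commented out.

```shell
#!/usr/bin/env bash
# Sketch: list the lines in yarn-site.xml that mention the three properties
# named in this step. find_yarn_ha_entries is a hypothetical helper.

find_yarn_ha_entries() {
  grep -n -E \
    'yarn\.resourcemanager\.ha\.custom-ha-enabled|yarn\.client\.failover-proxy-provider|yarn\.resourcemanager\.recovery\.enabled' \
    "$1"
}

# Example call (use the yarn-site.xml path for your Hadoop version):
# find_yarn_ha_entries /path/to/yarn-site.xml
```

Each reported line sits inside a property element; comment out the whole element, not only the name line.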
If you want to integrate Spark with MapR Streams, install the Streams Client on each Spark node.
- On Ubuntu:
```
apt-get install mapr-kafka
```
- On RedHat/CentOS:
```
yum install mapr-kafka
```

Test your new installation by running the SparkPi example. Use the following command:

MASTER=spark://<Spark Master node hostname>:7077 /opt/mapr/spark/spark-<version>/bin/run-example org.apache.spark.examples.SparkPi 10