Run Spark Jobs with Oozie
About this task
- (Optional) Update the Spark shared libraries. By default, Oozie ships with shared
libraries for a specific Spark version. To update the shared libraries with the version of
Spark that you are running, complete the following steps:
- Stop Oozie:
  maprcli node services -name oozie -action stop -nodes <space delimited list of nodes>
- In the /opt/mapr/oozie/oozie-<version>/share2/lib/spark directory, remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar.
- As of Oozie 4.2.0-1510, in the /opt/mapr/oozie/oozie-<version>/share1/lib/spark directory, also remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar.
- Copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share2/lib/spark/ directory:
  cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share2/lib/spark/
- As of Oozie 4.2.0-1510, also copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share1/lib/spark/ directory:
  cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share1/lib/spark/
- For Spark 1.5.2-1603 and above, when the cluster is secure and uses Kerberos authentication, copy spark-defaults.conf to /opt/mapr/oozie/oozie-<version>/conf/spark-conf:
  mkdir /opt/mapr/oozie/oozie-<version>/conf/spark-conf
  cp /opt/mapr/spark/spark-<version>/conf/spark-defaults.conf /opt/mapr/oozie/oozie-<version>/conf/spark-conf/
- Start Oozie:
maprcli node services -name oozie -action start -nodes <space delimited list of nodes>
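The steps above can be sketched as a single script. This is a dry-run sketch only: the version strings and node list are assumptions, and each command is printed rather than executed so it can be reviewed first.

```shell
# Dry-run sketch of the shared-library refresh above. OOZIE_VERSION,
# SPARK_VERSION, and NODES are assumptions; substitute your own values.
OOZIE_VERSION="4.2.0"
SPARK_VERSION="1.6.1"
NODES="node1 node2"
OOZIE_HOME="/opt/mapr/oozie/oozie-${OOZIE_VERSION}"
SPARK_HOME="/opt/mapr/spark/spark-${SPARK_VERSION}"

# Print each command instead of executing it; drop the echo to run for real.
run() { echo "$@"; }

run maprcli node services -name oozie -action stop -nodes "$NODES"
# share1 applies as of Oozie 4.2.0-1510; earlier releases only use share2.
for share in share2 share1; do
  # Remove every jar except the Oozie Spark sharelib jar itself.
  run find "${OOZIE_HOME}/${share}/lib/spark" -name '*.jar' \
      ! -name 'oozie-sharelib-spark-*-mapr.jar' -delete
  run cp "${SPARK_HOME}/lib/spark-assembly-*.jar" "${OOZIE_HOME}/${share}/lib/spark/"
done
run maprcli node services -name oozie -action start -nodes "$NODES"
```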
- As of Oozie 4.1.0-1601 and Oozie 4.2.0-1601, if the oozie.service.WorkflowAppService.system.libpath property in oozie-site.xml does not use the default value (/oozie/share/), you must perform the following steps to update the shared libraries:
  - Based on the cluster MapReduce mode, run one of the following commands to copy the new Oozie shared libraries to MapR-FS:
    - YARN mode:
      sudo -u mapr {OOZIE_HOME}/bin/oozie-setup.sh sharelib create -fs maprfs:/// -locallib /opt/mapr/oozie/oozie-<version>/share2
    - Classic mode:
      sudo -u mapr {OOZIE_HOME}/bin/oozie-setup.sh sharelib create -fs maprfs:/// -locallib /opt/mapr/oozie/oozie-<version>/share1
- Run the following command to update the Oozie classpath with the new shared
libraries:
sudo -u mapr {OOZIE_HOME}/bin/oozie admin -sharelibupdate
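After the sharelib update, you can sanity-check that the new jars were picked up with the Oozie CLI's shareliblist sub-command. A hedged sketch, printed as a dry run; the OOZIE_HOME value is an assumption:

```shell
# Assumption: OOZIE_HOME matches your installed Oozie version.
OOZIE_HOME="/opt/mapr/oozie/oozie-4.2.0"
# List the jars Oozie currently serves from the spark sharelib; run this
# on a cluster node after -sharelibupdate to confirm the new assembly jar.
echo "sudo -u mapr ${OOZIE_HOME}/bin/oozie admin -shareliblist spark"
```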
- Configure a Spark action. You can use Oozie 4.1.0 or greater to run a Spark job. To run a Spark job, add a Spark action to the workflow.xml associated with the workflow that should run the Spark job. When you configure the Spark action in the workflow.xml, specify the master element based on the mode of the Spark job:
  - For Spark standalone mode, specify the Spark Master URL in the master element. For example, if your Spark Master URL is spark://ubuntu2:7077, you would replace the <master>local[*]</master> in the example below with <master>spark://ubuntu2:7077</master>.
  - For Spark on YARN mode, specify yarn-client or yarn-cluster in the master element. For example, for yarn-cluster mode, you would replace <master>local[*]</master> with <master>yarn-cluster</master>.

  Here is an example of a Spark action within a workflow.xml file:

  <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
      <start to='spark-node' />
      <action name='spark-node'>
          <spark xmlns="uri:oozie:spark-action:0.1">
              <job-tracker>${jobTracker}</job-tracker>
              <name-node>${nameNode}</name-node>
              <master>local[*]</master>
              <name>Spark-FileCopy</name>
              <class>org.apache.oozie.example.SparkFileCopy</class>
              <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
              <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
              <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output</arg>
          </spark>
          <ok to="end" />
          <error to="fail" />
      </action>
      <kill name="fail">
          <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name='end' />
  </workflow-app>