Run Spark Jobs with Oozie

About this task

You can use Oozie 4.1.0 or later to run Spark jobs. Complete the following steps to configure Oozie to run them:
  1. (Optional) Update the Spark shared libraries. By default, Oozie ships with shared libraries for a specific Spark version. To update the shared libraries with the version of Spark that you are running, complete the following steps:
    1. Stop Oozie:
      maprcli node services -name oozie -action stop -nodes <space delimited list of nodes>
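      For example, to stop Oozie on two hypothetical nodes named node1 and node2:
      maprcli node services -name oozie -action stop -nodes node1 node2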
    2. In the /opt/mapr/oozie/oozie-<version>/share2/lib/spark directory, remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar.
    3. As of Oozie 4.2.0-1510, in the /opt/mapr/oozie/oozie-<version>/share1/lib/spark directory, also remove all *.jar files EXCEPT oozie-sharelib-spark-<version>-mapr.jar.
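      If you prefer not to delete the files by hand, the following is a minimal sketch using GNU find; it assumes the sharelib jar naming shown above, so verify the paths and pattern for your installation before running it:
      # Delete every jar except the Oozie Spark sharelib jar (oozie-sharelib-spark-<version>-mapr.jar)
      find /opt/mapr/oozie/oozie-<version>/share2/lib/spark -name '*.jar' ! -name 'oozie-sharelib-spark-*-mapr.jar' -delete
      find /opt/mapr/oozie/oozie-<version>/share1/lib/spark -name '*.jar' ! -name 'oozie-sharelib-spark-*-mapr.jar' -delete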
    4. Copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share2/lib/spark/ directory:
      cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share2/lib/spark/
    5. As of Oozie 4.2.0-1510, also copy spark-assembly-*.jar to the /opt/mapr/oozie/oozie-<version>/share1/lib/spark/ directory:
      cp /opt/mapr/spark/spark-<version>/lib/spark-assembly-*.jar /opt/mapr/oozie/oozie-<version>/share1/lib/spark/
    6. For Spark 1.5.2-1603 and later, if the cluster is secure and uses Kerberos authentication, copy spark-defaults.conf to /opt/mapr/oozie/oozie-<version>/conf/spark-conf:
      mkdir /opt/mapr/oozie/oozie-<version>/conf/spark-conf
      cp /opt/mapr/spark/spark-<version>/conf/spark-defaults.conf /opt/mapr/oozie/oozie-<version>/conf/spark-conf/
    7. Start Oozie:
      maprcli node services -name oozie -action start -nodes <space delimited list of nodes>
    8. As of Oozie 4.1.0-1601 and Oozie 4.2.0-1601, if the oozie.service.WorkflowAppService.system.libpath property in oozie-site.xml does not use the default value (/oozie/share/), you must perform the following steps to update the shared libraries:
      1. Based on the cluster MapReduce mode, run one of the following commands to copy the new Oozie shared libraries to MapR-FS:
        • YARN mode:
          sudo -u mapr {OOZIE_HOME}/bin/oozie-setup.sh sharelib create -fs maprfs:/// -locallib /opt/mapr/oozie/oozie-<version>/share2
        • Classic mode:
          sudo -u mapr {OOZIE_HOME}/bin/oozie-setup.sh sharelib create -fs maprfs:/// -locallib /opt/mapr/oozie/oozie-<version>/share1
      2. Run the following command to update the Oozie classpath with the new shared libraries:
        sudo -u mapr {OOZIE_HOME}/bin/oozie admin -sharelibupdate
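        To confirm that the new libraries are registered, you can list the Spark sharelib contents with the standard Oozie CLI:
        sudo -u mapr {OOZIE_HOME}/bin/oozie admin -shareliblist spark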
  2. Configure a Spark action. To run a Spark job, add a Spark action to the workflow.xml associated with the workflow that should run the job.
    When you configure the Spark action in the workflow.xml, specify the master element based on the mode of the Spark job:
    • For Spark standalone mode, specify the Spark Master URL in the master element. For example, if your Spark Master URL is spark://ubuntu2:7077, you would replace <master>local[*]</master> in the example below with <master>spark://ubuntu2:7077</master>.
    • For Spark on YARN mode, specify yarn-client or yarn-cluster in the master element. For example, for yarn-cluster mode, you would replace <master>local[*]</master> with <master>yarn-cluster</master>.

      Here is an example of a Spark action within a workflow.xml file:

      <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
          <start to='spark-node' />
          <action name='spark-node'>
              <spark xmlns="uri:oozie:spark-action:0.1">
                  <job-tracker>${jobTracker}</job-tracker>
                  <name-node>${nameNode}</name-node>
                  <master>local[*]</master>
                  <name>Spark-FileCopy</name>
                  <class>org.apache.oozie.example.SparkFileCopy</class>
                  <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
                  <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
                  <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output</arg>
              </spark>
              <ok to="end" />
              <error to="fail" />
          </action>
          <kill name="fail">
              <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
          </kill>
          <end name='end' />
      </workflow-app>
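
      The workflow above references parameters such as ${jobTracker}, ${nameNode}, and ${examplesRoot}. These are typically supplied through a job.properties file; the following is a minimal sketch with placeholder host names and paths that you must adapt to your cluster:

      nameNode=maprfs:///
      jobTracker=<resource-manager-host>:8032
      examplesRoot=examples
      oozie.use.system.libpath=true
      oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

      You can then submit the workflow with the Oozie client, for example:

      oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run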