Integrate Hue with Spark

About this task

As of Hue 3.8.1-1507 and Spark 1.3.1, you can configure Hue to use the Spark Notebook UI, which allows users to submit Spark jobs from Hue.
NOTE: Spark Notebook is a beta feature that uses the Spark REST Job Server (Livy).
Complete the following steps as the root user or by using sudo:

Procedure

  1. Install the mapr-hue-livy-3.8.1 package on the node where you installed the mapr-spark package and configured Spark.
    On Ubuntu
    apt-get install mapr-hue-livy
    On RedHat/CentOS
    yum install mapr-hue-livy
    NOTE: If you do not install the mapr-hue-livy package on a node where the mapr-spark package is installed, the Livy service will not start.
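    As an optional check, you can confirm that the package installed by querying the package manager:
    On Ubuntu
    dpkg -l | grep mapr-hue-livy
    On RedHat/CentOS
    rpm -q mapr-hue-livy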
  2. For Spark 1.3.1: Copy javax.servlet-api-3.1.0.jar to the Spark lib directory.
    cp /opt/mapr/hue/hue-<version>/apps/spark/java-lib/javax.servlet-api-3.1.0.jar /opt/mapr/spark/spark-<version>/lib/
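    You can confirm that the JAR is in place by listing it:
    ls /opt/mapr/spark/spark-<version>/lib/javax.servlet-api-3.1.0.jar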
  3. In the spark-env.sh file, configure the SPARK_SUBMIT_CLASSPATH environment variable to include the path to the servlet JAR before MAPR_SPARK_CLASSPATH.
    SPARK_SUBMIT_CLASSPATH=$SPARK_SUBMIT_CLASSPATH:/opt/mapr/spark/spark-<version>/lib/javax.servlet-api-3.1.0.jar:$MAPR_SPARK_CLASSPATH
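    The spark-env.sh file is located in the Spark conf directory (/opt/mapr/spark/spark-<version>/conf/spark-env.sh). As an optional check, you can confirm the entry with:
    grep SPARK_SUBMIT_CLASSPATH /opt/mapr/spark/spark-<version>/conf/spark-env.sh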
  4. In the [spark] section of the hue.ini file, set the livy_server_host parameter to the host where the Livy server is running.
    [spark]
    # IP or hostname of livy server.
    livy_server_host=ubuntu500
    NOTE: If the Livy server runs on the same node as the Hue UI, you do not need to set this property because the value defaults to the local host.
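    On a default installation, the hue.ini file is typically located at /opt/mapr/hue/hue-<version>/desktop/conf/hue.ini; you can confirm the setting with:
    grep livy_server_host /opt/mapr/hue/hue-<version>/desktop/conf/hue.ini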
  5. If Spark jobs run on YARN, perform the following steps:
    1. In the hue.ini file on the node where the Livy server is running, set livy_server_session_kind to yarn.
      [spark]
      # IP or hostname of livy server.
      livy_server_host=ubuntu500
      livy_server_session_kind=yarn
    2. For Hue 3.9.0: Set the HUE_HOME and HADOOP_CONF_DIR environment variables in the hue.sh file (/opt/mapr/hue/hue-<version>/bin/hue.sh).
      export HUE_HOME=${bin}/..
      export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-<version>/etc/hadoop
      NOTE: If you do not set these environment variables, the following error appears in the Check Configuration page:
      The app won't work without running Livy Spark Server
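      As an optional check, you can confirm that both variables are set by searching the script:
      grep -E 'HUE_HOME|HADOOP_CONF_DIR' /opt/mapr/hue/hue-<version>/bin/hue.sh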
  6. Restart the Spark REST Job Server (Livy).
    maprcli node services -name livy -action restart -nodes <livy node>
  7. Restart Hue.
    maprcli node services -name hue -action restart -nodes <hue node>
  8. Restart Spark.
    maprcli node services -name spark-master -action restart -nodes <space-delimited list of nodes>
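    NOTE: As an optional check after the restarts, you can confirm that the Livy server is responding by querying its REST API. This assumes Livy's default port, 8998:
    curl http://<livy node>:8998/sessions
    A running server returns a JSON document that lists the active sessions.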

Additional Information
  • To access the Notebook UI, select Spark from the Query Editor in the Hue interface.
  • If needed, you can use the MCS or maprcli to start, stop, or restart the Livy Server. For more information, see Starting, Stopping, and Restarting Services.
NOTE: Troubleshooting Tip
If you have more than one version of Python installed, you may see the following error when executing Python samples:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe...

Workaround:

Set the following environment variables in /opt/mapr/spark/spark-<version>/conf/spark-env.sh:

export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7
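
You can verify the interpreter directly; because the change applies only to newly started sessions, you may also need to restart the Livy server (see step 6 above):

/usr/bin/python2.7 --version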