Integrate Spark-SQL with Avro

Integrate Spark-SQL with Avro when you want to read and write Avro data.

About this task

As of Spark 1.5.2, you must complete the following steps to integrate Spark-SQL with Avro. Previous versions of Spark do not require these steps.

Procedure

  1. Download the Avro 1.7.7 JAR file to the Spark lib directory (/opt/mapr/spark/spark-<version>/lib).
    You can download the file from the Maven repository: http://mvnrepository.com/artifact/org.apache.avro/avro/1.7.7
  2. Use one of the following methods to add the Avro 1.7.7 JAR to the classpath:
    • Prepend the Avro 1.7.7 JAR file to the spark.executor.extraClassPath and spark.driver.extraClassPath properties in the spark-defaults.conf (/opt/mapr/spark/spark-<version>/conf/spark-defaults.conf) file:
      spark.executor.extraClassPath  /opt/mapr/spark/spark-1.5.2/lib/avro-1.7.7.jar:<rest_of_path>
      spark.driver.extraClassPath    /opt/mapr/spark/spark-1.5.2/lib/avro-1.7.7.jar:<rest_of_path>
    • Specify the Avro 1.7.7 JAR file with command-line arguments when you launch the Spark shell:
      /opt/mapr/spark/spark-<version>/bin/spark-shell \
      --packages com.databricks:spark-avro_2.10:2.0.1 \
      --driver-class-path /opt/mapr/spark/spark-<version>/lib/avro-1.7.7.jar \
      --conf spark.executor.extraClassPath=/opt/mapr/spark/spark-<version>/lib/avro-1.7.7.jar \
      --master <master-url>
      NOTE: In this case, the master URL for the cluster is either spark://<host>:7077 or yarn-client, because yarn-cluster mode is not supported with the Spark shell.
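
After the JAR is on the classpath, you can verify the integration from the Spark shell by reading and writing Avro data with the spark-avro package. The following is a minimal Scala sketch for the Spark 1.5.2 shell; the input and output paths are placeholders, so replace them with files on your cluster:

  // Import the spark-avro implicits so that .avro() is available
  // on DataFrameReader and DataFrameWriter.
  import com.databricks.spark.avro._

  // Read an Avro file into a DataFrame (placeholder input path).
  val episodes = sqlContext.read.avro("/user/mapr/episodes.avro")
  episodes.printSchema()

  // Write the DataFrame back out in Avro format (placeholder output path).
  episodes.write.avro("/user/mapr/episodes-out")

If the Avro 1.7.7 JAR is not on both the driver and executor classpaths, the read or write typically fails with a NoSuchMethodError or ClassNotFoundException for Avro classes, which is a sign to recheck step 2.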