Integrate Spark with R

Integrate Spark with R when you want to run R programs as Spark jobs.

About this task

As of Spark 1.5.2, you can integrate Spark with R through SparkR, an R package that provides access to Spark DataFrames from R.

Procedure

  1. Install R 3.2.2 or later on each node that will submit Spark jobs:
    • On Ubuntu:
      apt-get install r-base-dev
    • On CentOS/Red Hat:
      yum install R

    For more information on installing R, see the R documentation.

  2. To verify the integration, run the following commands as the mapr user or as a user that mapr impersonates:
    1. Start SparkR:
/opt/mapr/spark/spark-1.5.2/bin/sparkR --master <master-url>

      In this command, <master-url> specifies the cluster manager, for example yarn-client or a spark://<host>:<port> URL.
2. Run the following command to create a DataFrame from the sample data (the sparkR shell creates the sqlContext object automatically at startup):
      people <- read.df(sqlContext, "file:///opt/mapr/spark/spark-1.5.2/examples/src/main/resources/people.json", "json")
3. Run the following command to display the first rows of the DataFrame that you just created:
      head(people)
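
If head(people) returns the rows from the sample file, the integration is working. To exercise the SQL path as well, you can register the DataFrame as a temporary table and query it from the same sparkR session. The following is a minimal sketch; it assumes the people DataFrame from the previous step and the sqlContext object that the sparkR shell creates at startup:

  # Register the DataFrame as a temporary table so that it can be queried with SQL
  registerTempTable(people, "people")

  # Select the names of people aged 13 to 19 from the temporary table
  teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")

  # Collect the distributed result into the local R session and print it
  collect(teenagers)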
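
To run an R program as a noninteractive Spark job, you can place the same logic in a script and pass the script to spark-submit, which accepts R files in Spark 1.5.2. The following is a minimal sketch; the file name people-example.R and the application name PeopleExample are placeholders:

  library(SparkR)

  # Initialize a SparkContext and an SQLContext explicitly; a standalone script
  # does not get the sc and sqlContext objects that the sparkR shell creates
  sc <- sparkR.init(appName = "PeopleExample")
  sqlContext <- sparkRSQL.init(sc)

  # Load the sample data and print the first rows, as in the interactive check
  people <- read.df(sqlContext, "file:///opt/mapr/spark/spark-1.5.2/examples/src/main/resources/people.json", "json")
  print(head(people))

  sparkR.stop()

Submit the script in the same way that you would submit a Python application:

  /opt/mapr/spark/spark-1.5.2/bin/spark-submit --master <master-url> people-example.R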