MapR-DB OJAI Connector for Apache Spark

This section describes how to use the MapR-DB OJAI Connector for Apache Spark as a tool to build real-time and batch pipelines between your data and MapR-DB JSON, leveraging Spark within the pipeline.

Included is a set of APIs that enable MapR users to write applications that consume MapR-DB JSON tables and use them in Spark. The MapR-DB OJAI Connector for Apache Spark is a companion to the MapR-DB Binary Connector for Apache Spark, which provides the equivalent functionality for MapR-DB Binary tables.

Batch Data Transformation with MapR-DB as a Source and Destination for Spark

You can use the MapR-DB OJAI Connector with batch data. In this diagram, data from MapR-DB or MapR-FS is extracted and transformed using either Spark or Spark SQL, and then loaded into MapR-DB JSON:

Apache Spark Concepts

The following definition list summarizes key concepts referred to in this documentation:

Basic Spark
The initial Apache Spark implementation. It supports the Resilient Distributed Dataset (RDD) API. Also referred to simply as "Spark".
Spark SQL
Introduced after Basic Spark. It provides a more advanced way of accessing data, and it performs and scales better than Basic Spark. You use Spark SQL through one of the following: SQL queries, the Dataset API, or the DataFrame API.
RDD
Provides an abstraction for parallel access to data partitioned across nodes in an Apache Spark cluster.
Dataset
A distributed collection of data. It is similar to an RDD but leverages the benefits of the Spark SQL optimized query engine.
DataFrame
A Dataset organized into named columns: a Dataset of rows.
DStream
A sequence of RDDs representing a continuous stream of data.

MapR-DB OJAI Connector for Apache Spark Features

Principal features of the MapR-DB OJAI Connector for Apache Spark include the following:

  • Support for Scala, Java, and Python APIs
    Note: Support for Java and Python APIs is available starting in the MEP 4.1.0 release.
  • APIs that enable you to load data from a MapR-DB JSON table to an Apache Spark RDD, DataFrame, or Dataset
  • Projection and filter pushdown

    Whenever possible, the MapR-DB OJAI Connector for Apache Spark pushes projections and filter conditions down to MapR-DB for better performance.

  • Custom partitioner for RDDs

    RDDs support a custom partitioner that enables you to partition data for better performance.

  • APIs that save an Apache Spark RDD, DataFrame, or DStream to a MapR-DB JSON table using either normal or bulk insert
  • Support for Scala and Java bean classes

    You can load JSON documents as an RDD of Scala or Java bean classes.

  • Data locality

    When the connector reads data from MapR-DB, it uses the data locality feature of MapR-DB to spawn the Spark executors.
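The projection and filter pushdown feature listed above can be pictured with a small, connector-free model. The sketch below is only a conceptual illustration in plain Python: the `scan` function, the `TABLE` sample data, and the field names are hypothetical and are not part of the connector's API. It compares how many field values leave the store when the filter and projection are applied after loading versus at the source.

```python
# Conceptual model only: this is NOT the connector's API or implementation.
# `scan` is a hypothetical stand-in for a table scan performed by the data store.

TABLE = [
    {"_id": "u1", "first_name": "Ana", "last_name": "Ruiz", "age": 34, "bio": "x" * 100},
    {"_id": "u2", "first_name": "Ben", "last_name": "Cole", "age": 19, "bio": "y" * 100},
    {"_id": "u3", "first_name": "Cy", "last_name": "Diaz", "age": 52, "bio": "z" * 100},
]


def scan(table, condition=None, fields=None):
    """Yield documents from the store, applying any pushed-down filter and projection."""
    for doc in table:
        if condition is not None and not condition(doc):
            continue  # filtered at the source, never transferred
        if fields is not None:
            doc = {name: doc[name] for name in fields}  # projected at the source
        yield doc


def transferred_values(docs):
    """Count the field values that left the store (a rough proxy for data moved)."""
    return sum(len(doc) for doc in docs)


# Without pushdown: every document and every field is transferred, then trimmed afterwards.
everything = list(scan(TABLE))
without_pushdown = [{"first_name": d["first_name"]} for d in everything if d["age"] > 30]

# With pushdown: the store applies the same filter, then the projection, before transfer.
pushed = list(scan(TABLE, condition=lambda d: d["age"] > 30, fields=["first_name"]))

assert without_pushdown == pushed  # both approaches produce the same result
print(transferred_values(everything), transferred_values(pushed))  # prints "15 2"
```

Both approaches return the same rows, but in this toy model the pushed-down scan moves 2 field values instead of 15, which is the kind of saving the feature aims for.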

The following features are not supported:

  • MapR-DB Binary tables

    Only MapR-DB JSON tables are supported; access to MapR-DB binary tables is provided through the MapR-DB Binary Connector.

  • Secondary indexes

This matrix shows the programming languages and features supported:

            Scala   Java   Python
RDD         Yes     Yes    No
DataFrame   Yes     Yes    Yes
Dataset     Yes     Yes    No
DStream     Yes     No     No
Note: Examples for topics include Scala, Java, and Python implementations. If an implementation is missing for one of these languages, the feature is not supported for that language.
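When planning an application, it can help to have the matrix above in a form that code can query. The following is a minimal sketch in plain Python that transcribes the matrix into a dictionary. The names `SUPPORT`, `is_supported`, and `abstractions_for` are illustrative helpers, not part of the connector.

```python
# Support matrix transcribed from the table above: Spark abstraction -> language -> supported.
# SUPPORT, is_supported, and abstractions_for are hypothetical helper names.
SUPPORT = {
    "RDD": {"Scala": True, "Java": True, "Python": False},
    "DataFrame": {"Scala": True, "Java": True, "Python": True},
    "Dataset": {"Scala": True, "Java": True, "Python": False},
    "DStream": {"Scala": True, "Java": False, "Python": False},
}


def is_supported(abstraction, language):
    """Return True if the matrix lists the abstraction as supported for the language."""
    return SUPPORT[abstraction][language]


def abstractions_for(language):
    """List the abstractions the matrix marks as supported for the given language."""
    return [name for name, languages in SUPPORT.items() if languages[language]]


print(abstractions_for("Python"))  # ['DataFrame']
print(abstractions_for("Java"))    # ['RDD', 'DataFrame', 'Dataset']
```

Read this way, the matrix shows that a Python application is limited to the DataFrame API, while DStream support is available only from Scala.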

Supported Product Versions and System Requirements

To use the MapR-DB OJAI Connector for Apache Spark, you must have the following minimum software versions:

  • MapR 5.2.1 or later
  • MEP 3.0 or later
  • Spark 2.1.0 or later
  • Scala 2.11 or later
  • Java 8 or later
Support for DataFrames and Datasets is available starting in the MEP 4.0 release, and support for Java and Python APIs is available starting in the MEP 4.1.0 release.
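The version requirements above can be checked mechanically before deploying an application. The sketch below is one possible way to do that in plain Python. It assumes that version strings are plain dotted integers (for example, Java given as "8" rather than "1.8.0_151"), and the helper names `parse_version`, `at_least`, `unmet_requirements`, and `available_features` are hypothetical.

```python
# Minimum versions and MEP thresholds transcribed from the requirements above.
# Assumption: versions are plain dotted integers such as "2.1.0" or "8".
MINIMUM_VERSIONS = {
    "MapR": (5, 2, 1),
    "MEP": (3, 0),
    "Spark": (2, 1, 0),
    "Scala": (2, 11),
    "Java": (8,),
}
FEATURE_MIN_MEP = {
    "DataFrame and Dataset": (4, 0),
    "Java and Python APIs": (4, 1, 0),
}


def parse_version(text):
    """Turn a dotted version string such as '2.1.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def at_least(actual, minimum):
    """Compare version tuples, padding with zeros so that (4, 1) equals (4, 1, 0)."""
    width = max(len(actual), len(minimum))
    padded_actual = actual + (0,) * (width - len(actual))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_actual >= padded_minimum


def unmet_requirements(installed):
    """Return the components that are missing or older than the documented minimum."""
    return [
        name
        for name, minimum in MINIMUM_VERSIONS.items()
        if name not in installed or not at_least(parse_version(installed[name]), minimum)
    ]


def available_features(mep_version):
    """Return the MEP-gated features available at the given MEP version."""
    actual = parse_version(mep_version)
    return [name for name, minimum in FEATURE_MIN_MEP.items() if at_least(actual, minimum)]


installed = {"MapR": "6.0.0", "MEP": "4.0", "Spark": "2.1.0", "Scala": "2.11", "Java": "8"}
print(unmet_requirements(installed))  # []
print(available_features("4.0"))      # ['DataFrame and Dataset']
```

With MEP 4.0, this check reports DataFrame and Dataset support but not the Java and Python APIs, which matches the release notes in the paragraph above.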

OJAI API

The MapR-DB OJAI Connector for Apache Spark uses the OJAI API internally to access MapR-DB JSON tables.