Configuring Replication to Elasticsearch Types

When you index a MapR-DB binary table, you set up replication from that table to a type in an Elasticsearch cluster. After an initial load of data from the source table into the type, updates to the source table are replicated to the type as they happen rather than in batches. Replication of data to Elasticsearch indexes is asynchronous: MapR-DB does not wait for confirmation that changes have been replicated before it notifies client applications that their requested operations on the MapR-DB database are complete.

Prerequisites

  • Only one user should manage indexing of any given source MapR-DB table. If indexing of the table in a given Elasticsearch type is no longer needed and a user other than the one who manages indexing runs the command maprcli table replica elasticsearch remove to stop replicating from the table to that Elasticsearch type, the command fails with a permission-denied message.
  • Configure two or more MapR gateways to handle communications between MapR-DB and each Elasticsearch cluster. See MapR Gateways.
  • Ensure that your Elasticsearch cluster is registered with your source MapR-DB cluster. See Registering Elasticsearch Clusters with MapR Clusters.
  • Ensure that your user ID has the readAce and writeAce permissions on the volumes where the tables are located. For information about how to set permissions on volumes, see Setting/Modifying Whole Volume ACEs.
  • Run the maprcli table info command on the source table to verify that your user ID has the following permissions:
    • readperm, which is required for reading from the table.
    • replperm, which is required for replicating from the table.
  • Ensure that the _source field in Elasticsearch is enabled for all documents.
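
The permission check in the prerequisites above can be sketched as a small shell helper. This is a minimal sketch, not a definitive procedure: it assumes that the output of maprcli table info contains fields named readperm and replperm (as the list above states) and that each field appears on its own line. The function name show_repl_perms is hypothetical, and the -json option in the usage comment is an assumption about the output format you prefer.

```shell
# Hypothetical helper: read `maprcli table info` output on standard input
# and keep only the lines that mention readperm or replperm.
show_repl_perms() {
  grep -E 'readperm|replperm'
}

# Usage on a node where maprcli is installed (the path is the example
# table used later on this page; -json is assumed here for readability):
#   maprcli table info -path /mapr/sanfrancisco/customers -json | show_repl_perms
```

If your user ID does not appear in both fields, ask the table owner to grant the missing permission before you set up replication.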

About this task

This task has the following restrictions:
  • You cannot replicate data from more than one MapR-DB binary table into a single Elasticsearch type.
  • The replication of deletes to Elasticsearch types is not supported.
  • Versioning is not supported in Elasticsearch indexes. In MapR-DB (as in HBase), binary tables can store an unbounded number of cells that share the same row and column and differ only in the version dimension of the cell address, the version being specified as a long integer. In Elasticsearch, however, only one version of indexed cell data is retained.
  • Do not replicate puts that are made with timestamps. Because Elasticsearch retains only the most recently indexed value for a cell, an Elasticsearch type will fall out of synchronization with its corresponding source table if any puts with timestamps are made to the table out of order or replicated to the type out of order.

Procedure

To configure replication from a MapR-DB binary table to an Elasticsearch type:
On the source MapR cluster, run the command maprcli table replica elasticsearch autosetup.

The following example indexes all of the columns in the column families personal and purchase, as well as the columns number_of_stars and date in the column family review, of the MapR-DB source table customers in the MapR cluster sanfrancisco.

maprcli table replica elasticsearch autosetup -path /mapr/sanfrancisco/customers -target myescluster -index myproduct -type customers -columns personal,purchase,review:number_of_stars,review:date

Results

Client applications can now start updating the source table in MapR-DB and querying the indexed data in Elasticsearch.

The maprcli table replica elasticsearch autosetup command registers the destination type as a replica of the source table, copies the content of the source table into the type, and then starts the replication stream to keep the type updated.

NOTE: To copy the content of the source table into the type, the maprcli table replica elasticsearch autosetup command starts a MapReduce job. The duration of the job depends on the size of the source table and the number of columns that you are indexing. Moreover, the volume of data and the speed at which the Elasticsearch type is populated could noticeably slow other processes running at the same time on the Elasticsearch cluster. The less data there is to copy to the type, the sooner the MapReduce job finishes and the fewer resources the job consumes on the Elasticsearch cluster.

By default, this command causes all column families to be replicated. If you want to specify a subset of column families, individual columns, or both, use the -columns parameter. Columns that you specify do not have to exist in the source table at the time that you run this command; you can create them later. However, column families that you specify must exist in the source table at the time that you run this command.
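
The value of the -columns parameter in the example above mixes whole column families (personal, purchase) with individual columns written as family:column (review:number_of_stars, review:date), separated by commas. As a minimal sketch of composing that value in a script, the helper below joins its arguments with commas. The function name join_columns is hypothetical and is not part of maprcli.

```shell
# Hypothetical helper: join column-family and family:column specifiers
# with commas to form a value for the -columns parameter.
# The body runs in a subshell so the IFS change does not leak out.
join_columns() (
  IFS=,
  printf '%s\n' "$*"
)

columns=$(join_columns personal purchase review:number_of_stars review:date)
echo "$columns"
# prints: personal,purchase,review:number_of_stars,review:date

# The result can then be passed to the command shown earlier on this page:
#   maprcli table replica elasticsearch autosetup -path /mapr/sanfrancisco/customers \
#     -target myescluster -index myproduct -type customers -columns "$columns"
```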

What to do next

If you ever need to change the selection of columns or column families that you want to index, use the maprcli table replica elasticsearch edit command.

To see statistics about replication from the source table, including the number of pending puts and the number of pending bytes to transfer, run the maprcli table replica elasticsearch list command, specifying the source table with the -path parameter.
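
As a minimal sketch of the statistics check described above, the wrapper below runs maprcli table replica elasticsearch list with the -path parameter for a given source table. The function names run and replica_stats and the DRY_RUN variable are hypothetical conveniences for this example; with DRY_RUN set, the command is only printed, which lets you review it on a machine where maprcli is not installed.

```shell
# Hypothetical wrapper: print the command when DRY_RUN is set,
# otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: list replication statistics for one source table.
# $1 is the path of the source MapR-DB table.
replica_stats() {
  run maprcli table replica elasticsearch list -path "$1"
}

# Dry run with the example table from this page; prints the command only.
DRY_RUN=1 replica_stats /mapr/sanfrancisco/customers
```

Without DRY_RUN, the same call runs the command on the source cluster and shows the pending puts and pending bytes for each Elasticsearch replica of the table.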