Configuring Replication to Elasticsearch Types
When you index a MapR-DB binary table, you set up replication from that table to a type in an Elasticsearch cluster. After an initial load of data from the source table into the type, updates to the source table are replicated immediately to the type. Updates are not replicated in batches but as they happen. Replication of data to Elasticsearch indexes is asynchronous. MapR-DB does not wait to receive confirmation that changes have been replicated before it notifies client applications that requested operations on the MapR-DB database are complete.
Prerequisites
- Only one user should manage indexing of any given source MapR-DB table. If indexing of the table in a given Elasticsearch type is no longer needed and any other user attempts to run the maprcli table replica elasticsearch remove command to stop replicating from the table to that Elasticsearch type, the command fails with a permission-denied message.
- Configure two or more MapR gateways to handle communications between MapR-DB and each Elasticsearch cluster. See MapR Gateways.
- Ensure that your Elasticsearch cluster is registered with your source MapR-DB cluster. See Registering Elasticsearch Clusters with MapR Clusters.
- Ensure that your user ID has the readAce and writeAce permissions on the volumes where the tables are located. For information about how to set permissions on volumes, see Setting/Modifying Whole Volume ACEs.
- Run the maprcli table info command on the source table to verify that your user ID has the following permissions:
  - readperm, which is required for reading from the table.
  - replperm, which is required for replicating from the table.
- Ensure that the _source field in Elasticsearch is enabled for all documents.
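The permission check in the prerequisites can be sketched as a short script. This is a minimal sketch, not a definitive procedure: it assumes the source table path from the example later in this topic, uses the general maprcli -json output option, and assumes that readperm and replperm appear under those names in the command output. The helper show_repl_perms is hypothetical and only filters text.

```shell
#!/usr/bin/env bash
# Source table from the example in this topic; substitute your own path.
table=/mapr/sanfrancisco/customers

# Hypothetical helper: keep only the lines of the command output that
# mention the two permissions named in the prerequisites.
show_repl_perms() {
  grep -E 'readperm|replperm'
}

# Run the check only where the maprcli client is installed.
if command -v maprcli >/dev/null 2>&1; then
  maprcli table info -path "$table" -json | show_repl_perms
else
  echo "maprcli not found; run this on a MapR cluster node or client" >&2
fi
```

Confirm that your user ID is covered by the expressions shown for both permissions before you set up replication.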
About this task
- You cannot replicate data from more than one MapR-DB binary table into a single Elasticsearch type.
- The replication of deletes to Elasticsearch types is not supported.
- Versioning is not supported in Elasticsearch indexes. In MapR-DB (as in HBase), binary tables can store an unbounded number of cells that share the same row and column and differ only in the version dimension of the cell address, the version being specified as a long integer. However, in Elasticsearch, only one version of indexed cell data is retained.
- Do not replicate puts that are made with timestamps. Because Elasticsearch retains only the most recently indexed value for a cell, an Elasticsearch type will fall out of synchronization with its corresponding source table if any puts with timestamps are made to the table out of order or replicated to the type out of order.
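The effect of out-of-order timestamped puts can be illustrated with a toy model. The script below makes no MapR-DB or Elasticsearch calls; it only assumes, as described above, that the source table returns the version with the highest timestamp while the Elasticsearch type keeps whichever value was indexed last. The function, variable names, and values are hypothetical.

```shell
#!/usr/bin/env bash
# Toy model only: no MapR-DB or Elasticsearch calls are made.
# table_ts/table_val: newest version in the source table (highest timestamp wins).
# es_val: value held in the Elasticsearch type (last indexed write wins).
table_ts=-1
table_val=""
es_val=""

apply_put() {
  local ts=$1 val=$2
  # Source table: a read returns the version with the highest timestamp.
  if [ "$ts" -gt "$table_ts" ]; then
    table_ts=$ts
    table_val=$val
  fi
  # Elasticsearch type: each replicated put overwrites the previous value.
  es_val=$val
}

apply_put 200 "gold"    # put with the later timestamp arrives first
apply_put 100 "silver"  # put with the earlier timestamp arrives second

echo "table=$table_val es=$es_val"
```

In this model the table reports gold while the type holds silver, which is the kind of divergence that the restriction on timestamped puts is meant to avoid.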
Procedure
Run the maprcli table replica elasticsearch autosetup command.
This example indexes all of the columns in the column families personal and purchase, as well as the columns number_of_stars and date in the column family review, of the MapR-DB source table customers in the MapR cluster sanfrancisco:
maprcli table replica elasticsearch autosetup -path /mapr/sanfrancisco/customers -target myescluster -index myproduct -type customers -columns personal,purchase,review:number_of_stars,review:date
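When the list of column families and columns is long, the -columns value can be assembled from separate entries before the command is run. The following is a sketch under stated assumptions: join_columns is a hypothetical helper, not part of maprcli, and the script prints the resulting command for review instead of running it. The paths and names are the ones from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join column families and family:column entries
# into the comma-separated value passed to -columns.
join_columns() {
  local IFS=,
  echo "$*"
}

columns=$(join_columns personal purchase review:number_of_stars review:date)

# Print the command for review instead of running it.
echo maprcli table replica elasticsearch autosetup \
  -path /mapr/sanfrancisco/customers \
  -target myescluster -index myproduct -type customers \
  -columns "$columns"
```

Remove the leading echo to run the command once the printed line looks correct.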
Results
Client applications can now start updating the source table in MapR-DB and querying the indexed data in Elasticsearch.
This command registers the destination type as a replica of the source table, copies the content of the source table into the type, and then starts the replication stream to keep the type updated.
The maprcli table replica elasticsearch autosetup command starts a MapReduce job. The duration of the job depends on the size of the source table and the number of columns that you are indexing. Moreover, the volume of data and the speed at which the Elasticsearch type is populated could perceptibly slow the performance of other processes running at the same time on the Elasticsearch cluster. The less data there is to copy to the type, the sooner the MapReduce job ends and the fewer resources it consumes on the Elasticsearch cluster.
By default, this command causes all column families to be replicated. If you want to specify a subset of column families, individual columns, or both, use the -columns parameter. Columns that you specify do not have to exist in the source table at the time that you run this command; you can create them later. However, column families that you specify must exist in the source table at the time that you run this command.
What to do next
If you ever need to change the selection of columns or column families that you want to index, use the maprcli table replica elasticsearch edit command.
To see statistics about replication from the source table, including the number of pending puts and the number of pending bytes to transfer, run the maprcli table replica elasticsearch list command, specifying the source table with the -path parameter.