Writing Custom MapReduce Jobs for MapR-DB Using Bulk Loads
You can use the HFileOutputFormat configureIncrementalLoad() method to write custom MapReduce jobs that perform bulk loads. Although the name of the method implies that you can use it only for incremental bulk loads, the method also works for full bulk loads, provided that the -bulkload, BULKLOAD, or Bulkload parameter for the table is set to true, as described in Bulk Loading and MapR-DB Tables.
The HFileOutputFormat class on MapR clusters distinguishes between Apache HBase tables and MapR tables, behaving appropriately for each type. Existing workflows that rely on the HFileOutputFormat class, such as the CopyTable and ImportTsv utilities, support both types of tables without further configuration.
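For example, ImportTsv can be pointed directly at a MapR-DB table path. The sketch below is illustrative: the table path, input directory, and column mapping are placeholder values, not names from this document.

```shell
# Bulk-load TSV data into a MapR-DB table identified by its path.
# HBASE_ROW_KEY maps the first TSV column to the row key; cf:col1 is a
# hypothetical column family:qualifier for the second column.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 \
  /tables/mytable \
  /user/mapr/input
```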
If you have a custom MapReduce application that does not use HFileOutputFormat.configureIncrementalLoad(), simply use the path to the MapR-DB table that you want to load. However, using HFileOutputFormat.configureIncrementalLoad() gives you at least two advantages:
- This method performs a number of tasks that your application would otherwise need to do explicitly:
  - Inspects the table to configure a total order partitioner
  - Uploads the partitions file to the cluster and adds it to the DistributedCache
  - Sets the number of reduce tasks to match the current number of regions
  - Sets the output key/value class to match HFileOutputFormat's requirements
  - Sets up the reducer to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
- This method turns off Speculative Execution automatically. For details, see the note below.
Note: Speculative Execution of MapReduce tasks is on by default. For custom applications that load MapR-DB tables, it is recommended to turn Speculative Execution off. When it is on, the tasks that import data might run multiple times. Multiple tasks for an incremental bulkload could insert one or more versions of a record into a table. Multiple tasks for a full bulkload could cause loss of data if the source data continues to be updated during the load.
If your custom MapReduce job uses HFileOutputFormat.configureIncrementalLoad(), you do not have to turn off Speculative Execution manually; HFileOutputFormat.configureIncrementalLoad() turns it off automatically. Speculative Execution is also automatically turned off for MapReduce utilities such as CopyTable and ImportTsv.
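A minimal driver sketch illustrates where configureIncrementalLoad() fits. The class name, mapper name, and argument layout are hypothetical, and the sketch assumes a Hadoop 2 / HBase 0.9x-era API where HFileOutputFormat.configureIncrementalLoad(Job, HTable) is available:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "mapr-db-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);

        // Hypothetical mapper that emits an ImmutableBytesWritable row key
        // and a Put for each input record.
        job.setMapperClass(MyBulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // For a MapR-DB table, the table name can be a path, e.g. /tables/mytable.
        HTable table = new HTable(conf, args[1]);

        // Configures the total order partitioner, reducer count, output
        // key/value classes, and sort reducer, and turns off
        // Speculative Execution for the job.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This sketch requires a configured cluster and the HBase client libraries on the classpath; it is not runnable standalone.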
If you are writing a custom MapReduce job that does not use the HFileOutputFormat configureIncrementalLoad() method for bulk loading, you must turn off Speculative Execution manually. Turn off Speculative Execution by setting either of the following MapReduce parameters to false, depending on the version of MapReduce that you are using:
- MRv1: mapred.map.tasks.speculative.execution
- MRv2: mapreduce.map.speculative
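If the driver parses generic options (for example, via ToolRunner), the parameter can also be passed on the command line. The jar, class, and path names below are placeholders:

```shell
# MRv2: disable speculative execution of map tasks for this job run.
hadoop jar myjob.jar com.example.MyBulkLoadDriver \
  -Dmapreduce.map.speculative=false \
  /user/mapr/input /tables/mytable
```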
If the job is written programmatically, you can turn off Speculative Execution at the code level: job.setSpeculativeExecution(false);