Balance the Load Across the Cluster
You can use the following methods to balance the load across the cluster:
Tune the Split Size to Maximize Slot Availability
If the input data for the job is spread out evenly in the cluster, it improves MapReduce
parallelism as more mappers can be scheduled to work on local data. The first task of any
job is a single task called setup
. The setup task examines the job input
data to determine how many splits to use. For each split, the setup task finds the
locations of the data to determine where to run the map task (one map task for each split).
To balance the load evenly across the cluster, pick a split size that will give you a
number of splits that fill at least a majority of the slots available to you, keeping in
mind other jobs that may be running at the same time. If the input data for the job is
spread out evenly in the cluster, it improves MapReduce parallelism as more mappers can be
scheduled to work on local data.
Use Partition Lists
If your data is not distributed evenly throughout key ranges, you can create a list of partition keys (partition.lst) instead. To build this list, run a small MapReduce job that samples a small percentage of your data and divides it into even key ranges. The mappers use the partition list to determine the splits before sorting.