Prevent Data Spills
Map tasks use memory mainly in two ways:
- The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs.
- The application consumes memory to run the map function.
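The split between the two can be sketched with a toy model. This is not Hadoop code: the serialization format and the buffer list below are simplified stand-ins for the framework's Writable serialization and intermediate buffer, chosen only to show where each byte of memory goes.

```python
# Toy model of the two memory consumers in a map task:
# (1) the framework's intermediate buffer of serialized (key, value) pairs,
# (2) the application's own working memory inside the map function.

def serialize(key, value):
    # Simplified serialization: length-prefixed UTF-8, standing in for
    # Hadoop's Writable serialization (an assumption for illustration).
    k, v = key.encode("utf-8"), str(value).encode("utf-8")
    return len(k).to_bytes(4, "big") + k + len(v).to_bytes(4, "big") + v

def map_word_count(record, buffer):
    # Application memory: the split list and loop variables live here.
    for word in record.split():
        # Framework memory: every emitted pair is serialized into the buffer.
        buffer.append(serialize(word, 1))

buffer = []
map_word_count("the quick brown fox the", buffer)
emitted_bytes = sum(len(b) for b in buffer)
print(len(buffer), emitted_bytes)  # prints 5 64
```

The total serialized size of the emitted pairs (emitted_bytes here) is the quantity that must fit in the intermediate buffer to avoid a spill.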
MapReduce framework memory is controlled by io.sort.mb. If io.sort.mb is smaller than the amount of data the mapper emits, the task spills data to disk. By default, io.sort.mb is set to 380MB for a 256MB chunk size. Set io.sort.mb to approximately 1.5 times the number of bytes emitted by the mapper. If io.sort.mb is too large, the task can run out of memory or waste allocated memory. If you cannot resolve memory problems by adjusting io.sort.mb, rewrite the application to use less memory in its map function.

If the output from the map tasks does not fit in reducer memory, the data spills to disk. The spilled data is merged, then merged again, until there is one output file for each reducer. The io.sort.factor parameter controls how many of these spill files are merged at once. If you have many reducers relative to the memory and disk on each node, lower this value.
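The 1.5x sizing rule above can be applied with a few lines of arithmetic. The helper name below is hypothetical; it simply converts an estimate of the mapper's serialized output into a candidate io.sort.mb value.

```python
def suggested_io_sort_mb(mapper_output_bytes):
    # Rule of thumb from the text: set io.sort.mb to roughly 1.5 times
    # the number of bytes the mapper emits, rounded up to whole megabytes.
    return -(-int(mapper_output_bytes * 1.5) // (1024 * 1024))  # ceiling division

# Example: a mapper expected to emit ~256MB of serialized (key, value) pairs.
print(suggested_io_sort_mb(256 * 1024 * 1024))  # prints 384
```

The resulting value is set through the cluster's mapred-site.xml or the job configuration, alongside io.sort.factor if you also need to tune the merge fan-in.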