Prevent Data Spills

Map tasks use memory mainly in two ways:

The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs.
The application consumes memory to run the map function.

The memory used by the MapReduce framework is controlled by io.sort.mb. If io.sort.mb is smaller than the amount of data emitted from the mapper, the task spills data to disk. By default, io.sort.mb is set to 380MB for a 256MB chunk size. Set io.sort.mb to approximately 1.5 times the number of data bytes emitted from the mapper. If io.sort.mb is too large, the task can run out of memory or waste allocated memory. If you cannot resolve memory problems by adjusting io.sort.mb, try rewriting the application to use less memory in its map function.

If the output from map tasks does not fit in reducer memory, the data spills to disk. The spilled data is merged, and then merged again, until there is one output file for each reducer. The io.sort.factor parameter controls how many merged files are read into memory at once. If you have many reducers relative to the memory and disk on each node, lower this value.
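The sizing rule and the repeated merging described above can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the MapReduce framework: the function names suggested_io_sort_mb and merge_passes are hypothetical, the mapper output size is assumed to be measured or estimated separately, and the merge model is a simplification (each round merges up to io.sort.factor files into one), so the real merge may schedule its rounds differently.

```python
import math

MB = 1024 * 1024


def suggested_io_sort_mb(map_output_bytes, headroom=1.5):
    """Return a candidate io.sort.mb value, in MB, for one map task.

    headroom=1.5 is the rule of thumb from the text. map_output_bytes is
    the number of bytes the mapper emits (an assumed, measured input).
    """
    return math.ceil(map_output_bytes * headroom / MB)


def merge_passes(num_files, sort_factor):
    """Count merge rounds in a simplified model.

    Each round merges up to sort_factor files into one file, and rounds
    repeat until a single file remains. This is an illustration only,
    not the framework's actual merge schedule.
    """
    if sort_factor < 2:
        raise ValueError("sort_factor must be at least 2")
    passes = 0
    while num_files > 1:
        num_files = math.ceil(num_files / sort_factor)
        passes += 1
    return passes


if __name__ == "__main__":
    # A mapper that emits 200 MB of data.
    print(suggested_io_sort_mb(200 * MB))
    # 100 spill files merged 10 at a time.
    print(merge_passes(100, 10))
```

For a mapper that emits about as much data as one 256MB chunk, the rule gives 384, which is close to the 380MB default mentioned above. In the simplified merge model, a lower io.sort.factor holds fewer files in memory at once but needs more rounds to reach a single output file.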