Chunk Size

Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important:

  • Smaller chunk sizes result in larger numbers of map tasks, which can lower performance due to task-scheduling overhead
  • Larger chunk sizes require more memory to sort the map task output, which can crash the JVM or add significant garbage-collection overhead

MapR can deliver a single stream at upwards of 300 MB per second, making it possible to use larger chunks than in stock Hadoop. Generally, it is wise to set the chunk size between 64 MB and 256 MB.
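
To make the scheduling tradeoff concrete, here is a rough back-of-the-envelope calculation, assuming one map task per chunk (the typical behavior for MapReduce input splits) and a hypothetical 10 GB input file:

# One map task per chunk: a 10 GB file at 64 MB chunks produces 160 map
# tasks, while the same file at 256 MB chunks produces only 40.
echo $(( 10 * 1024 / 64 ))     # 160
echo $(( 10 * 1024 / 256 ))    # 40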

Chunk size is set at the directory level. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, use the chunk size specified by the settings of the directory where the file is written. If you change a directory's chunk size settings after writing a file, the file keeps its old chunk size; further writes to the file use the file's existing chunk size.
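
Because chunk size is a directory attribute, you can check what a new file will inherit before writing it. A minimal sketch, assuming the example NFS mount used below; the exact set of attributes listed in .dfs_attributes may vary by MapR version:

# Read the directory's hidden attribute file over NFS (example mount path):
cat /mapr/my.cluster.com/projects/test/.dfs_attributes
# Expect a line such as ChunkSize=268435456 among the attributes.

# Or list files with MapR-specific attributes, including chunk size:
hadoop mfs -ls /projects/test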

Setting Chunk Size

You can set the chunk size for a given directory in two ways:

  • Change the ChunkSize attribute in the .dfs_attributes file at the top level of the directory
  • Use the command hadoop mfs -setchunksize <size> <directory>

For example, if the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test, you can set the chunk size to 268,435,456 bytes by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting ChunkSize=268435456. To accomplish the same thing from the Hadoop shell, use the following command:

hadoop mfs -setchunksize 268435456 /mapr/my.cluster.com/projects/test
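
Because a file keeps the chunk size that was in effect when it was written, a common pattern is to set a directory's chunk size before any data lands in it. A minimal sketch with example paths, using a 128 MB chunk size (134,217,728 bytes, which is valid because 134,217,728 = 2,048 × 65,536):

# Create the directory and set its chunk size before writing any files.
hadoop fs -mkdir /projects/test/bulk
hadoop mfs -setchunksize 134217728 /projects/test/bulk
# Files written into /projects/test/bulk from now on use 128 MB chunks;
# files written earlier keep their original chunk size.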