Job Configuration

Set these values on the node from which you plan to submit jobs, before submitting the jobs. If you are using Hadoop examples, you can set these parameters from the command line. Example:

hadoop jar hadoop-examples.jar terasort -Dmapred.map.child.java.opts="-Xmx1000m"

When you submit a job, the JobClient creates job.xml by reading parameters from the following files in the following order:

  1. mapred-default.xml (MapReduce v1)
  2. The local mapred-site.xml - overrides identical parameters in mapred-default.xml
  3. Any settings in the job code itself - overrides identical parameters in mapred-site.xml
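
The override order above can be sketched as a simple layered merge. This is an illustration only, not the JobClient implementation: the dictionaries stand in for the parsed contents of each source, and the site and job-code values are made up.

```python
def build_job_config(defaults, site, job_code):
    """Merge settings so that later sources override earlier ones."""
    merged = {}
    for source in (defaults, site, job_code):
        merged.update(source)
    return merged


# Stand-ins for mapred-default.xml, the local mapred-site.xml, and job code.
mapred_default = {"mapred.reduce.tasks": "-1", "io.sort.mb": "380"}
mapred_site = {"io.sort.mb": "512"}            # hypothetical site override
job_code = {"mapred.reduce.tasks": "20"}       # hypothetical in-code setting

job_xml = build_job_config(mapred_default, mapred_site, job_code)
print(job_xml)
```

In this sketch, io.sort.mb ends up with the site value and mapred.reduce.tasks with the value set in job code.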

Parameter

Description

keep.failed.task.files

Specifies whether to keep the files for failed tasks. Use this only on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed.

Default value: false

mapred.job.reuse.jvm.num.tasks

The number of tasks to run per JVM. If set to -1, there is no limit.

Default value: -1

mapred.job.impact.blacklisting

Specifies whether failures for a job should count toward the number specified by the TaskTracker parameter mapred.max.tracker.blacklists.

Default value: true

mapred.map.tasks.speculative.execution

If true, then multiple instances of some map tasks may be executed in parallel.

Default value: true

mapred.reduce.tasks.speculative.execution

If true, then multiple instances of some reduce tasks may be executed in parallel.

Default value: true

mapred.reduce.tasks

The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when the value of the mapred.job.tracker property is local.

Default value: -1
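
As a rough worked example of the 99% guideline for mapred.reduce.tasks, the helper below derives a value from the total number of reduce slots. The function name and the cluster size are hypothetical; treat it as a sketch of the arithmetic rather than a sizing rule.

```python
def suggested_reduce_tasks(total_reduce_slots, percent=99):
    """Return roughly `percent` percent of the cluster's reduce slots."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, total_reduce_slots * percent // 100)


# Hypothetical cluster: 20 nodes with 10 reduce slots each.
print(suggested_reduce_tasks(20 * 10))
```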

mapred.job.map.memory.physical.mb

Maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt is marked FAILED.

mapred.job.reduce.memory.physical.mb

Maximum physical memory limit for a reduce task of this job. If the limit is exceeded, the task attempt is marked FAILED.

mapreduce.task.classpath.user.precedence

Set to true if the user wants to set a different classpath.

Default value: false

mapred.max.maps.per.node

Per-node limit on running map tasks for the job. A value of -1 signifies no limit.

Default value: -1

mapred.max.reduces.per.node

Per-node limit on running reduce tasks for the job. A value of -1 signifies no limit.

Default value: -1

mapred.running.map.limit

Cluster-wide limit on running map tasks for the job. A value of -1 signifies no limit.

Default value: -1

mapred.running.reduce.limit

Cluster-wide limit on running reduce tasks for the job. A value of -1 signifies no limit.

Default value: -1

mapreduce.tasktracker.cache.local.numberdirectories

This property's value sets the maximum number of subdirectories to create in a given distributed cache store. Cache items in excess of this limit are expunged whether or not the total size threshold is exceeded.

Default value: 10000

mapred.reduce.child.java.opts

Java options for the reduce tasks. In MapR, the default heap size (-Xmx) is determined by the memory reserved for MapReduce at the TaskTracker. A reduce task is given more memory than a map task. Default memory for a reduce task = (Total memory reserved for MapReduce) * (2 * #reduceslots / (#mapslots + 2 * #reduceslots))

Default value: -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log
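
The formula quoted for mapred.reduce.child.java.opts can be evaluated as follows. The function name and the input figures (6000 MB reserved, 4 map slots, 2 reduce slots) are assumptions chosen for illustration; the code evaluates the expression exactly as it is written above.

```python
def default_reduce_memory_mb(reserved_mb, map_slots, reduce_slots):
    """Evaluate the documented formula for the default reduce memory."""
    weight = 2 * reduce_slots  # reduce slots count double in the formula
    return reserved_mb * weight / (map_slots + weight)


# Hypothetical node: 6000 MB reserved for MapReduce, 4 map and 2 reduce slots.
print(default_reduce_memory_mb(6000, 4, 2))
```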

mapred.reduce.child.ulimit

io.sort.factor

The number of streams to merge simultaneously during file sorting. The value of this property determines the number of open file handles.

Default value: 256

io.sort.mb

This value sets the size, in megabytes, of the memory buffer that holds map outputs before writing the final map outputs. Lower values for this property increase the chance of spills. Recommended practice is to set this value to 1.5 times the average size of a map output.

Default value: 380

io.sort.record.percent

The percentage of the memory buffer specified by the io.sort.mb property that is dedicated to tracking record boundaries. The maximum number of records that the collection thread can collect before blocking is one-fourth of (io.sort.mb) x (io.sort.record.percent).

Default value: 0.17

io.sort.spill.percent

This property's value sets the soft limit for either the buffer or record collection buffers. Threads that reach the soft limit begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. Do not reduce this value below 0.5.

Default value: 0.99

mapred.reduce.slowstart.completed.maps

Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.

Default value: 0.95

mapreduce.reduce.input.limit

The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job fails. A value of -1 means that no limit is set.

Default value: -1

mapred.reduce.parallel.copies

The default number of parallel transfers run by reduce during the copy (shuffle) phase.

Default value: 12

jobclient.completion.poll.interval

This property's value specifies the JobClient's polling frequency in milliseconds to the JobTracker for updates about job status. Reduce this value for faster tests on single node systems. Adjusting this value on production clusters may result in undesired client-server traffic.

Default value: 5000

jobclient.output.filter

This property's value specifies the filter that controls the output of the task's userlogs that are sent to the JobClient's console. Legal values are:

  • NONE
  • KILLED
  • FAILED
  • SUCCEEDED
  • ALL

Default value: FAILED

jobclient.progress.monitor.poll.interval

This property's value specifies the JobClient's status reporting frequency in milliseconds to the console and checking for job completion.

Default value: 1000

job.end.notification.url

This property's value specifies the URL to call at job completion to report the job's end status. Only two variables are legal in the URL, $jobId and $jobStatus. When present, these variables are replaced by their respective values.

Default value: http://localhost:8080/jobstatus.php?jobId=$jobId&jobStatus=$jobStatus
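
The variable replacement described for job.end.notification.url can be pictured with plain string substitution. This is a sketch, not the framework's code: the function name, the job ID, and the status string are hypothetical.

```python
def expand_notification_url(template, job_id, job_status):
    """Replace the two legal variables in a notification URL template."""
    return template.replace("$jobId", job_id).replace("$jobStatus", job_status)


template = "http://localhost:8080/jobstatus.php?jobId=$jobId&jobStatus=$jobStatus"
# Hypothetical job ID and status, for illustration only.
url = expand_notification_url(template, "job_201301010000_0001", "SUCCEEDED")
print(url)
```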

job.end.retry.attempts

This property's value specifies the maximum number of times that Hadoop attempts to contact the notification URL.

Default value: 0

job.end.retry.interval

This property's value specifies the interval in milliseconds between attempts to contact the notification URL.

Default value: 30000

keep.failed.task.files

Set this property's value to True to keep files for failed tasks. Because this storage is not automatically reclaimed by the system, keep files only for jobs that are failing. Setting this property's value to True also keeps map outputs in the reduce directory as the map outputs are consumed instead of deleting the map outputs on consumption.

Default value: False

local.cache.size

This property's value specifies the number of bytes allocated to each local TaskTracker directory to store Distributed Cache data.

Default value: 10737418240

mapr.centrallog.dir

This property's value specifies the relative path under a local volume path that points to the central log location, ${mapr.localvolumes.path}/<hostname>/${mapr.centrallog.dir}.

Default value: logs

mapr.localvolumes.path

The path for local volumes.

Default value: /var/mapr/local
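
With the defaults for mapr.localvolumes.path and mapr.centrallog.dir, the central log location pattern ${mapr.localvolumes.path}/<hostname>/${mapr.centrallog.dir} resolves as sketched below. The helper name and the hostname are hypothetical.

```python
import posixpath


def central_log_path(hostname,
                     local_volumes_path="/var/mapr/local",
                     central_log_dir="logs"):
    """Join the local volume path, the hostname, and the central log dir."""
    return posixpath.join(local_volumes_path, hostname, central_log_dir)


# Hypothetical hostname, for illustration only.
print(central_log_path("node1.example.com"))
```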

map.sort.class

The default sort class for sorting keys.

Default value: org.apache.hadoop.util.QuickSort

tasktracker.http.threads

The number of worker threads for the HTTP server.

Default value: 2

topology.node.switch.mapping.impl

The default implementation of the DNSToSwitchMapping. It invokes a script specified in the topology.script.file.name property to resolve node names. If no value is set for the topology.script.file.name property, the default value of DEFAULT_RACK is returned for all node names.

Default value: org.apache.hadoop.net.ScriptBasedMapping

topology.script.number.args

The max number of arguments that the script configured with the topology.script.file.name runs with. Each argument is an IP address.

Default value: 100

mapr.task.diagnostics.enabled

Set this property's value to True to run the MapR diagnostics script before killing an unresponsive task attempt.

Default value: False

mapred.acls.enabled

This property's value specifies whether or not to check ACLs for user authorization during various queue and job level operations. Set this property's value to True to enable access control checks made by the JobTracker and TaskTracker when users request queue and job operations using Map/Reduce APIs, RPCs, the console, or the web user interfaces.

Default value: False

mapred.child.oom_adj

This property's value specifies the adjustment to the out-of-memory value for the Linux-specific out-of-memory killer. Legal values are 0-15.

Default value: 10

mapred.child.renice

This property's value specifies an integer from 0 to 19 for use by the Linux nice utility.

Default value: 10

mapred.child.taskset

Set this property's value to False to prevent running the job in a taskset. See the manual page for taskset(1) for more information.

Default value: True

mapred.child.tmp

This property's value sets the location of the temporary directory for map and reduce tasks. Set this value to an absolute path to directly assign the directory. Relative paths are located under the task's working directory. Java tasks execute with the option -Djava.io.tmpdir=<absolute path of the tmp dir>. Pipes and streaming are set with the environment variable TMPDIR=<absolute path of the tmp dir>.

Default value: ./tmp

mapred.cluster.ephemeral.tasks.memory.limit.mb

This property's value specifies the maximum size in megabytes for small jobs. This value is reserved in memory for an ephemeral slot. JobTracker and TaskTracker nodes must set this property to the same value.

Default value: 200

mapred.cluster.map.memory.mb

This property's value sets the virtual memory size of a single map slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.

mapred.cluster.max.map.memory.mb

This property's value sets the maximum virtual memory size of a single map task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.

mapred.cluster.max.reduce.memory.mb

This property's value sets the maximum virtual memory size of a single reduce task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.

mapred.cluster.reduce.memory.mb

This property's value sets the virtual memory size of a single reduce slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.

mapred.compress.map.output

Set this property's value to True to compress map outputs with SequenceFile compression before sending the outputs over the network.

Default value: False

mapred.fairscheduler.assignmultiple

Set this property's value to False to prevent the FairScheduler from assigning multiple tasks.

Default value: True

mapred.fairscheduler.eventlog.enabled

Set this property's value to True to enable scheduler logging in ${HADOOP_LOG_DIR}/fairscheduler/.

Default value: False

mapred.fairscheduler.smalljob.max.inputsize

Specifies the maximum size, in bytes, that defines a small job.

Default value: 10737418240

mapred.fairscheduler.smalljob.max.maps

Specifies the maximum number of maps allowed in a small job.

Default value: 10

mapred.fairscheduler.smalljob.max.reducer.inputsize

Specifies the maximum estimated input size, in bytes, for a reducer in a small job.

Default value: 10737418240

mapred.fairscheduler.smalljob.max.reducers

Specifies the maximum number of reducers allowed in a small job.

Default value: 10

mapred.healthChecker.interval

Sets the frequency, in milliseconds, that the node health script runs.

Default value: 60000

mapred.healthChecker.script.timeout

Sets the time, in milliseconds, after which the node health script is killed for being unresponsive and reported as failed.

Default value: 600000

mapred.inmem.merge.threshold

When a number of files equal to this property's value accumulate, the in-memory merge triggers and spills to disk. Set this property's value to zero or less to force merges and spills to trigger solely on RAMFS memory consumption.

Default value: 1000

mapred.job.map.memory.mb

Sets the virtual memory size of a single map task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.cluster.map.memory.mb, to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature if the value of the mapred.cluster.map.memory.mb property is also -1. Set this value to a useful memory size to enable the feature.
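
One way to picture how the per-job and per-cluster memory settings interact is the sketch below. It assumes that the number of slots is the task's requested memory divided by the slot size, rounded up, and that a value of -1 for either setting means a task occupies a single slot. Both points are assumptions made for illustration and depend on the scheduler in use; the function name is hypothetical.

```python
import math


def slots_for_map_task(job_map_memory_mb, cluster_map_memory_mb,
                       cluster_max_map_memory_mb):
    """Estimate how many map slots one task occupies (assumed rounding up)."""
    if job_map_memory_mb == -1 or cluster_map_memory_mb == -1:
        return 1  # assumption: memory-based slot accounting is disabled
    if job_map_memory_mb > cluster_max_map_memory_mb:
        raise ValueError(
            "requested memory exceeds mapred.cluster.max.map.memory.mb")
    return math.ceil(job_map_memory_mb / cluster_map_memory_mb)


# Hypothetical settings: 3000 MB per task, 1024 MB slots, 4096 MB maximum.
print(slots_for_map_task(3000, 1024, 4096))
```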

mapred.job.queue.name

Specifies the queue a job is submitted to. This property's value must match the name of a queue defined in mapred.queue.names for the system. The ACL setup for the queue must allow the current user to submit a job to the queue.

Default value: default

mapred.job.reduce.input.buffer.percent

Specifies the percentage of memory relative to the maximum heap size. After the shuffle, remaining map outputs in memory must occupy less memory than this threshold value before reduce begins.

Default value: 0

mapred.job.reduce.memory.mb

Sets the virtual memory size of a single reduce task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.cluster.reduce.memory.mb, to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature if the value of the mapred.cluster.reduce.memory.mb property is also -1. Set this value to a useful memory size to enable the feature.

Default value: -1

mapred.job.reuse.jvm.num.tasks

Sets the number of tasks to run on each JVM. The default of -1 sets no limit.

mapred.job.shuffle.input.buffer.percent

Sets the percentage of memory allocated from the maximum heap size to storing map outputs during the shuffle.

Default value: 0.7

mapred.job.shuffle.merge.percent

Sets a percentage of the total memory allocated to storing map outputs in mapred.job.shuffle.input.buffer.percent. When memory storage for map outputs reaches this percentage, an in-memory merge triggers.

Default value: 0.66
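
To make the relationship between mapred.job.shuffle.input.buffer.percent and mapred.job.shuffle.merge.percent concrete, the sketch below computes both thresholds for a given maximum heap size. The function name and the 1000 MB heap are assumptions for illustration.

```python
def shuffle_thresholds_mb(max_heap_mb, input_buffer_percent=0.7,
                          merge_percent=0.66):
    """Return (shuffle buffer size, in-memory merge trigger) in megabytes."""
    buffer_mb = max_heap_mb * input_buffer_percent
    merge_trigger_mb = buffer_mb * merge_percent
    return buffer_mb, merge_trigger_mb


# Hypothetical reduce task with a 1000 MB maximum heap and default settings.
buffer_mb, merge_trigger_mb = shuffle_thresholds_mb(1000)
print(round(buffer_mb), round(merge_trigger_mb))
```

With these inputs, about 700 MB of heap holds map outputs during the shuffle, and an in-memory merge triggers once roughly 462 MB of that is in use.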

mapred.job.tracker.handler.count

Sets the number of server threads for the JobTracker. As a best practice, set this value to approximately 4% of the number of TaskTracker nodes.

Default value: 10

mapred.job.tracker.history.completed.location

Sets a location to store completed job history files. When this property has no value specified, completed job files are stored at ${hadoop.job.history.location}/done in the local filesystem.

Default value: /var/mapr/cluster/mapred/jobTracker/history/done

mapred.job.tracker.http.address

Specifies the HTTP server address and port for the JobTracker. Specify 0 as the port to make the server start on a free port.

Default value: 0.0.0.0:50030

mapred.jobtracker.instrumentation

Expert: The instrumentation class to associate with each JobTracker.

Default value: org.apache.hadoop.mapred.JobTrackerMetricsInst

mapred.jobtracker.job.history.block.size

Sets the block size of the job history file. Dumping job history to disk is important because job recovery uses the job history.

Default value: 3145728

mapred.jobtracker.jobhistory.lru.cache.size

Specifies the number of job history files to load in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.

Default value: 5

mapred.job.tracker

The JobTracker address, given as ip:port. Alternatively, use the URI maprfs:/// for the default cluster, or maprfs:///mapr/san_jose_cluster1 to connect to the cluster named san_jose_cluster1. Use "local" for standalone mode.

Default value: maprfs:///

mapred.jobtracker.maxtasks.per.job

Set this property's value to any positive integer to set the maximum number of tasks for a single job. The default value of -1 indicates that there is no maximum.

Default value: -1

mapred.job.tracker.persist.jobstatus.active

Set this property's value to True to enable persistence of job status information.

Default value: False

mapred.job.tracker.persist.jobstatus.dir

This property's value specifies the directory where job status information persists after dropping out of the memory queue between JobTracker restarts.

Default value: /var/mapr/cluster/mapred/jobTracker/jobsInfo

mapred.job.tracker.persist.jobstatus.hours

This property's value specifies job status information persistence time in hours. Persistent job status information is available after the information drops out of the memory queue and between JobTracker restarts. The default value of zero disables job status information persistence.

Default value: 0

mapred.jobtracker.port

The IPC port on which the JobTracker listens.

Default value: 9001

mapred.jobtracker.restart.recover

Set this property's value to False to disable job recovery on restart.

Default value: True

mapred.jobtracker.retiredjobs.cache.size

This property's value specifies the number of retired job statuses kept in the cache.

Default value: 1000

mapred.jobtracker.retirejob.check

This property's value specifies the frequency interval used by the retire job thread to check for completed jobs.

Default value: 30000

mapred.line.input.format.linespermap

Number of lines per split in NLineInputFormat.

Default value: 1

mapred.local.dir.minspacekill

This property's value specifies a threshold of free space in the directory specified by the mapred.local.dir property. When free space drops below this threshold, no more tasks are requested until all current tasks finish and clean up. When free space is below this threshold, running tasks are killed in the following order until free space is above the threshold:

  • Reduce tasks
  • All other tasks in reverse percent-completed order.

Default value: 0

mapred.local.dir.minspacestart

This property's value specifies a free space threshold for the directory specified by mapred.local.dir. No tasks are requested while free space is below this threshold.

Default value: 0

mapred.local.dir

This property's value specifies the directory where MapReduce stores localized job files. Localized job files are the job-related files downloaded by the TaskTracker and include the job configuration, job JAR file, and files added to the DistributedCache. Each task attempt has a dedicated subdirectory under the mapred.local.dir directory. Shared files are symbolically linked to those subdirectories.

Default value: /tmp/mapr-hadoop/mapred/local

mapred.map.child.java.opts

This property stores Java options for map tasks. When present, the @taskid@ symbol is replaced with the current TaskID. As an example, to enable verbose garbage collection logging to a file named for the taskid in /tmp and to set the heap maximum to 1GB, set this property to the value -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc. The configuration variable mapred.{map/reduce}.child.ulimit controls the maximum virtual memory of the child processes. In the MapR distribution for Hadoop, the default -Xmx is determined by the memory reserved for MapReduce by the TaskTracker. Reduce tasks use more memory than map tasks.

For information about the memory allocated to map tasks, see Resource Allocation for Jobs and Applications.
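
The @taskid@ replacement described for mapred.map.child.java.opts amounts to a string substitution, sketched below. The function name and the task ID are hypothetical, and the framework's own substitution code may differ.

```python
def expand_child_java_opts(opts, task_id):
    """Replace every @taskid@ symbol with the given task ID."""
    return opts.replace("@taskid@", task_id)


opts = "-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc"
# Hypothetical task ID, for illustration only.
expanded = expand_child_java_opts(opts, "attempt_201301010000_0001_m_000000_0")
print(expanded)
```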

mapred.map.child.log.level

This property's value sets the logging level for the map task. The allowed levels are:

  • OFF
  • FATAL
  • ERROR
  • WARN
  • INFO
  • DEBUG
  • TRACE
  • ALL

Default value: INFO

mapred.map.max.attempts

Expert: This property's value sets the maximum number of attempts per map task.

Default value: 4

mapred.map.output.compression.codec

Specifies the compression codec to use to compress map outputs if compression of map outputs is enabled.

Default value: org.apache.hadoop.io.compress.DefaultCodec

mapred.maptask.memory.default

When the value of the mapred.tasktracker.map.tasks.maximum parameter is -1, this parameter specifies a size in MB that is used to determine the default total number of map task slots on this node.

Default value: 800

mapred.map.tasks

The default number of map tasks per job. Ignored when the value of the mapred.job.tracker property is local.

Default value: 2

mapred.maxthreads.generate.mapoutput

Expert: Number of intra-map-task threads to sort and write the map output partitions.

Default value: 1

mapred.maxthreads.partition.closer

Expert: Number of threads that asynchronously close or flush map output partitions.

Default value: 1

mapred.merge.recordsBeforeProgress

The number of records to process during a merge before sending a progress notification to the TaskTracker.

Default value: 10000

mapred.min.split.size

The minimum size chunk that map input should be split into. File formats with minimum split sizes take priority over this setting.

Default value: 0

mapred.output.compress

Set this property's value to True to compress job outputs.

Default value: False

mapred.output.compression.codec

When job output compression is enabled, this property's value specifies the compression codec.

Default value: org.apache.hadoop.io.compress.DefaultCodec

mapred.output.compression.type

When job outputs are compressed as SequenceFiles, this property's value specifies how to compress the job outputs. Legal values are:

  • NONE
  • RECORD
  • BLOCK

Default value: RECORD

mapred.queue.default.state

This property's value defines the state of the default queue, which can be either STOPPED or RUNNING. This value can be changed at runtime.

Default value: RUNNING

mapred.queue.names

This property's value specifies a comma-separated list of the queues configured for this JobTracker. Jobs are added to queues and schedulers can configure different scheduling properties for the various queues. To configure a property for a queue, the name of the queue must match the name specified in this value. Queue properties that are common to all schedulers are configured here with the naming convention mapred.queue.$QUEUE-NAME.$PROPERTY-NAME. The number of queues configured in this parameter can depend on the type of scheduler being used, as specified in mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler supports only a single queue, which is the default configured here. Verify that the scheduler supports multiple queues before adding queues.

Default value: default
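
The naming convention mapred.queue.$QUEUE-NAME.$PROPERTY-NAME can be illustrated with two small helpers. The helper names, the second queue name, and the property name used here are examples only, not a list of supported settings.

```python
def parse_queue_names(value):
    """Split a comma-separated mapred.queue.names value into queue names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def queue_property(queue_name, property_name):
    """Build a per-queue property key following the naming convention."""
    return "mapred.queue.{}.{}".format(queue_name, property_name)


# Hypothetical queue list and property name, for illustration only.
for queue in parse_queue_names("default, research"):
    print(queue_property(queue, "acl-submit-job"))
```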

mapred.reduce.child.log.level

The logging level for the reduce task. The allowed levels are:

  • OFF
  • FATAL
  • ERROR
  • WARN
  • INFO
  • DEBUG
  • TRACE
  • ALL

Default value: INFO

mapred.reduce.copy.backoff

This property's value specifies the maximum amount of time in seconds a reducer spends on fetching one map output before declaring the fetch failed.

Default value: 300

mapred.reduce.max.attempts

Expert: The maximum number of attempts per reduce task.

Default value: 4

mapred.reducetask.memory.default

When the value of the mapred.tasktracker.reduce.tasks.maximum parameter is -1, this parameter specifies a size in MB that is used to determine the default total number of reduce task slots on this node.

Default value: 1500

mapred.skip.attempts.to.start.skipping

This property's value specifies a number of task attempts. After that many task attempts, skip mode is active. While skip mode is active, the task reports the range of records which it will process next to the TaskTracker. With this record range, the TaskTracker is aware of which records are dubious and skips dubious records on further executions.

Default value: 2

mapred.skip.map.auto.incr.proc.count

SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS increments after MapRunner invokes the map function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly.

Default value: True

mapred.skip.map.max.skip.records

The number of acceptable skip records around the bad record, per bad record in the mapper. The number includes the bad record. The default value of 0 disables detection and skipping of bad records. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to prevent the framework from narrowing down the skipped range.

Default value: 0

mapred.skip.reduce.auto.incr.proc.count

SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS increments after the framework invokes the reduce function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly.

Default value: True

mapred.skip.reduce.max.skip.groups

The number of acceptable skip groups around the bad group, per bad group in the reducer. The number includes the bad group. The default value of 0 disables detection and skipping of bad groups. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to prevent the framework from narrowing down the skipped range.

Default value: 0

mapred.submit.replication

This property's value specifies the replication level for submitted job files. As a best practice, set this value to approximately the square root of the number of nodes.

Default value: 10
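
As a quick worked example of the square-root guideline for mapred.submit.replication, the helper below rounds the square root of the node count to the nearest integer. The function name and the rounding choice are assumptions; the guideline itself only says "approximately".

```python
import math


def suggested_submit_replication(node_count):
    """Return approximately the square root of the number of nodes."""
    return max(1, round(math.sqrt(node_count)))


# Hypothetical cluster sizes.
for nodes in (16, 50, 100):
    print(nodes, suggested_submit_replication(nodes))
```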

mapred.task.cache.levels

This property's value specifies the maximum level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level.

Default value: 2

mapred.task.calculate.resource.usage

Set this property's value to False to prevent the use of the ${mapreduce.tasktracker.resourcecalculatorplugin} parameter.

Default value: True

mapred.task.profile

Set this property's value to True to enable task profiling and the collection of profiler information by the system.

Default value: False

mapred.task.profile.maps

This property's value sets the ranges of map tasks to profile. This property is ignored when the value of the mapred.task.profile property is set to False.

Default value: 0-2

mapred.task.profile.reduces

This property's value sets the ranges of reduce tasks to profile. This property is ignored when the value of the mapred.task.profile property is set to False.

Default value: 0-2
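
A range value such as 0-2 for mapred.task.profile.maps or mapred.task.profile.reduces can be read as a set of task indexes. The parser below is a sketch under two assumptions that the text above does not state: both ends of a range are inclusive, and several ranges or single indexes may be separated by commas. The function name is hypothetical.

```python
def parse_ranges(value):
    """Expand a string such as "0-2" into a sorted list of task indexes."""
    selected = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Assumption: the end of the range is inclusive.
            selected.update(range(int(start), int(end) + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


print(parse_ranges("0-2"))
```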

mapred.task.timeout

This property's value specifies a time in milliseconds after which a task terminates if the task does not perform any of the following:

  • reads an input
  • writes an output
  • updates its status string

Default value: 600000

mapred.tasktracker.dns.interface

This property's value specifies the name of the network interface that the TaskTracker reports its IP address from.

Default value: default

mapred.tasktracker.dns.nameserver

This property's value specifies the host name or IP address of the name server (DNS) that the TaskTracker uses to determine the JobTracker's hostname.

Default value: default