Job Configuration
Set these values on the node from which you plan to submit jobs, before submitting the jobs. If you are using Hadoop examples, you can set these parameters from the command line. Example:
hadoop jar hadoop-examples.jar terasort -Dmapred.map.child.java.opts="-Xmx1000m"
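When several parameters have to be passed this way, it can help to build the argument list in a script. The following is a minimal sketch, not part of the Hadoop tooling: the helper name `build_hadoop_command` is hypothetical, and the only assumption about the CLI is the `-Dname=value` form shown in the example above.

```python
import shlex


def build_hadoop_command(jar, program, properties, program_args=()):
    """Return an argv list of the form: hadoop jar <jar> <program> -Dk=v ... <args>.

    Hypothetical helper for illustration; it only assembles strings and
    does not run anything.
    """
    cmd = ["hadoop", "jar", jar, program]
    for name, value in properties.items():
        # Each job parameter becomes one -Dname=value argument.
        cmd.append(f"-D{name}={value}")
    cmd.extend(program_args)
    return cmd


cmd = build_hadoop_command(
    "hadoop-examples.jar",
    "terasort",
    {"mapred.map.child.java.opts": "-Xmx1000m"},
)
# shlex.join quotes arguments where a shell would need it.
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one string) avoids quoting problems when a value such as a set of Java options contains spaces.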
When you submit a job, the JobClient creates job.xml by reading parameters from the following files in the following order:
- mapred-default.xml (MapReduce v1)
- The local mapred-site.xml, which overrides identical parameters in mapred-default.xml
- Any settings in the job code itself, which override identical parameters in mapred-site.xml
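The override order can be pictured as successive dictionary updates, where each later source replaces identical keys from the earlier ones. This is only a sketch of the precedence rule, not the JobClient's actual implementation; the function name and the sample values are hypothetical.

```python
def effective_config(defaults, site, job_code):
    """Merge three configuration sources; later sources win on identical keys."""
    merged = dict(defaults)   # mapred-default.xml
    merged.update(site)       # local mapred-site.xml overrides the defaults
    merged.update(job_code)   # settings in the job code override mapred-site.xml
    return merged


# Hypothetical sample values, chosen only to show which source wins.
defaults = {"mapred.reduce.tasks": "1", "io.sort.mb": "100", "io.sort.factor": "10"}
site = {"io.sort.mb": "200"}
job_code = {"mapred.reduce.tasks": "8"}

job_xml = effective_config(defaults, site, job_code)
for name in sorted(job_xml):
    print(name, "=", job_xml[name])
```

A parameter that appears only in the defaults (here `io.sort.factor`) passes through unchanged, which is why a value you did not set anywhere can still show up in job.xml.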
Parameter |
Description |
---|---|
keep.failed.task.files |
Specifies whether to keep the files for failed tasks. Use this setting only for jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed. Default value: |
mapred.job.reuse.jvm.num.tasks |
The number of tasks to run per JVM. If set to -1, there is no limit. Default value: |
mapred.job.impact.blacklisting |
Specifies whether failures for a job should count toward the number specified by the TaskTracker parameter mapred.max.tracker.blacklists. Default value: |
mapred.map.tasks.speculative.execution |
If true, then multiple instances of some map tasks may be executed in parallel. Default value: |
mapred.reduce.tasks.speculative.execution |
If true, then multiple instances of some reduce tasks may be executed in parallel. Default value: |
mapred.reduce.tasks |
The default number of reduce tasks per job. Typically set to 99% of the
cluster's reduce capacity, so that if a node fails the reduces can still be
executed in a single wave. Ignored when the value of the
Default value: |
mapred.job.map.memory.physical.mb |
Maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt is marked FAILED. |
mapred.job.reduce.memory.physical.mb |
Maximum physical memory limit for a reduce task of this job. If the limit is exceeded, the task attempt is marked FAILED. |
mapreduce.task.classpath.user.precedence |
Set to true if the user wants to set a different classpath. Default value: |
mapred.max.maps.per.node |
Per-node limit on running map tasks for the job. A value of -1 signifies no limit. Default value: |
mapred.max.reduces.per.node |
Per-node limit on running reduce tasks for the job. A value of -1 signifies no limit. Default value: |
mapred.running.map.limit |
Cluster-wide limit on running map tasks for the job. A value of -1 signifies no limit. Default value: |
mapred.running.reduce.limit |
Cluster-wide limit on running reduce tasks for the job. A value of -1 signifies no limit. Default value: |
mapreduce.tasktracker.cache.local.numberdirectories |
This property's value sets the maximum number of subdirectories to create in a given distributed cache store. Cache items in excess of this limit are expunged whether or not the total size threshold is exceeded. Default value: |
mapred.reduce.child.java.opts |
Java options for the reduce tasks. The MapR default heap size (-Xmx) is determined by the memory reserved for MapReduce at the TaskTracker. A reduce task is given more memory than a map task. Default memory for a reduce task = (Total memory reserved for MapReduce) * (2*#reduceslots / (#mapslots + 2*#reduceslots)) Default value:
|
mapred.reduce.child.ulimit |
|
io.sort.factor |
The number of streams to merge simultaneously during file sorting. The value of this property determines the number of open file handles. Default value: |
io.sort.mb |
This value sets the size, in megabytes, of the memory buffer that holds map outputs before writing the final map outputs. Lower values for this property increase the chance of spills. Recommended practice is to set this value to 1.5 times the average size of a map output. Default value: |
io.sort.record.percent |
The percentage of the memory buffer specified by the
Default value: |
io.sort.spill.percent |
This property's value sets the soft limit for either the buffer or record collection buffers. Threads that reach the soft limit begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. Do not reduce this value below 0.5. Default value: |
mapred.reduce.slowstart.completed.maps |
Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job. Default value: |
mapreduce.reduce.input.limit |
The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job fails. A value of -1 means that there is no limit set. Default value: |
mapred.reduce.parallel.copies |
The default number of parallel transfers run by reduce during the copy (shuffle) phase. Default value: |
jobclient.completion.poll.interval |
This property's value specifies the JobClient's polling frequency in milliseconds to the JobTracker for updates about job status. Reduce this value for faster tests on single node systems. Adjusting this value on production clusters may result in undesired client-server traffic. Default value: |
jobclient.output.filter |
This property's value specifies the filter that controls the output of the task's userlogs that are sent to the JobClient's console. Legal values are:
Default value: FAILED |
jobclient.progress.monitor.poll.interval |
This property's value specifies the JobClient's status reporting frequency in milliseconds to the console and checking for job completion. Default value: 1 |
job.end.notification.url |
This property's value specifies the URL to call at job completion to report
the job's end status. Only two variables are legal in the URL,
Default value: |
job.end.retry.attempts |
This property's value specifies the maximum number of times that Hadoop attempts to contact the notification URL. Default value: |
job.end.retry.interval |
This property's value specifies the interval in milliseconds between attempts to contact the notification URL. Default value: |
keep.failed.task.files |
Set this property's value to True to keep files for failed tasks. Because this storage is not automatically reclaimed by the system, keep files only for jobs that are failing. Setting this property's value to True also keeps map outputs in the reduce directory as the map outputs are consumed instead of deleting the map outputs on consumption. Default value: |
local.cache.size |
This property's value specifies the number of bytes allocated to each local TaskTracker directory to store Distributed Cache data. Default value: |
mapr.centrallog.dir |
This property's value specifies the relative path under a local volume path
that points to the central log location,
Default value: |
mapr.localvolumes.path |
The path for local volumes. Default value: |
map.sort.class |
The default sort class for sorting keys. Default value: |
tasktracker.http.threads |
The number of worker threads for the HTTP server. Default value: |
topology.node.switch.mapping.impl |
The default implementation of the DNSToSwitchMapping. It invokes a script
specified in the Default value: |
topology.script.number.args |
The max number of arguments that the script configured with the Default value: |
mapr.task.diagnostics.enabled |
Set this property's value to True to run the MapR diagnostics script before killing an unresponsive task attempt. Default value: |
mapred.acls.enabled |
This property's value specifies whether or not to check ACLs for user authorization during various queue and job level operations. Set this property's value to True to enable access control checks made by the JobTracker and TaskTracker when users request queue and job operations using Map/Reduce APIs, RPCs, the console, or the web user interfaces. Default value: |
mapred.child.oom_adj |
This property's value specifies the adjustment to the out-of-memory value for the Linux-specific out-of-memory killer. Legal values are 0-15. Default value: |
mapred.child.renice |
This property's value specifies an integer from 0 to 19 for use by the Linux nice utility. Default value: |
mapred.child.taskset |
Set this property's value to False to prevent running the job in a taskset.
See the manual page for Default value: |
mapred.child.tmp |
This property's value sets the location of the temporary directory for map
and reduce tasks. Set this value to an absolute path to directly assign the
directory. Relative paths are located under the task's working directory.
Java tasks execute with the option Default value: |
mapred.cluster.ephemeral.tasks.memory.limit.mb |
This property's value specifies the maximum size in megabytes for small jobs. This value is reserved in memory for an ephemeral slot. JobTracker and TaskTracker nodes must set this property to the same value. Default value: |
mapred.cluster.map.memory.mb |
This property's value sets the virtual memory size of a single map slot in
the Map-Reduce framework used by the scheduler. If the scheduler supports
this feature, a job can ask for multiple slots for a single map task via
|
mapred.cluster.max.map.memory.mb |
This property's value sets the virtual memory size of a single map task
launched by the Map-Reduce framework used by the scheduler. If the scheduler
supports this feature, a job can ask for multiple slots for a single map
task via |
mapred.cluster.max.reduce.memory.mb |
This property's value sets the virtual memory size of a single reduce task
launched by the Map-Reduce framework used by the scheduler. If the scheduler
supports this feature, a job can ask for multiple slots for a single reduce
task via |
mapred.cluster.reduce.memory.mb |
This property's value sets the virtual memory size of a single reduce slot
in the Map-Reduce framework used by the scheduler. If the scheduler supports
this feature, a job can ask for multiple slots for a single reduce task via
|
mapred.compress.map.output |
Set this property's value to True to compress map outputs with SequenceFile compression before sending the outputs over the network. Default value: |
mapred.fairscheduler.assignmultiple |
Set this property's value to False to prevent the FairScheduler from assigning multiple tasks. Default value: |
mapred.fairscheduler.eventlog.enabled |
Set this property's value to True to enable scheduler logging in ${HADOOP_LOG_DIR}/fairscheduler/. Default value: |
mapred.fairscheduler.smalljob.max.inputsize |
Specifies the maximum size, in bytes, that defines a small job. Default value: |
mapred.fairscheduler.smalljob.max.maps |
Specifies the maximum number of maps allowed in a small job. Default value: |
mapred.fairscheduler.smalljob.max.reducer.inputsize |
Specifies the maximum estimated input size, in bytes, for a reducer in a small job. Default value: |
mapred.fairscheduler.smalljob.max.reducers |
Specifies the maximum number of reducers allowed in a small job. Default value: |
mapred.healthChecker.interval |
Sets the frequency, in milliseconds, that the node health script runs. Default value: |
mapred.healthChecker.script.timeout |
Sets the time, in milliseconds, after which the node health script is killed for being unresponsive and reported as failed. Default value: |
mapred.inmem.merge.threshold |
When a number of files equal to this property's value accumulate, the in-memory merge triggers and spills to disk. Set this property's value to zero or less to force merges and spills to trigger solely on RAMFS memory consumption. Default value: |
mapred.job.map.memory.mb |
Sets the virtual memory size of a single map task for the job. If the
scheduler supports this feature, a job can ask for multiple slots for a
single map task via |
mapred.job.queue.name |
Specifies the queue a job is submitted to. This property's value must match
the name of a queue defined in Default value: |
mapred.job.reduce.input.buffer.percent |
Specifies the percentage of memory relative to the maximum heap size. After the shuffle, remaining map outputs in memory must occupy less memory than this threshold value before reduce begins. Default value: |
mapred.job.reduce.memory.mb |
Sets the virtual memory size of a single reduce task for the job. If the
scheduler supports this feature, a job can ask for multiple slots for a
single reduce task via Default value: |
mapred.job.reuse.jvm.num.tasks |
Sets the number of tasks to run on each JVM. The default of -1 sets no limit. |
mapred.job.shuffle.input.buffer.percent |
Sets the percentage of memory allocated from the maximum heap size to storing map outputs during the shuffle. Default value: |
mapred.job.shuffle.merge.percent |
Sets a percentage of the total memory allocated to storing map outputs in
Default value: |
mapred.job.tracker.handler.count |
Sets the number of server threads for the JobTracker. As a best practice, set this value to approximately 4% of the number of TaskTracker nodes. Default value: |
mapred.job.tracker.history.completed.location |
Sets a location to store completed job history files. When this property has no value specified, completed job files are stored at ${hadoop.job.history.location}/done in the local filesystem. Default value:
|
mapred.job.tracker.http.address |
Specifies the HTTP server address and port for the JobTracker. Specify 0 as the port to make the server start on a free port. Default value: |
mapred.jobtracker.instrumentation |
Expert: The instrumentation class to associate with each JobTracker. Default value:
|
mapred.jobtracker.job.history.block.size |
Sets the block size of the job history file. Dumping job history to disk is important because job recovery uses the job history. Default value: |
mapred.jobtracker.jobhistory.lru.cache.size |
Specifies the number of job history files to load in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU. Default value: |
mapred.job.tracker |
JobTracker address ip:port or use uri maprfs:/// for default cluster or Default value: |
mapred.jobtracker.maxtasks.per.job |
Set this property's value to any positive integer to set the maximum number of tasks for a single job. The default value of -1 indicates that there is no maximum. Default value: |
mapred.job.tracker.persist.jobstatus.active |
Set this property's value to True to enable persistence of job status information. Default value: |
mapred.job.tracker.persist.jobstatus.dir |
This property's value specifies the directory where job status information persists after dropping out of the memory queue and between JobTracker restarts. Default value:
|
mapred.job.tracker.persist.jobstatus.hours |
This property's value specifies job status information persistence time in hours. Persistent job status information is available after the information drops out of the memory queue and between JobTracker restarts. The default value of zero disables job status information persistence. Default value: |
mapred.jobtracker.port |
The IPC port on which the JobTracker listens. Default value: |
mapred.jobtracker.restart.recover |
Set this property's value to False to disable job recovery on restart. Default value: |
mapred.jobtracker.retiredjobs.cache.size |
This property's value specifies the number of retired job statuses kept in the cache. Default value: |
mapred.jobtracker.retirejob.check |
This property's value specifies the frequency interval used by the retire job thread to check for completed jobs. Default value: |
mapred.line.input.format.linespermap |
Number of lines per split in NLineInputFormat. Default value: |
mapred.local.dir.minspacekill |
This property's value specifies a threshold of free space in the directory
specified by the
Default value: |
mapred.local.dir.minspacestart |
This property's value specifies a free space threshold for the directory
specified by Default value: |
mapred.local.dir |
This property's value specifies the directory where MapReduce stores localized
job files. Localized job files are the job-related files downloaded by the
TaskTracker and include the job configuration, job JAR file, and files added
to the DistributedCache. Each task attempt has a dedicated subdirectory
under the Default value: |
mapred.map.child.java.opts |
This property stores Java options for map tasks. When present, the
For information about the memory allocated to map tasks, see Resource Allocation for Jobs and Applications. |
mapred.map.child.log.level |
This property's value sets the logging level for the map task. The allowed levels are:
Default value: |
mapred.map.max.attempts |
Expert: This property's value sets the maximum number of attempts per map task. Default value: |
mapred.map.output.compression.codec |
Specifies the compression codec to use to compress map outputs if compression of map outputs is enabled. Default value: |
mapred.maptask.memory.default |
When the value of the Default value: 800 |
mapred.map.tasks |
The default number of map tasks per job. Ignored when the value of the
Default value: |
mapred.maxthreads.generate.mapoutput |
Expert: Number of intra-map-task threads to sort and write the map output partitions. Default value: |
mapred.maxthreads.partition.closer |
Expert: Number of threads that asynchronously close or flush map output partitions. Default value: |
mapred.merge.recordsBeforeProgress |
The number of records to process during a merge before sending a progress notification to the TaskTracker. Default value: |
mapred.min.split.size |
The minimum size chunk that map input should be split into. File formats with minimum split sizes take priority over this setting. Default value: |
mapred.output.compress |
Set this property's value to True to compress job outputs. Default value: |
mapred.output.compression.codec |
When job output compression is enabled, this property's value specifies the compression codec. Default value: |
mapred.output.compression.type |
When job outputs are compressed as SequenceFiles, this property's value specifies how to compress the job outputs. Legal values are:
Default value: |
mapred.queue.default.state |
This property's value defines the state of the default queue, which can be either STOPPED or RUNNING. This value can be changed at runtime. Default value: |
mapred.queue.names |
This property's value specifies a comma-separated list of the queues
configured for this JobTracker. Jobs are added to queues and schedulers can
configure different scheduling properties for the various queues. To
configure a property for a queue, the name of the queue must match the name
specified in this value. Queue properties that are common to all schedulers
are configured here with the naming convention
Default value: |
mapred.reduce.child.log.level |
The logging level for the reduce task. The allowed levels are:
Default value: |
mapred.reduce.copy.backoff |
This property's value specifies the maximum amount of time in seconds a reducer spends on fetching one map output before declaring the fetch failed. Default value: |
mapred.reduce.max.attempts |
Expert: The maximum number of attempts per reduce task. Default value: |
mapred.reducetask.memory.default |
When the value of the Default value: |
mapred.skip.attempts.to.start.skipping |
This property's value specifies a number of task attempts. After that many task attempts, skip mode is active. While skip mode is active, the task reports the range of records which it will process next to the TaskTracker. With this record range, the TaskTracker is aware of which records are dubious and skips dubious records on further executions. Default value: |
mapred.skip.map.auto.incr.proc.count |
SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS increments after MapRunner invokes the map function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly. Default value: |
mapred.skip.map.max.skip.records |
The number of acceptable skip records around the bad record, per bad record
in the mapper. The number includes the bad record. The default value of 0
disables detection and skipping of bad records. The framework tries to
narrow down the skipped range by retrying until this threshold is met OR all
attempts get exhausted for this task. Set the value to
Default value: |
mapred.skip.reduce.auto.incr.proc.count |
SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS increments after the framework invokes the reduce function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly. Default value: |
mapred.skip.reduce.max.skip.groups |
The number of acceptable skip groups around the bad group, per bad group
in the reducer. The number includes the bad group. The default value of 0
disables detection and skipping of bad groups. The framework tries to
narrow down the skipped range by retrying until this threshold is met OR all
attempts get exhausted for this task. Set the value to
Default value: |
mapred.submit.replication |
This property's value specifies the replication level for submitted job files. As a best practice, set this value to approximately the square root of the number of nodes. Default value: |
mapred.task.cache.levels |
This property's value specifies the maximum level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level. Default value: |
mapred.task.calculate.resource.usage |
Set this property's value to False to prevent the use of the
Default value: |
mapred.task.profile |
Set this property's value to True to enable task profiling and the collection of profiler information by the system. Default value: |
mapred.task.profile.maps |
This property's value sets the ranges of map tasks to profile. This
property is ignored when the value of the Default value: |
mapred.task.profile.reduces |
This property's value sets the ranges of reduce tasks to profile. This
property is ignored when the value of the Default value: |
mapred.task.timeout |
This property's value specifies a time in milliseconds after which a task terminates if the task does not perform any of the following:
Default value: |
mapred.tasktracker.dns.interface |
This property's value specifies the name of the network interface that the TaskTracker reports its IP address from. Default value: default |
mapred.tasktracker.dns.nameserver |
This property's value specifies the host name or IP address of the name server (DNS) that the TaskTracker uses to determine the JobTracker's hostname. Default value: default |
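
Several descriptions in the table give sizing rules in words: the default reduce-task memory expression under mapred.reduce.child.java.opts, 99% of reduce capacity for mapred.reduce.tasks, 1.5 times the average map output for io.sort.mb, about 4% of TaskTracker nodes for mapred.job.tracker.handler.count, and about the square root of the node count for mapred.submit.replication. The sketch below turns those rules into arithmetic. The function names, the rounding choices, and the sample cluster numbers are assumptions made for illustration; the reduce-memory function evaluates the expression exactly as it is written in the table.

```python
import math


def reduce_memory_mb(total_mapreduce_mb, map_slots, reduce_slots):
    # (Total memory reserved for MapReduce) * (2*R / (M + 2*R)), as written in the table.
    return total_mapreduce_mb * (2 * reduce_slots) / (map_slots + 2 * reduce_slots)


def suggested_reduce_tasks(cluster_reduce_capacity):
    # 99% of the cluster's reduce capacity, rounded down (rounding is an assumption).
    return max(1, cluster_reduce_capacity * 99 // 100)


def suggested_io_sort_mb(avg_map_output_mb):
    # 1.5 times the average map output size, rounded up (rounding is an assumption).
    return math.ceil(1.5 * avg_map_output_mb)


def suggested_handler_count(tasktracker_nodes):
    # Approximately 4% of the number of TaskTracker nodes.
    return max(1, round(tasktracker_nodes * 4 / 100))


def suggested_submit_replication(nodes):
    # Approximately the square root of the number of nodes.
    return max(1, round(math.sqrt(nodes)))


# Hypothetical cluster: 8000 MB reserved for MapReduce, 4 map slots, 2 reduce slots,
# 200 reduce slots cluster-wide, 100 nodes, 120 MB average map output.
print("reduce memory (MB):", reduce_memory_mb(8000, 4, 2))
print("mapred.reduce.tasks:", suggested_reduce_tasks(200))
print("io.sort.mb:", suggested_io_sort_mb(120))
print("mapred.job.tracker.handler.count:", suggested_handler_count(100))
print("mapred.submit.replication:", suggested_submit_replication(100))
```

These numbers are starting points only; the table describes them as typical or best-practice values, not as limits enforced by the framework.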