hadoop distcp
The hadoop distcp
command is a tool used for large
inter- and intra-cluster copying. It uses MapReduce to effect its
distribution, error handling and recovery, and reporting. It
expands a list of files and directories into input to map tasks,
each of which will copy a partition of the files specified in the
source list.
Syntax
hadoop [ Generic Options ] distcp
[-p [rbugp] ]
[-i ]
[-log ]
[-m ]
[-overwrite ]
[-update ]
[-f <URI list> ]
[-filelimit <n> ]
[-sizelimit <n> ]
[-delete ]
<source>
<destination>
Parameters
Command Options
The following command options are supported for the hadoop
distcp
command:
Parameter |
Description |
---|---|
|
Specify the source URL. |
|
Specify the destination URL. |
|
Preserve
|
|
Ignore failures. As explained in the below, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted. |
|
Write logs to |
|
Maximum number of simultaneous copies. Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. See Map Sizing. |
|
Overwrite destination. If a map fails and |
|
Overwrite if |
|
Use list at <URI list> as source list. This is equivalent to listing each source on the command line. The value of <URI list> must be a fully qualified URI. |
|
Limit the total number of files to be <= n. See Symbolic Representations. |
|
Limit the total size to be <= n bytes. See Symbolic Representations. |
|
Delete the files existing in the
|
Generic Options
The hadoop distcp
command supports the following
generic options: -conf <configuration file>
,
-D <property=value>
, -fs <local|file
system URI>
, -jt
<local|jobtracker:port>
, -files
<file1,file2,file3,...>
, -libjars
<libjar1,libjar2,libjar3,...>
, and -archives
<archive1,archive2,archive3,...>
.
For more information on generic options, see Generic
Options.
Symbolic Representations
The parameter <n>
in -filelimit
and -sizelimit
can be specified with symbolic
representation. For example,
- 1230k = 1230 * 1024 = 1259520
- 891g = 891 * 1024^3 = 956703965184
Map Sizing
The hadoop distcp
command attempts to size each map
comparably so that each copies roughly the same number of bytes.
Note that files are the finest level of granularity, so increasing
the number of simultaneous copiers (i.e. maps) may not always
increase the number of simultaneous copies nor the overall
throughput.
If -m
is not specified, distcp
will
attempt to schedule work for min (total_bytes /
bytes.per.map, 20 * num_task_trackers)
where
bytes.per.map
defaults to 256MB.
Tuning the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth is recommended for long-running and regularly run jobs.
Examples
Basic inter-cluster copying
The hadoop
distcp
commmand is most often used to copy files between clusters:
hadoop distcp maprfs:///mapr/cluster1/foo \
maprfs:///mapr/cluster2/bar
The command in the example expands the namespace under /foo/bar
on
cluster1 into a temporary file, partitions its contents among a set of map tasks, and
starts a copy on each TaskTracker from cluster1 to cluster2. Note that the
hadoop distcp
command expects absolute paths.
Only those files that do not already exist in the destination are copied over from the source directory.
Updating files between clusters
Use the hadoop
distcp -update
command to synchronize changes between clusters.
$ hadoop distcp -update maprfs:///mapr/cluster1/foo maprfs:///mapr/cluster2/bar/foo
Files in the /foo
subtree are copied from cluster1 to cluster2 only if
the size of the source file is different from that of the size of the destination file.
Otherwise, the files are skipped over.
Note that using the -update
option changes distributed copy interprets
the source and destination paths making it necessary to add the trailing
/foo
subdirectory in the second cluster.
Overwriting files between clusters
By default, distributed copy skips files that already exist in
the destination directory, but you can overwrite those files using the
-overwrite
option. In this example, multiple source directories are
specified:
$ hadoop distcp -overwrite maprfs:///mapr/cluster1/foo/a \
maprfs:///mapr/cluster1/foo/b \
maprfs:///mapr/cluster2/bar
As with using the -update
option, using the -overwrite
changes the way that the source and destination paths are interpreted by distributed
copy: the contents of the source directories are compared to the contents of the
destination directory. The distributed copy aborts in case of a
conflict.
Migrating Data from HDFS to MapR-FS
The hadoop distcp
command can be used to migrate data from an HDFS
cluster to a MapR-FS where the HDFS cluster uses the same version of the RPC protocol as
that used by MapR. For a discussion, see Copying Data from Apache Hadoop.
$ hadoop distcp namenode1:50070/foo maprfs:///bar
You must specify the IP address and HTTP port (usually 50070) for the namenode on the HDFS cluster.