[hadoop] How to copy data from one HDFS to another HDFS?

I have two HDFS setups and want to copy (not migrate or move) some tables from HDFS1 to HDFS2. How can I copy data from one HDFS to another? Is it possible with Sqoop or another command-line tool?

This question is related to hadoop hdfs bigdata sqoop

The answer is


It's also useful to note that you can run the underlying MapReduce job of distcp on either the source or the target cluster by pointing hadoop at that cluster's configuration directory, like so:

hadoop --config /path/to/hadoop/config distcp <src> <dst>
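
As a rough sketch of the idea above, the helper below only assembles and prints the command for whichever cluster should run the job, so it can be reviewed before it is executed. The function name distcp_on, the variables SRC_CONF_DIR and DST_CONF_DIR, and their default paths are hypothetical; only the `hadoop --config <dir> distcp <src> <dst>` form comes from the answer.

```shell
# Hypothetical config directories, one per cluster (override via environment).
SRC_CONF_DIR=${SRC_CONF_DIR:-/etc/hadoop/conf.source}
DST_CONF_DIR=${DST_CONF_DIR:-/etc/hadoop/conf.target}

# Print (do not run) the distcp command for the chosen cluster.
# Usage: distcp_on source|target <src> <dst>
distcp_on() {
    where=$1
    src=$2
    dst=$3
    case $where in
        source) conf=$SRC_CONF_DIR ;;
        target) conf=$DST_CONF_DIR ;;
        *) echo "first argument must be 'source' or 'target'" >&2; return 1 ;;
    esac
    printf 'hadoop --config %s distcp %s %s\n' "$conf" "$src" "$dst"
}

# Example: build the command that would run the job on the target cluster.
distcp_on target hdfs://nn1:8020/source hdfs://nn2:8020/dest
```

Printing instead of executing keeps the sketch safe to try on a machine without Hadoop; to actually submit the job, the printed line would be run as-is.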

Try dtIngest, which is built on top of the Apache Apex platform. It copies data from sources such as HDFS, shared drives, NFS, FTP, and Kafka to a range of destinations, and it supports copying from a remote HDFS cluster to a local one. dtIngest runs YARN jobs to copy data in parallel, so it is very fast. It handles failures and recovery, and it can poll directories periodically for continuous copying.

Usage: dtingest [OPTION]... SOURCEURL... DESTINATIONURL

Example: dtingest hdfs://nn1:8020/source hdfs://nn2:8020/dest


The distcp command copies data from one cluster to another in parallel. Give the source and the destination as full paths that include each cluster's NameNode; internally, the copy is carried out by mappers.

Example:

$ hadoop distcp <src> <dst>

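There are a few options you can set for distcp:

-m <num_maps> sets the maximum number of map tasks that copy data simultaneously; more maps can speed up the copy.

-atomic commits the copy atomically: the data is written to a temporary location and then moved to the final destination, so the destination sees either all of it or none of it.

-update copies only the files that are missing from the destination or differ from the source.

The generic file-copy commands in Hadoop, -cp and -put, also work, but they are suitable only for small volumes of data.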
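
To make the options concrete, here is a minimal sketch that builds a command with an explicit map count and an incremental copy. The helper name distcp_update_cmd and the value 20 are illustrative assumptions, and the helper only prints the command. It leaves -atomic out on purpose because, as far as I know, DistCp rejects -atomic combined with -update.

```shell
# Print (do not run) a distcp command with -m and -update.
# Usage: distcp_update_cmd <num_maps> <src> <dst>
distcp_update_cmd() {
    maps=$1
    src=$2
    dst=$3
    # Reject anything that is empty, non-numeric, or zero for -m.
    case $maps in
        ''|*[!0-9]*|0) echo "num_maps must be a positive integer" >&2; return 1 ;;
    esac
    printf 'hadoop distcp -m %s -update %s %s\n' "$maps" "$src" "$dst"
}

# Example: at most 20 maps, copying only new or changed files.
distcp_update_cmd 20 hdfs://nn1:8020/source hdfs://nn2:8020/dest
```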


distcp is used for copying data to and from Hadoop filesystems in parallel. It is similar to the generic hadoop fs -cp command. Under the hood, distcp is implemented as a map-only MapReduce job: the copying is done by mappers that run in parallel across the cluster.

Usage:

  • copy one file to another

    % hadoop distcp file1 file2

  • copy directories from one location to another

    % hadoop distcp dir1 dir2

If dir2 does not exist, distcp creates it and copies the contents of dir1 into it. If dir2 already exists, dir1 is copied under it. The -overwrite option forces files to be overwritten within the same folder, and the -update option copies only the files that have changed.

  • transferring data between two HDFS clusters

    % hadoop distcp -update -delete hdfs://nn1/dir1 hdfs://nn2/dir2

The -delete option deletes files or directories from the destination that are not present in the source. It has to be combined with -update or -overwrite.
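
Because -delete removes data from the destination, it can be worth checking the flag combination before submitting a job. The sketch below does that under two assumptions taken from my reading of DistCp's option handling: -update and -overwrite are mutually exclusive, and -delete needs one of them. The helper name check_distcp_flags is hypothetical.

```shell
# Validate a set of distcp flags; prints "ok" or an error message.
# Usage: check_distcp_flags [-update] [-overwrite] [-delete]
check_distcp_flags() {
    update=no
    overwrite=no
    delete=no
    for flag in "$@"; do
        case $flag in
            -update)    update=yes ;;
            -overwrite) overwrite=yes ;;
            -delete)    delete=yes ;;
        esac
    done
    if [ "$update" = yes ] && [ "$overwrite" = yes ]; then
        echo "error: -update and -overwrite are mutually exclusive"
        return 1
    fi
    if [ "$delete" = yes ] && [ "$update" = no ] && [ "$overwrite" = no ]; then
        echo "error: -delete needs -update or -overwrite"
        return 1
    fi
    echo "ok"
}

# The combination used in the example above passes the check.
check_distcp_flags -update -delete
# -delete on its own is rejected (|| true keeps a strict shell from exiting).
check_distcp_flags -delete || true
```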


DistCp (distributed copy) is a tool used for copying data between clusters. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

Usage: $ hadoop distcp <src> <dst>

example: $ hadoop distcp hdfs://nn1:8020/file1 hdfs://nn2:8020/file2

file1 on nn1 is copied to nn2 under the name file2.

DistCp is the best tool for this as of now. Sqoop copies data between a relational database and HDFS in either direction, but not from one HDFS to another.

More info: two versions are available, and distcp2 offers better runtime performance than the original distcp.


Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel. The canonical use case for distcp is transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, then the hdfs scheme is appropriate to use.

$ hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

The data in the /foo directory on namenode1 is copied to the /bar directory on namenode2. If /bar does not exist, distcp creates it. Multiple source paths can also be specified.
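
For the multiple-source case, a minimal sketch: sources come first and the destination is the last argument. The helper name distcp_multi_cmd and the paths /foo/a and /foo/b are made up for illustration, and the helper only prints the command rather than running it.

```shell
# Print (do not run) a distcp command with one or more sources.
# Usage: distcp_multi_cmd <src>... <dst>
distcp_multi_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "need at least one source and a destination" >&2
        return 1
    fi
    printf 'hadoop distcp'
    # Arguments are passed through in order: sources, then destination.
    for p in "$@"; do
        printf ' %s' "$p"
    done
    printf '\n'
}

# Example: two source directories copied into one destination directory.
distcp_multi_cmd hdfs://namenode1/foo/a hdfs://namenode1/foo/b hdfs://namenode2/bar
```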

Like the rsync command, distcp by default skips files that already exist at the destination. Use the -overwrite option to overwrite existing files in the destination directory, or the -update option to copy only the files that have changed.

$ hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo

distcp is implemented as a MapReduce job in which the copying is done by maps that run in parallel across the cluster. There are no reducers.

If you try to copy data between two HDFS clusters that run different versions, the copy will fail because their RPC systems are incompatible. In that case, use the read-only, HTTP-based HFTP filesystem to read from the source. The job must then run on the destination cluster.

$ hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar

50070 is the default port of the NameNode's embedded web server in Hadoop 2.x and earlier.
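
The rule in this answer (hdfs when versions match, hftp on port 50070 otherwise) can be sketched as a small helper that picks the source URI. The function name src_uri and the version strings are hypothetical, the version check is a plain string comparison, and the sketch only applies to Hadoop releases that still ship HFTP.

```shell
# Print the source URI to hand to distcp, following the rule above.
# Usage: src_uri <src_version> <dst_version> <src_namenode_host> <path>
src_uri() {
    src_ver=$1
    dst_ver=$2
    host=$3
    path=$4
    if [ "$src_ver" = "$dst_ver" ]; then
        # Same version on both clusters: plain HDFS RPC works.
        printf 'hdfs://%s%s\n' "$host" "$path"
    else
        # Different versions: read over HTTP from the NameNode web port.
        printf 'hftp://%s:50070%s\n' "$host" "$path"
    fi
}

# Example with made-up versions.
src_uri 2.7.3 2.7.3 namenode1 /foo
src_uri 1.2.1 2.7.3 namenode1 /foo
```

The printed URI would go in the source position of hadoop distcp; in the hftp case the job is run on the destination cluster.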