Speed up rsync with Simultaneous/Concurrent File Transfers?

We need to transfer 15TB of data from one server to another as fast as we can. We're currently using rsync, but we're only getting speeds of around 150 Mb/s when our network is capable of 900+ Mb/s (tested with iperf). I've run tests on the disks, network, etc., and figured that rsync transferring only one file at a time is what's causing the slowdown.

I found a script that runs a separate rsync for each folder in a directory tree (allowing you to limit it to x at a time), but I can't get it working; it still just runs one rsync at a time.

I found the script here (copied below).

Our directory tree is like this:

/main
   - /files
      - /1
         - 343
            - 123.wav
            - 76.wav
         - 772
            - 122.wav
         - 55
            - 555.wav
            - 324.wav
            - 1209.wav
         - 43
            - 999.wav
            - 111.wav
            - 222.wav
      - /2
         - 346
            - 9993.wav
         - 4242
            - 827.wav
      - /3
         - 2545
            - 76.wav
            - 199.wav
            - 183.wav
         - 23
            - 33.wav
            - 876.wav
         - 4256
            - 998.wav
            - 1665.wav
            - 332.wav
            - 112.wav
            - 5584.wav

So what I'd like to happen is to create an rsync for each of the directories in /main/files, up to a maximum of, say, 5 at a time. So in this case, 3 rsyncs would run, for /main/files/1, /main/files/2 and /main/files/3.

I tried it like this, but it just runs one rsync at a time (for the /main/files/2 folder):

#!/bin/bash

# Define source, target, maxdepth and cd to source
source="/main/files"
target="/main/filesTest"
depth=1
cd "${source}"

# Set the maximum number of concurrent rsync threads
maxthreads=5
# How long to wait before checking the number of rsync threads again
sleeptime=5

# Find all folders in the source directory within the maxdepth level
find . -maxdepth ${depth} -type d | while IFS= read -r dir
do
    # Make sure to ignore the parent folder
    if [ $(echo "${dir}" | awk -F'/' '{print NF}') -gt ${depth} ]
    then
        # Strip leading dot slash
        subfolder=$(echo "${dir}" | sed 's@^\./@@g')
        if [ ! -d "${target}/${subfolder}" ]
        then
            # Create destination folder and set ownership and permissions to match source
            mkdir -p "${target}/${subfolder}"
            chown --reference="${source}/${subfolder}" "${target}/${subfolder}"
            chmod --reference="${source}/${subfolder}" "${target}/${subfolder}"
        fi
        # Make sure the number of rsync threads running is below the threshold
        while [ $(ps -ef | grep -c "[r]sync") -gt ${maxthreads} ]
        do
            echo "Sleeping ${sleeptime} seconds"
            sleep ${sleeptime}
        done
        # Run rsync in background for the current subfolder and move on to the next one
        nohup rsync -a "${source}/${subfolder}/" "${target}/${subfolder}/" </dev/null >/dev/null 2>&1 &
    fi
done

# Finally, rsync any files at the top level (within maxdepth) that the per-directory rsyncs above did not cover
find . -maxdepth ${depth} -type f -print0 | rsync -a --files-from=- --from0 ./ "${target}/"

Tags: bash, shell, ubuntu-12.04, rsync, simultaneous

Answers:


The shortest version I found uses the --cat option of parallel, as below. This version avoids xargs, relying only on features of parallel:

cat files.txt | \
  parallel -n 500 --lb --pipe --cat rsync --files-from={} user@remote:/dir /dir -avPi

#### Arg explainer
# -n 500           :: split input into chunks of 500 entries
#
# --cat            :: create a tmp file, referenced by {}, holding each
#                     job's 500 entries
#
# user@remote:/dir :: the root relative to which entries in files.txt are considered
#
# /dir             :: local root relative to which files are copied

Sample content from files.txt:

/dir/file-1
/dir/subdir/file-2
....

Note that this doesn't use -j 50 for the job count; that didn't work on my end. Instead I've used -n 500 for the record count per job, calculated as a reasonable number given the total number of records.
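
For completeness, one way to produce a files.txt like the sample above is with find (a sketch; it assumes the tree lives under /dir):

# list every regular file under /dir, one path per line
find /dir -type f > files.txt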


You can use xargs, which supports running multiple processes in parallel. For your case it will be:

ls -1 /main/files | xargs -I {} -P 5 -n 1 rsync -avh /main/files/{} /main/filesTest/
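
For reference, the xargs flags doing the work here (all standard GNU xargs options):

# -I {} : substitute {} with the directory name read from stdin
# -P 5  : keep up to 5 rsync processes running at once
# -n 1  : pass one directory name per rsync invocation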

I've developed a Python package called parallel_sync:

https://pythonhosted.org/parallel_sync/pages/examples.html

Here is some sample code showing how to use it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds)

Parallelism is 10 by default; you can increase it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds, parallelism=20)

However, note that ssh typically has MaxSessions set to 10 by default, so to increase parallelism beyond 10 you'll have to modify your SSH settings.
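
For example, on the remote host (a sketch; the path assumes a stock OpenSSH install, and 20 is an arbitrary value):

# /etc/ssh/sshd_config -- raise the per-connection session limit
MaxSessions 20

Then reload the SSH daemon for the change to take effect (e.g. sudo systemctl reload sshd, or service ssh reload on older Ubuntu releases).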


There are a number of alternative tools and approaches for doing this listed around the web. For example:

  • The NCSA Blog has a description of using xargs and find to parallelize rsync without having to install any new software for most *nix systems (see the sketch after this list).

  • And parsync provides a feature rich Perl wrapper for parallel rsync.
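
That xargs-and-find pattern looks roughly like this (a sketch using the question's paths; the -P 5 limit mirrors the question's requirement and is an assumption, not the blog's exact command):

# One rsync per top-level directory, at most 5 at a time;
# -print0/-0 keep directory names with spaces intact
find /main/files -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -P 5 -I {} rsync -a {} /main/filesTest/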


Have you tried using rclone.org?

With rclone you could do something like

rclone copy "${source}/${subfolder}/" "${target}/${subfolder}/" --progress --multi-thread-streams=N

where --multi-thread-streams=N is the number of parallel streams rclone may use per file transfer.
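
Note that --multi-thread-streams splits individual large files across streams; for a tree of many small files like the one in the question, the rclone flag that controls how many files are copied concurrently is --transfers. A hedged variant (the value 16 is arbitrary):

rclone copy "${source}/${subfolder}/" "${target}/${subfolder}/" --progress --transfers=16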


The simplest approach I've found is using background jobs in the shell:

for d in /main/files/*; do
    rsync -a "$d" remote:/main/files/ &
done

Beware that it doesn't limit the number of jobs! If you're network-bound this isn't really a problem, but if you're waiting on spinning rust it will thrash the disk.

You could add

while [ $(jobs | wc -l | xargs) -gt 10 ]; do sleep 1; done

inside the loop for a primitive form of job control.
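
Put together, a sketch (the limit of 10 and the remote path just mirror the snippets above):

for d in /main/files/*; do
    # throttle: wait while 10 or more background jobs are still running
    while [ $(jobs | wc -l | xargs) -gt 10 ]; do sleep 1; done
    rsync -a "$d" remote:/main/files/ &
done
wait   # block until the last background rsyncs finish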


Updated answer (Jan 2020)

xargs is now the recommended tool to achieve parallel execution. It's pre-installed almost everywhere. For running multiple rsync tasks the command would be:

ls /srv/mail | xargs -n1 -P4 -I% rsync -Pa % myserver.com:/srv/mail/

This will list all folders in /srv/mail, pipe them to xargs, which will read them one by one and run 4 rsync processes at a time. The % character replaces the input argument in each command call.

Original answer using parallel:

ls /srv/mail | parallel -v -j8 rsync -raz --progress {} myserver.com:/srv/mail/{}

rsync transfers files as fast as it can over the network. For example, try using it to copy one large file that doesn't exist at all on the destination; that speed is the maximum speed rsync can transfer data. Compare it with the speed of scp (for example). rsync is even slower at raw transfer when the destination file already exists, because both sides have to have a two-way chat about which parts of the file have changed, but this pays for itself by identifying data that doesn't need to be transferred.
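
A quick way to measure that ceiling (a sketch; the size, paths, and host name are placeholders):

# random data, so compression can't flatter the numbers
dd if=/dev/urandom of=/tmp/bigfile bs=1M count=1024
time rsync -a /tmp/bigfile myserver.com:/tmp/
time scp /tmp/bigfile myserver.com:/tmp/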

A simpler way to run rsync in parallel would be to use parallel. The command below will run up to 5 rsyncs in parallel, each one copying one directory. Be aware that the bottleneck might not be your network but the speed of your CPUs and disks, in which case running things in parallel just makes them all slower, not faster.

run_rsync() {
    # e.g. copies /main/files/blah to /main/filesTest/blah
    # (trailing slash on the source so the directory's contents land in
    # the destination, rather than a nested blah/blah copy)
    rsync -av "$1/" "/main/filesTest/${1#/main/files/}/"
}
export -f run_rsync
parallel -j5 run_rsync ::: /main/files/*
