Hey everyone! I'm in the process of moving a large environment from on-premises to Azure, and I've been doing delta syncs with rsync every few days to prepare for the cutover. I'm currently dealing with a pretty rough situation: 22 million files totaling about 5TB, and the delta syncs are taking over 3 days to complete, which seems excessive. I've tried tweaking various settings like nconnect and mounting with noatime, along with every other tuning suggestion I could think of. The setup is an Azure VM with both the on-premises Isilon share and an Azure NFS share mounted. Splitting the directories for multi-threading hasn't helped much since they're deeply nested and unbalanced in file counts. I'm looking for any suggestions, even hacky ones, to speed this up. Bandwidth and VM resources aren't the limit; it just takes an enormous amount of time to compare the metadata of 22 million files. Any ideas?
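For reference, the Azure share is mounted roughly like this (account and share names are placeholders, and the option values are just what I've been experimenting with, not a recommendation):

```
# Placeholder mount for the Azure NFS share, with the nconnect/noatime tweaks mentioned above
sudo mount -t nfs -o rw,vers=4.1,nconnect=8,noatime,rsize=1048576,wsize=1048576 \
    <storage-account>.file.core.windows.net:/<storage-account>/<share> /mnt/azure-nfs
```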
6 Answers
If you're open to alternatives, rclone might work better for you. You can use GNU parallel to run multiple rclone instances and speed things up significantly. Gathering the file lists still takes time, but the actual transfer is pretty snappy with the right setup!
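A rough sketch of what that could look like, splitting on top-level directories (the paths and concurrency numbers below are placeholders to adapt, and the unbalanced-tree caveat from the question still applies):

```
# One rclone process per top-level directory, 8 at a time, each with its own transfer/checker pool
ls /mnt/isilon/data | parallel -j 8 \
    rclone copy /mnt/isilon/data/{} /mnt/azure-nfs/data/{} \
    --transfers 32 --checkers 64
```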
Rsync can be pretty sluggish for large file transfers, especially from network shares like NFS. Have you considered using azcopy instead? It's specifically designed for moving data into Azure storage and might serve you better.
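This only works if the Azure side can be addressed as Blob or Azure Files rather than only as an NFS mount, but then it could be as simple as something like this (account, container, and SAS token are placeholders):

```
# Sync the on-prem mount straight to a blob container via a SAS URL
azcopy sync "/mnt/isilon/data" \
    "https://<storage-account>.blob.core.windows.net/<container>?<SAS-token>" \
    --recursive
```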
If preserving file permissions isn't crucial, Resilio Sync could be a good alternative. After the initial hashing pass, it watches for changes and only transfers new data, which saves a ton of time. It uses a BitTorrent-style protocol to manage this efficiently.
If the initial copy of 5TB is all that matters right now, consider creating a compressed archive to seed the data, unpacking it in the cloud, and then running rsync only for the delta updates. If you need to do the transfer directly, I suggest using tar for the initial copy: `(cd /src/dir && tar cf - .) | (cd /dst/dir && tar xvf -)`. You could run it over SSH or add mbuffer to improve throughput, as in the sketch below.
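Something along these lines, with mbuffer smoothing out the pipe (host name, paths, and buffer size are placeholders):

```
# Stream a tar of the source through mbuffer and extract it on the Azure VM over SSH
tar cf - -C /src/dir . | mbuffer -m 4G | ssh azureuser@azure-vm 'tar xf - -C /dst/dir'
```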
The main bottleneck with rsync is definitely its metadata handling, especially with such a massive number of files. A simpler alternative is tar combined with mbuffer for maximum throughput; however, that approach won't handle updates to existing files. You might run something like this: `tar -cf - -C /data/folder . | mbuffer -m 8G | tar -xf - -C /dst/folder`. You can insert `pv` into the pipeline if you want to watch the transfer rate. For updated files, you could use `find` to pick out only the files modified in the last week and copy just those; see the sketch below.
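A minimal sketch of that find-based delta pass, assuming GNU find and GNU tar (paths, buffer size, and the 7-day window are placeholders):

```
# Re-copy only files changed in the last 7 days, streamed through the same tar | mbuffer pipeline
cd /data/folder && \
find . -type f -mtime -7 -print0 \
    | tar --null -cf - --files-from=- \
    | mbuffer -m 1G \
    | tar -xf - -C /dst/folder
```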
As far as I know, rsync runs as a single process; there's no built-in multi-threading. As soon as I switched to running several instances in parallel, I was able to fully utilize my bandwidth. Perhaps give that a go!
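A sketch of that approach, assuming GNU parallel and top-level directories as the split points (paths and job count are placeholders; the unbalanced-tree issue from the question still applies):

```
# Run one rsync per top-level directory, 8 at a time
find /src/dir -mindepth 1 -maxdepth 1 -type d -printf '%f\n' \
    | parallel -j 8 rsync -a /src/dir/{}/ /dst/dir/{}/
```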
I’ve actually completed the initial copies and I’m exploring nsync now. I thought nconnect might help with metadata handling, but it didn’t improve anything.