Hey folks! I'm in the process of migrating a massive environment from on-prem to Azure, and I've been running differential rsyncs every few days to prepare for the cutover. One of the toughest cases involves a 5TB NFS share with around 22 million files. The delta syncs are taking upwards of 3 days, and I've tried everything from tweaking nconnect settings to noatime and various rsync options, but nothing seems to help. It's tough because my directory structure is nested and uneven in terms of file count, making it hard to break things down. I'm not limited by bandwidth or VM resources; it just takes a long time to compare the metadata of all these files. Any hacks or suggestions to speed things up would be greatly appreciated!
9 Answers
Have you thought about using rclone? You could run multiple rclone instances in parallel for a big speed boost. What I did was build a per-directory list of files to transfer, which worked out well even though the initial listing took a while — see the sketch below.
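In case it helps, a minimal sketch of that approach, assuming the share is mounted at `/mnt/nfs/share` and you already have an Azure remote configured in rclone (remote name, paths, and the tuning values are placeholders):

```
# Build one file list per top-level directory (the listing is the slow part, so do it once)
cd /mnt/nfs/share
for d in */; do
  find "$d" -type f > "/tmp/filelist-${d%/}.txt" &
done
wait

# Run one rclone instance per list in parallel; tune --transfers/--checkers to taste
for list in /tmp/filelist-*.txt; do
  rclone copy /mnt/nfs/share azremote:container \
    --files-from "$list" --transfers 16 --checkers 32 --log-file "${list%.txt}.log" &
done
wait
```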
Just a thought — if you control the receiving end, you might find better efficiency by running rsync as a daemon there. You then use the `::` syntax (`rsync ... host::module/path`) to talk to the daemon directly instead of going through the SSH transport, which can streamline the process.
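For illustration, a rough sketch of the daemon setup — the module name and paths are made up:

```
# Minimal /etc/rsyncd.conf on the receiving side (module name, path, and auth are placeholders)
cat > /etc/rsyncd.conf <<'EOF'
[data]
    path = /data/folder
    read only = false
EOF

rsync --daemon                                    # start the daemon on the receiver

# From the source, push using the :: daemon syntax instead of SSH
rsync -a --delete /mnt/nfs/share/ desthost::data/
```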
If preserving file permissions isn't critical, try Resilio Sync. After your initial hashes are completed, it uses inotify to monitor changes and only transfers what’s necessary, which could save you time in the long run. It's super efficient for ongoing syncing.
The main issue with rsync is its metadata overhead, especially with such a huge number of files. If you're okay with a bit of a workaround, try tar combined with mbuffer for the transfer. It performs well, but it's a dumb copy — it won't update files that already exist on the other side. Here's a quick command (destination path is just an example): `tar -cf - -C /data/folder . | mbuffer -m 8G | tar -xf - -C /mnt/target`. You can drop `pv` into the pipeline to track the speed, and for catching up on recent changes use a `find` like this to copy files modified in the last week: `find /data/folder -type f -mtime -7 -exec cp --parents {} /mnt/target \;`. Alternatively, kick off parallel streams with `xargs` to reduce the strain — see the sketch below.
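A sketch of the `xargs` idea — paths and the degree of parallelism are placeholders, and it assumes both source and target are mounted locally:

```
# Copy files modified in the last 7 days, 8 cp processes at a time.
# -print0/-0 keeps odd filenames safe; --parents recreates the directory structure under the target.
find /data/folder -type f -mtime -7 -print0 \
  | xargs -0 -P 8 -I{} cp --parents {} /mnt/target
```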
Oh, the trials of learning rsync! It's known for being painful with large numbers of files over NFS. But really, have you considered azcopy? It's built for exactly this kind of heavy transfer into Azure and might save you a lot of headaches!
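For what it's worth, the basic shape of an azcopy seed into Blob storage looks something like this — storage account, container, and SAS token are placeholders:

```
# Recursive copy of the whole share into a blob container; the SAS token is a placeholder
azcopy copy "/mnt/nfs/share" \
  "https://mystorageacct.blob.core.windows.net/mycontainer?<SAS-token>" \
  --recursive

# later delta passes can use `azcopy sync` against the same URL
```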
For a one-time 5TB transfer, I'd suggest creating a compressed archive for seeding your cloud environment first, then use rsync for any delta updates later. If you want to go direct from on-site to cloud, you can execute a command like `(cd /src/dir && tar cf - .) | ssh user@host "(cd /dst/dir && tar xvf -)"`. This could be more efficient compared to rsync for the initial copy, especially if you utilize compression!
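If you do want compression on the wire, a variant of the same pipe (just a sketch — gzip only pays off if the data isn't already compressed):

```
# Same tar-over-ssh pipe, gzipped in transit
(cd /src/dir && tar czf - .) | ssh user@host "(cd /dst/dir && tar xzf -)"
```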
Last time I worked with rsync, the single-process design was the bottleneck — one rsync just can't keep the link busy with that many files. Splitting the tree and running several rsync instances in parallel really ramped up my bandwidth usage and sped things along!
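One crude way to parallelize it, sketched with placeholder paths — note that with an uneven tree like yours, a few big directories will still dominate the runtime:

```
# One rsync per top-level directory, 8 at a time; -R preserves the relative path on the far side
cd /mnt/nfs/share
ls -d */ | xargs -P 8 -I{} rsync -aR {} user@desthost:/data/share/
```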
You could also consider giving rclone a shot! It may be beneficial for your use case, especially with the volumes you're handling.
Compress your files into larger chunks! With so many small files, the per-file overhead dominates: stat-ing and opening millions of tiny files takes far longer than streaming the same bytes as a handful of large archives, so bundling them up could speed things up substantially.
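A rough sketch of the bundling idea, with placeholder paths (staging space for the archives is the obvious cost):

```
# Roll each top-level directory into one compressed archive before shipping it
cd /mnt/nfs/share
for d in */; do
  tar czf "/mnt/staging/${d%/}.tar.gz" "$d"
done
```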
I’m actually exploring fpsync now, and I’ve already done the initial copy, so I’m focusing on optimizing the subsequent delta updates.
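Assuming that's fpart's fpsync, the invocation is roughly this — the job count and chunk size are guesses you'd want to tune, and the paths are placeholders:

```
# Split the tree into chunks of ~2000 files and run 8 parallel rsync workers over them
fpsync -n 8 -f 2000 -o "-a --numeric-ids" /mnt/nfs/share/ /mnt/target/share/
```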