I'm curious about how AWS volume snapshots are created without needing to pause or shut down the server. Since snapshots are supposed to be point-in-time backups, how can AWS handle it when the data is changing during the snapshot process? I've heard it's safe not to turn off the server for backups, but I'm wondering what happens under the hood, especially at the code level. I appreciate any insight into the technical implementation, and sorry if my English isn't perfect!
6 Answers
If you want to dive deeper into snapshots, check out this segment from a re:Invent 2019 talk about EBS snapshots. It covers how the whole process works and might help answer some of your questions: [How does an EBS Snapshot work](https://youtu.be/07Wg67qnpKw?t=311).
On a more conceptual level, imagine a queue that holds every write made to a disk. When you take a snapshot, you note the position of that queue at that moment; meanwhile, writes keep arriving and the disk keeps syncing as needed. Reads still come from the disk, but you also check the queue for newer updates before returning the data. That extra check means a slight dip in read performance while the snapshot is being created, but it's a manageable trade-off for getting a clean point-in-time backup.
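You might also want to look into how snapshots work in other systems, like LVM. It uses a copy-on-write approach: before a block on the original volume is overwritten, its old content is copied to the snapshot area, so the snapshot only stores blocks that changed after it was taken. Remember that snapshotting an active application, like a database, without stopping or flushing it gives you the disk as it would look after a sudden power loss. Data still sitting in the application's memory isn't captured, which can lead to 'bad' snapshots, so that's worth considering too!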
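To make the copy-on-write idea concrete, here's a toy sketch in Python. It is not how LVM or EBS is actually implemented; the `Volume` and `Snapshot` classes are made up for illustration, and a "block" is just a list element. It only shows why writes can continue while the snapshot still reads as it did at creation time.

```python
class Snapshot:
    """Point-in-time view of a Volume (hypothetical toy class)."""

    def __init__(self, origin):
        self.origin = origin
        self.saved = {}  # block index -> content as it was at snapshot time

    def read(self, index):
        # If the block changed after the snapshot, use the preserved copy;
        # otherwise the origin still holds the snapshot-time content.
        if index in self.saved:
            return self.saved[index]
        return self.origin.blocks[index]


class Volume:
    """Toy block device with copy-on-write snapshots (hypothetical)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        # Creating a snapshot copies no data, so it is effectively instant.
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy-on-write: preserve the old content once per snapshot,
        # then let the write go through to the live volume.
        for snap in self.snapshots:
            snap.saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")   # first overwrite: old "b" is preserved for the snapshot
vol.write(1, "BB")  # second overwrite: nothing more to preserve

print(vol.blocks)                        # live volume sees the new data
print([snap.read(i) for i in range(3)])  # snapshot still sees the old data
print(snap.saved)                        # only the changed block was copied
```

Note that the snapshot costs nothing up front; the price is paid later, one extra copy the first time each block is overwritten.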
AWS handles snapshots in a pretty clever way. Essentially, when you create a snapshot, it makes a new layer on top of the existing data, similar to other storage tech where the top layer can be written to while the lower ones are read-only. There might be a tiny pause in I/O operations, but it's so quick that most people won't even notice it. This allows AWS to keep a read-only copy of your data as it was at the moment you clicked 'create snapshot', which is then copied to Amazon S3 in the background. Windows has a comparable mechanism, the Volume Shadow Copy Service (VSS), which flushes I/O during snapshots to ensure the data is consistent. That kind of flush isn't a strict requirement on AWS, since many workloads can recover from a snapshot taken without it without a hitch.
Right, "quiescing" the apps is key for making sure you get the best snapshot.
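For anyone wondering what quiescing looks like in a script, here's a minimal sketch of the freeze, snapshot, thaw pattern. The three callables are placeholders I made up: on a Linux box `freeze`/`thaw` might wrap `fsfreeze -f` and `fsfreeze -u` (or a database's own flush command), and `start_snapshot` might call the EC2 `CreateSnapshot` API. The sketch assumes the point in time is fixed once the snapshot request has been accepted, so writes can resume right after the call returns.

```python
from contextlib import contextmanager


@contextmanager
def quiesced(freeze, thaw):
    """Hold writes for the duration of the block, and always release them."""
    freeze()
    try:
        yield
    finally:
        # Thaw even if starting the snapshot fails, so the app never stays frozen.
        thaw()


def consistent_snapshot(freeze, thaw, start_snapshot):
    """Start a snapshot while writes are held; return whatever ID it yields."""
    with quiesced(freeze, thaw):
        return start_snapshot()


# Stand-in callables that only record the order of operations.
events = []
snap_id = consistent_snapshot(
    freeze=lambda: events.append("freeze"),
    thaw=lambda: events.append("thaw"),
    start_snapshot=lambda: events.append("snapshot") or "snap-example",
)
print(events, snap_id)
```

The freeze window only has to cover starting the snapshot, not the whole upload, which is why the pause can stay short.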
I'm also wondering why creating snapshots sometimes seems to take so long. Any insights on that?
It really depends on the needs of your applications. Some servers can be snapshotted safely without pausing anything because their data isn't actively changing. In my experience, we stop 99% of our servers for backups, but some manage just fine without a pause. Thinking of EBS like an iSCSI SAN can help you understand conceptually how AWS manages snapshots and replication, even if the actual methods are different.
That makes a lot of sense! Thanks for breaking it down.