I've got a flash array set up to create snapshots every 5 minutes, and it's been stable for about three years. Typically, these snapshots have been around 200MB each, but over the weekend they started ballooning to over 25GB! Storage utilization seems stable, with no substantial increases, yet my flash array is near 100% capacity. The server admins mentioned they activated SQL auditing, which could explain a lot of data changing without used storage growing much.
Currently, one of my cluster nodes is showing a really high data transfer rate (2.4Gbps), while the other nodes have much lower rates. As a stopgap, I'm temporarily disabling snapshots and immutability to free up space, but I still need to pinpoint which VM is causing these massive checkpoint sizes. I'm using Server 2022 and need guidance on which Performance Monitor stats would be best to track.
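For a sense of scale, the numbers above can be turned into sustained change rates. This is only a back-of-the-envelope sketch: it assumes decimal units (1GB = 10^9 bytes) and that a snapshot's size roughly equals the unique data changed during the 5-minute interval. The function names are made up for illustration, and whether the 2.4Gbps on the busy node is all storage writes is an open assumption.

```python
# Rough arithmetic on the figures in the question (decimal units assumed).

SNAPSHOT_INTERVAL_S = 5 * 60  # snapshots are taken every 5 minutes


def change_rate_mb_per_s(snapshot_bytes: float, interval_s: float = SNAPSHOT_INTERVAL_S) -> float:
    """Average rate of changed data (MB/s) implied by one snapshot's size."""
    return snapshot_bytes / interval_s / 1e6


def mb_per_s_to_gbps(mb_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second."""
    return mb_per_s * 8 / 1000


def gbps_to_gb_per_interval(gbps: float, interval_s: float = SNAPSHOT_INTERVAL_S) -> float:
    """Data moved (GB) in one snapshot interval at a given line rate."""
    return gbps / 8 * interval_s


normal_rate = change_rate_mb_per_s(200e6)   # typical 200MB snapshot
current_rate = change_rate_mb_per_s(25e9)   # 25GB snapshot seen this weekend

print(f"normal:  {normal_rate:.2f} MB/s")
print(f"current: {current_rate:.2f} MB/s ({mb_per_s_to_gbps(current_rate):.2f} Gbps)")
print(f"2.4 Gbps for one interval: {gbps_to_gb_per_interval(2.4):.0f} GB")
```

A 25GB snapshot every 5 minutes works out to roughly 83 MB/s of changed data, about 125 times the usual rate. The busy node's 2.4Gbps would move about 90GB per interval, which is more than enough to account for it, since blocks rewritten several times within an interval only count once in the snapshot.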
1 Answer
So, why are you snapshotting every 5 minutes? It sounds like you have some process rewriting a lot of existing data without actually adding more, which fits with the SQL audit idea. You should check your statistics for sustained writes, starting with the node where you're seeing the high transfer rate. It would be a good idea to work through this together with your server admins, and if they can’t find the issue, throttling their disk writes could help you identify the source of the problem.
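Once you have per-VM write statistics, finding the offender is mostly a matter of averaging and sorting. Here's a minimal sketch of that step. The VM names and sample values are invented, and how you collect the samples is up to you. If these are Hyper-V VMs, the `Hyper-V Virtual Storage Device` counter set with its `Write Bytes/sec` counter is one likely source, but check what your hosts actually expose.

```python
from collections import defaultdict


def rank_vms_by_write_rate(samples):
    """Rank VMs by average write rate, highest first.

    `samples` is an iterable of (vm_name, write_bytes_per_sec) pairs,
    e.g. parsed from a counter log. Returns a list of
    (vm_name, average_bytes_per_sec) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for vm_name, write_bytes_per_sec in samples:
        totals[vm_name] += write_bytes_per_sec
        counts[vm_name] += 1
    averages = {vm: totals[vm] / counts[vm] for vm in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples: three VMs, two readings each (bytes/sec).
example_samples = [
    ("sql-01", 95_000_000), ("sql-01", 105_000_000),
    ("web-01", 1_200_000), ("web-01", 800_000),
    ("file-01", 4_000_000), ("file-01", 6_000_000),
]

for vm, avg in rank_vms_by_write_rate(example_samples):
    print(f"{vm}: {avg / 1e6:.1f} MB/s average writes")
```

A VM that keeps writing tens of MB/s across many samples while the rest sit in the low single digits is your prime suspect. Averaging over several snapshot intervals helps separate a sustained writer from a short burst.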

That's an interesting thought! I did ask the admins to turn off auditing, but they balked at the downtime it would cause. The owners pushed for 5-minute checkpoints since they want to have quick recovery options. It’s frustrating since we’ve never hit 75% utilization before!