We've implemented an auto-kill switch for our production EKS clusters, and the results have been impressive: we saved over $23,000 in a year! Initially, we relied on passive alerts for rogue scaling events and leftover nodes, but by the time anyone noticed an alert, the bill had already gone up. Switching to Voidburn let us enforce a hard budget on production workloads and node groups. The system automatically terminates instances that exceed their budget limits, which has stopped about $1,943 in monthly waste.
When a production workload exceeds its budget, the enforcer takes a snapshot and logs the instance state, so if a termination turns out to be wrong, or the workload is needed urgently, we can resume quickly without losing data. With clear audit trails for compliance and strict rules governing what gets terminated, we trust the "kill switch" a lot more. For those managing high-scale environments, I'm curious how you tackle runaway production costs: are you using alerts, or have you switched to automated systems?
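For anyone who wants to picture the flow, here is a minimal Python sketch of the snapshot, log, then terminate ordering described above. It is not Voidburn's API: every name in it (Instance, AuditRecord, BudgetEnforcer, take_snapshot, terminate) is hypothetical, and it assumes each instance carries its own budget and a month-to-date cost figure. The snapshot and terminate steps are passed in as plain callables so the logic can be tried without touching a real cluster.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Instance:
    # Hypothetical shape: one instance with its own budget (an assumption).
    instance_id: str
    month_to_date_usd: float
    budget_usd: float


@dataclass
class AuditRecord:
    # What gets kept for the audit trail and for a later manual resume.
    instance_id: str
    month_to_date_usd: float
    budget_usd: float
    snapshot_id: str


@dataclass
class BudgetEnforcer:
    # Injected callables stand in for whatever the real tooling does.
    take_snapshot: Callable[[Instance], str]
    terminate: Callable[[Instance], None]
    audit_log: List[AuditRecord] = field(default_factory=list)

    def enforce(self, instances: List[Instance]) -> List[str]:
        """Terminate over-budget instances; return the IDs that were killed."""
        terminated = []
        for inst in instances:
            if inst.month_to_date_usd <= inst.budget_usd:
                continue  # within budget, leave it alone
            # Snapshot and log first, so state exists before anything is killed.
            snapshot_id = self.take_snapshot(inst)
            self.audit_log.append(
                AuditRecord(
                    inst.instance_id,
                    inst.month_to_date_usd,
                    inst.budget_usd,
                    snapshot_id,
                )
            )
            self.terminate(inst)
            terminated.append(inst.instance_id)
        return terminated


def demo() -> List[str]:
    # Fake snapshot/terminate functions, purely for illustration.
    enforcer = BudgetEnforcer(
        take_snapshot=lambda inst: f"snap-{inst.instance_id}",
        terminate=lambda inst: None,
    )
    fleet = [
        Instance("i-aaa", month_to_date_usd=120.0, budget_usd=100.0),
        Instance("i-bbb", month_to_date_usd=40.0, budget_usd=100.0),
    ]
    return enforcer.enforce(fleet)
```

The point of the ordering is that an audit record and a snapshot ID exist for every terminated instance, which is what makes a manual resume possible if the kill was a mistake.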
2 Answers
Your devops lead takes 12 hours to wake up to a page? Seriously? That's quite a failure for an on-call process. How can anyone manage with that kind of delay? Sounds like your approach has definitely improved since you shared that. Also, love the idea of using snapshots and checkpoints to reduce risks!
I love hearing about real-world implementations like this! It's super impressive that you've saved so much. It's true that relying only on alerts isn't enough: by the time someone reacts, the damage is done. Automating those responses is definitely the way to go. I've seen teams waste so much time because they rely on human monitoring rather than setting hard limits.

Yeah, we've really tightened our on-call process to make sure responses are quick. The manual resume feature also helps us avoid unnecessary downtime if there's a mistake.