I got a serious wake-up call last Friday night when our GPU costs spiked to a staggering $50k a day because engineering teams were testing AI models with no quotas or policies in place. If things had continued unchecked, we could have been looking at a $200k bill by Monday. I spent most of my day manually shutting down instances and tagging resources. This isn't the first time, either: it's our third spike this quarter.

We have no pre-deployment checks or cost controls, and our current monitoring tools like CloudHealth only give us postmortem analysis. I'm looking for suggestions on tools or strategies that can help us shift left without micromanaging the team too much. Any advice would be welcome!
3 Answers
I can relate to those Friday nights! At a previous job, we accidentally spun up 200 A100 GPUs because of a runaway test script, and the bill was insane. We found that most tools are geared towards analyzing incidents afterward, not preventing them. We ended up creating custom pre-commit hooks that would estimate costs based on our configurations to help catch potential issues before merging. It’s not foolproof, but it helped us spot obvious mistakes. We also automated the termination of tagged instances after a certain time unless extended, which cut down on costs.
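In case it's useful, here is a minimal sketch of that auto-termination job in Python with boto3. The `ttl-hours` tag name, the default TTL, and the budget of what counts as "extended" are our own conventions, not anything standard, and pagination and dry-run safety are left out for brevity:

```python
"""Terminate tagged instances that have outlived their TTL.

Assumes instances are launched with a ttl-hours tag (our convention);
engineers can bump the tag value to extend a box instead of asking us.
Run it on a schedule (cron, Lambda, etc.).
"""
from datetime import datetime, timedelta, timezone

import boto3

DEFAULT_TTL_HOURS = 8  # applied when the tag value is missing or unparsable

ec2 = boto3.client("ec2")

# Only look at running instances that carry the ttl-hours tag.
resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag-key", "Values": ["ttl-hours"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

now = datetime.now(timezone.utc)
expired = []

for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        try:
            ttl = float(tags["ttl-hours"])
        except (KeyError, ValueError):
            ttl = DEFAULT_TTL_HOURS
        age = now - inst["LaunchTime"]  # LaunchTime is timezone-aware
        if age > timedelta(hours=ttl):
            expired.append(inst["InstanceId"])

if expired:
    print(f"Terminating {len(expired)} expired instances: {expired}")
    ec2.terminate_instances(InstanceIds=expired)
```

We ran something like this every 15 minutes; bumping the `ttl-hours` tag on an instance was the "unless extended" escape hatch, so nobody had to file a ticket to keep a long job alive.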
Have you looked into Infracost? It can post cost estimates and budget alerts on pull requests, but it only works if your infrastructure is defined in Terraform. If you're already on Terraform, it could be a useful addition to your setup.
Is it really a tooling issue? It sounds like you already have a tool in place, but if the engineers aren't responsive to the alerts, that’s a bigger problem. You might need to focus on improving the team's awareness and accountability.

I agree! Education and awareness are key. Getting the team on board with monitoring and owning their own spend might help as much as any new tool.