I'm diving into infrastructure work from a non-DevOps background, and learning about optimization has been an exciting journey. I'm curious what an ideal infrastructure-optimization tool would realistically include. What features should it have to make our lives easier?
1 Answer
For me, the ideal tool should clearly connect cost, performance, and reliability to specific services and owners—not just CPU and memory metrics. It should establish a baseline for normal behavior, flag real anomalies, and simulate the impacts of right-sizing or changing instance classes on both latency and costs before any changes are made. Enforcing strong tagging and providing visibility into unused resources, idle workloads, and over-provisioned clusters would be essential. Bonus points if it can learn workload patterns over time to minimize reactions to short-term spikes.
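To make the right-sizing point concrete, one way to picture it is a dry-run calculation that compares observed utilization against what the workload is currently provisioned with and estimates the savings before anything is changed. Here is a minimal Python sketch of that idea; the instance class names, hourly prices, and the p95-plus-headroom rule are assumptions for illustration, not how any particular tool or provider works, and it only covers the cost side, not latency.

```python
from dataclasses import dataclass

@dataclass
class InstanceOption:
    name: str         # hypothetical instance class name
    vcpus: int
    hourly_usd: float

# Hypothetical catalog; real names and prices vary by provider.
CATALOG = [
    InstanceOption("small", 2, 0.05),
    InstanceOption("medium", 4, 0.10),
    InstanceOption("large", 8, 0.20),
    InstanceOption("xlarge", 16, 0.40),
]

def p95(samples: list[float]) -> float:
    """95th-percentile of a sample list (nearest-rank method)."""
    ordered = sorted(samples)
    idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[idx]

def recommend(current: InstanceOption,
              cpu_used_vcpus: list[float],
              headroom: float = 1.3) -> tuple[InstanceOption, float]:
    """
    Pick the cheapest instance class whose vCPUs cover p95 usage
    times a headroom factor, and report estimated monthly savings
    relative to the current class (assuming ~730 hours per month).
    """
    needed = p95(cpu_used_vcpus) * headroom
    candidates = [o for o in CATALOG if o.vcpus >= needed]
    best = min(candidates, key=lambda o: o.hourly_usd)
    monthly_savings = (current.hourly_usd - best.hourly_usd) * 730
    return best, monthly_savings

# Example: a service on "large" that rarely uses more than ~2.5 vCPUs.
usage = [1.8, 2.1, 2.4, 2.0, 2.6, 1.9, 2.2, 2.3]
target, savings = recommend(CATALOG[2], usage)
print(f"suggest {target.name}, est. ${savings:.2f}/month saved")
```

Using a percentile of observed usage rather than the mean is one way a tool could avoid over-reacting to short-lived spikes, which ties back to the point about learning workload patterns over time.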
These are great points! While AI tooling isn't there yet, how are people currently using LLMs to make infrastructure decisions?

So basically you're asking for magic, right?