I'm curious about Agentic SRE solutions. I've tried demos from a few prominent companies, and while they claim to automatically understand infrastructure and resolve issues without human intervention, it seems they mostly just summarize problems. I haven't come across any tools that actually fix issues on their own. Are there any tools out there that genuinely work in that way?
4 Answers
I experimented with a Model Context Protocol (MCP) server to diagnose cluster issues. By asking questions about how various components relate and fit together, I got some insightful results. However, I wouldn’t trust AI to make actual changes to my clusters; it's way too unreliable. Definitely wouldn't recommend it for that.
Did you build the MCP yourself or start with an existing one and adjust it to be read-only?
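For anyone wondering what "read-only" could look like in practice, here is a minimal sketch of a guard that only lets an assistant run kubectl commands whose verb is on an allowlist. The function name `is_read_only` and the allowlist are assumptions for illustration, not part of any particular MCP server. A string check like this is only a convenience layer; the real enforcement would normally be Kubernetes RBAC, for example binding the assistant's credentials to the built-in `view` ClusterRole.

```python
import shlex

# Hypothetical allowlist: kubectl verbs that only read cluster state.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs", "top", "explain"})

# Characters that suggest command chaining or redirection in a shell.
SHELL_METACHARACTERS = set(";|&$`<>")


def is_read_only(command: str) -> bool:
    """Return True only for `kubectl <read-only verb> ...` with no shell tricks."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar malformed input are rejected outright.
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    if any(SHELL_METACHARACTERS & set(token) for token in tokens):
        return False
    # Deliberately strict: the verb must come right after `kubectl`.
    return tokens[1] in READ_ONLY_VERBS


print(is_read_only("kubectl get pods -n kube-system"))
print(is_read_only("kubectl delete pod web-0"))
```

The check is intentionally conservative: a command with global flags before the verb is rejected rather than parsed, on the theory that a false "no" is cheaper than a false "yes" here.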
After spending a month creating a Cost SRE bot, I realized that users are not keen on having an AI guess their node sizes or make any changes. I ended up removing all the AI components and just went with straightforward calculations. It feels like we need more reliable tools rather than relying on Agentic solutions for infrastructure management.
Finally! Someone speaks sense!
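To make "straightforward calculations" concrete, here is a rough sketch of a deterministic node-count estimate. It is not the bot described above; the `NodeType` fields, the `nodes_needed` function, and the 20% headroom default are all assumptions for illustration. The idea is just to divide total requested CPU and memory by the usable capacity of one node and take whichever resource needs more nodes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    cpu_millicores: int  # allocatable CPU per node, in millicores
    memory_mib: int      # allocatable memory per node, in MiB


def nodes_needed(cpu_request_m: int, memory_request_mib: int,
                 node: NodeType, headroom: float = 0.2) -> int:
    """Estimate how many nodes of one type cover the summed resource requests.

    `headroom` is the fraction of each node kept free (assumed default: 20%).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_cpu = node.cpu_millicores * (1 - headroom)
    usable_mem = node.memory_mib * (1 - headroom)
    by_cpu = math.ceil(cpu_request_m / usable_cpu)
    by_mem = math.ceil(memory_request_mib / usable_mem)
    # The scarcer resource determines the node count.
    return max(by_cpu, by_mem)


# Hypothetical node type and workload totals.
node = NodeType("example-4cpu-16gib", cpu_millicores=4000, memory_mib=16384)
print(nodes_needed(10000, 20000, node))
```

This ignores bin-packing, per-pod limits, and daemonset overhead, so it is closer to a lower bound than a recommendation, but every number in the result can be traced back to an input.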
Honestly, I think it’s mostly hype. Sure, it can scan logs and configs, or generate documentation, but letting AI handle anything that involves making changes directly would just be asking for trouble.
I’ve heard that Datadog has a module focused on this kind of automation, but I haven't had the chance to try it out.

I hear you! The MCP can generate good ideas, but I’d only use it in read-only mode. Seems like the marketing around Agentic SRE is just a lot of hype. We've been using Komodor, and its automatic diagnosis and post-mortem features are actually pretty impressive.