I'm curious if anyone here has successfully integrated AI agents or models into their DevOps or SRE practices. I'm looking for anything that might enhance deployment processes or help with incident management. Would love to hear your experiences!
4 Answers
I've been working on a project that helps analyze incidents by searching through logs, metrics, and past data to suggest possible causes and fixes. It's challenging to make it genuinely trustworthy, but it's wild how much faster it can check through issues than a human could! Integrating historical data from previous incidents really helps too.
I wrote a script that pulls K8s pod crash logs, feeds them to ChatGPT to create a summary, and then sends that to Slack. It's not perfect, but I've definitely caught some little misconfigurations by having AI look at the code.
I'm starting to play around with an MCP server for a basic incident management system. It's very much in the early stages, but I'm hopeful about where it could go.
I'm not so sure about relying on AI for this stuff. The idea is to have humans you can trust in the loop, especially when things go sideways. You really don’t want to leave critical downtime to a machine, right?
Also, while I do see your point, I can't help but think an AI could at least speed up processes like filling out RCAs, even if it's not managing incidents completely.
I have my doubts about MCP being a game-changer, but parsing logs does seem like it could add some value.