Lessons Learned from Building an AI for Kubernetes Management

Asked By TechieTraveler92 On

I spent two months developing an AI-powered Site Reliability Engineering (SRE) tool aimed at optimizing Kubernetes resources, and it ended up a failure. The AI would sometimes suggest that a Redis pod needed 10GB of RAM based on nothing more than outdated blog posts, which made me realize that no responsible engineer would let an AI adjust critical production settings.

So I scrapped the AI component and built a straightforward, deterministic linter instead. It runs locally, analyzes your Helm/manifest diffs, and flags costly changes in pull requests. It's open source, fast, keeps your data private, and uses a simple calculation: (Requests - Usage) * Blended Rate.

Now I have a question for the community: I'm using a blended rate of $0.04/GB so the tool can stay offline. Is that level of accuracy adequate for rejecting a pull request, or is real cloud pricing necessary?
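For anyone curious what that calculation looks like in practice, here's a minimal sketch. The function names and the dollar threshold are my own illustrative assumptions, not the actual tool's API; the formula is the (Requests - Usage) * Blended Rate from the question.

```python
# Sketch of the cost check described above. Function names and the
# flagging threshold are hypothetical; the blended rate is the $0.04/GB
# flat assumption from the question.

BLENDED_RATE_PER_GB = 0.04  # flat $/GB, no live cloud pricing needed


def estimated_waste(requests_gb: float, usage_gb: float,
                    rate: float = BLENDED_RATE_PER_GB) -> float:
    """Cost of memory requested but not actually used: (Requests - Usage) * Rate."""
    over_provisioned = max(requests_gb - usage_gb, 0.0)
    return over_provisioned * rate


def should_flag(requests_gb: float, usage_gb: float,
                threshold_dollars: float = 0.25) -> bool:
    """Flag the PR when the estimated waste crosses a dollar threshold."""
    return estimated_waste(requests_gb, usage_gb) > threshold_dollars


# A pod requesting 10 GB while observed usage is 2 GB:
# waste = (10 - 2) * 0.04 = 0.32, so the PR gets flagged.
```

The nice property of a flat blended rate is that the check is deterministic and needs no network access; the trade-off is that it can misrank costs across instance families with very different per-GB prices.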

2 Answers

Answered By CuriousCoder99 On

I'm not an expert, but if I were to approach this, I’d start with multiple homogeneous clusters and train a model over time with that data. Perhaps after lots of evaluation against real scenarios, and once it’s consistently correct, it could be made into an agent. But, honestly, it seems risky to rely on AI trained from community posts instead of real data.

WaryWeightWatcher -

Right? 'Sprinkles AI fairy dust' sounds like a costly gamble! Proper training requires a lot of clean, representative data, and even then it seems sketchy at best. I've found I can get most of the benefit from basic arithmetic alone.

Answered By DataDrivenDude77 On

If I were creating an AI SRE, I'd approach it a bit differently. I'd give the AI access to historical data and post-mortems but keep its capabilities strictly read-only. It could suggest deterministic actions, like scaling Redis or restarting an application, with every suggestion reviewed by a human before execution. That way it never gets the full write access to production that it sounds like your version had.
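The read-only, human-in-the-loop pattern above can be sketched roughly like this. Everything here is an illustrative assumption (the `Recommendation` structure and function names are made up, not any real tool's API); the point is only that the AI layer can produce proposals but can never execute them itself.

```python
# Hedged sketch of a suggest-only AI SRE: the model proposes, a human
# approves, and the executor refuses anything unapproved. All names here
# are hypothetical.

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Recommendation:
    target: str        # e.g. "deployment/redis"
    action: str        # e.g. "scale", "restart"
    rationale: str     # should trace back to historical data / post-mortems
    approved: bool = False


def propose(target: str, action: str, rationale: str) -> Recommendation:
    """The AI side: emit a suggestion, never execute it."""
    return Recommendation(target, action, rationale)


def apply(rec: Recommendation) -> str:
    """The execution side: hard-fails on anything a human has not approved."""
    if not rec.approved:
        raise PermissionError("human approval required before applying")
    return f"applied {rec.action} to {rec.target}"


rec = propose("deployment/redis", "scale",
              "sustained memory pressure in historical metrics")
# apply(rec) raises PermissionError until a reviewer flips approved:
reviewed = replace(rec, approved=True)
```

The key design choice is that approval lives on the data, not in the AI's code path, so the executor can be audited independently of whatever model produced the suggestion.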

SkepticalSecGuy -

I see your point, but allowing access to historical data is where security teams might raise red flags. I found that sticking to basic math (Requests - Usage) works well for catching simple mistakes without the complexity of using production logs. Let's take it slow on the AI front.
