How to Set Up AI-Based Monitoring for Kubernetes at Scale?

January 23, 2026

Asked By TechieGizmo42 On January 23, 2026

I'm currently managing around 1,000 pods and finding manual monitoring unsustainable. I want to build an observability setup that runs K8sGPT as a Kubernetes CronJob to analyze cluster health and push insights to Slack. The goals are for the AI to identify issues without taking any actions, send clear summaries to Slack, update Confluence with relevant runbooks, and keep costs down by running on a schedule rather than in real time. However, I'm facing some challenges:

1. How do I effectively track cluster 'state' when pods are constantly scaling and restarting?
2. Are there any existing implementations of Model Context Protocol (MCP) servers for K8sGPT? I've heard K8sGPT can run as an MCP server, but I'm struggling to find good examples.
3. What are the best practices for AI-assisted monitoring that provides useful insights, like "15 pods OOMKilled in namespace-X," rather than just automating deployments?
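One way to get summaries like "15 pods OOMKilled in namespace-X" despite constant pod churn is to count distinct pods per (namespace, reason) instead of tracking individual pod identities. The Python sketch below assumes the input is the JSON produced by `kubectl get pods -A -o json`; the `INTERESTING_REASONS` set and the function names are illustrative choices, not part of any tool.

```python
from collections import defaultdict

# Illustrative set of container reasons worth reporting (an assumption; tune it).
INTERESTING_REASONS = {"OOMKilled", "CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull", "Error"}


def summarize_pods(pod_list: dict) -> dict[tuple[str, str], int]:
    """Count distinct pods per (namespace, reason) from `kubectl get pods -A -o json` output."""
    pods_by_key: dict[tuple[str, str], set[str]] = defaultdict(set)
    for pod in pod_list.get("items", []):
        ns = pod["metadata"]["namespace"]
        name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses") or []:
            # Current waiting reason (e.g. CrashLoopBackOff) and last termination reason (e.g. OOMKilled).
            waiting = (cs.get("state") or {}).get("waiting") or {}
            terminated = (cs.get("lastState") or {}).get("terminated") or {}
            for reason in (waiting.get("reason"), terminated.get("reason")):
                if reason in INTERESTING_REASONS:
                    pods_by_key[(ns, reason)].add(name)
    # A pod is counted once per reason even if several of its containers match.
    return {key: len(names) for key, names in pods_by_key.items()}


def render(summary: dict[tuple[str, str], int]) -> str:
    """Render the largest groups first as human-readable lines."""
    rows = sorted(summary.items(), key=lambda kv: (-kv[1], kv[0][0], kv[0][1]))
    return "\n".join(
        f"{count} pod{'s' if count != 1 else ''} {reason} in {ns}" for (ns, reason), count in rows
    )
```

For example, `render(summarize_pods(json.loads(raw)))`, where `raw` is the kubectl output, yields one line per namespace and reason. Because replacement pods get new names, grouping by namespace and reason keeps the summary meaningful even as individual pods come and go.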

I'm currently using Prometheus and Grafana, but what I need is smarter filtering, not more dashboards. Has anyone built something similar, and do you have any architectural advice for scaling it?
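For the filtering part, a cheap first step that needs no AI is to report only groups that are large enough and new or growing since the previous scheduled run. This sketch assumes counts keyed by (namespace, reason) and a hypothetical JSON state file kept between runs (for example on a PersistentVolume); `min_count` and `min_growth` are placeholder thresholds to tune.

```python
import json
from pathlib import Path

Key = tuple[str, str]  # (namespace, reason)


def load_previous(path: Path) -> dict[Key, int]:
    """Load counts saved by the previous run; an empty dict on the first run."""
    if not path.exists():
        return {}
    raw = json.loads(path.read_text())
    # Namespace names cannot contain "/", so it is a safe separator.
    return {tuple(k.split("/", 1)): v for k, v in raw.items()}


def save_current(path: Path, counts: dict[Key, int]) -> None:
    """Persist this run's counts for the next run to compare against."""
    path.write_text(json.dumps({f"{ns}/{reason}": n for (ns, reason), n in counts.items()}))


def worth_reporting(current: dict[Key, int], previous: dict[Key, int],
                    min_count: int = 5, min_growth: int = 1) -> dict[Key, int]:
    """Keep groups that are big enough AND new or growing since the last run."""
    return {
        key: n
        for key, n in current.items()
        if n >= min_count and n - previous.get(key, 0) >= min_growth
    }
```

A steady "15 pods OOMKilled in ns-x" is then reported once rather than on every run, while a new or growing problem still gets through.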

5 Answers

Answered By MonitoringPro20 On January 25, 2026

It sounds like your main issue is the manual work. Adding kube-state-metrics and Alertmanager to the mix will likely give you the insights you need without tacking on AI.
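As a sketch of what this answer suggests, the kube-state-metrics series `kube_pod_container_status_last_terminated_reason` can already produce per-namespace OOM counts through the standard Prometheus HTTP API (`/api/v1/query`). The Prometheus URL and the helper names below are assumptions. Note that the metric reflects each container's most recent termination rather than a time window, and it counts containers, not pods.

```python
import json
import urllib.parse
import urllib.request

# PromQL over kube-state-metrics: containers whose last termination was an OOM kill, per namespace.
OOM_QUERY = 'count by (namespace) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"})'


def query_prometheus(base_url: str, promql: str, timeout: float = 10.0) -> list[dict]:
    """Run an instant query against the Prometheus HTTP API and return the result vector."""
    url = f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body.get('error')}")
    return body["data"]["result"]


def format_counts(result: list[dict], what: str) -> list[str]:
    """Turn an instant-vector result into 'N containers <what> in <namespace>' lines, largest first."""
    # Each sample's value is a [timestamp, "string value"] pair.
    rows = [(r["metric"].get("namespace", "<none>"), int(float(r["value"][1]))) for r in result]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return [f"{n} containers {what} in {ns}" for ns, n in rows]
```

For example, `format_counts(query_prometheus("http://prometheus:9090", OOM_QUERY), "OOMKilled")`. The same PromQL expression can also back an alerting rule routed through Alertmanager, so the summary and the alert share one definition.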

Answered By InsightGuru29 On January 25, 2026

I've dealt with similar challenges, and here are my thoughts. First, let AI help review your existing dashboards and alerts: give it access to tools like Grafana and kubectl, and it can help you tune your setup. Second, set up a job that sends you summaries at regular intervals (a CronJob) or whenever alerts fire, and make sure it can write to Confluence for documentation. I've used both approaches with different teams; they can be a bit noisy at first, but they streamline monitoring significantly.
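The scheduled-summary approach could look roughly like this script, run as the container command of a Kubernetes CronJob. It assumes the `k8sgpt` CLI is installed and configured in the image, that `k8sgpt analyze --output json` returns a top-level `results` list whose entries carry `kind` and `name` (`namespace/name`) fields, and that a Slack incoming-webhook URL is provided in a `SLACK_WEBHOOK_URL` environment variable (a name chosen here for illustration). Check the JSON shape against your K8sGPT version before relying on it.

```python
import json
import os
import subprocess
import urllib.request
from collections import Counter


def run_k8sgpt() -> dict:
    """Run the K8sGPT analyzers once and parse the JSON report (assumes k8sgpt is on PATH)."""
    out = subprocess.run(
        ["k8sgpt", "analyze", "--output", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def build_summary(report: dict, max_lines: int = 10) -> str:
    """Group results by (namespace, kind) so Slack gets a short digest, not one line per object."""
    results = report.get("results") or []
    if not results:
        return "K8sGPT: no problems detected."
    groups: Counter = Counter()
    for r in results:
        name = r.get("name", "")
        # Assumed "namespace/name" format; cluster-scoped objects (e.g. Nodes) have no slash.
        ns = name.split("/", 1)[0] if "/" in name else "<cluster>"
        groups[(ns, r.get("kind", "?"))] += 1
    lines = [f"K8sGPT found {len(results)} problem(s):"]
    for (ns, kind), n in sorted(groups.items(), key=lambda kv: (-kv[1], kv[0]))[:max_lines]:
        lines.append(f"- {n} x {kind} in {ns}")
    return "\n".join(lines)


def post_to_slack(webhook_url: str, text: str) -> None:
    """POST a plain-text message to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10):
        pass


def main() -> None:
    post_to_slack(os.environ["SLACK_WEBHOOK_URL"], build_summary(run_k8sgpt()))
```

The job's entry point would call `main()`. Adding `--explain` to the `k8sgpt analyze` arguments asks the configured AI backend to annotate each result, which costs tokens per run; leaving it off keeps the job purely analyzer-based. Either way, the script only reads cluster state and posts a message, so it takes no actions on the cluster.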

RealisticAdmin11 - January 25, 2026

This feels a bit excessive for the need at hand, though.

Answered By CautiousCoder93 On January 24, 2026

You really don’t need to throw AI on every issue, especially not for infrastructure management. There's a lot of foundational work to do before jumping to complex solutions like AI.

AIEnthusiast88 - January 25, 2026

I think AI has potential! It can bring new capabilities and insights. Don’t dismiss it just yet. I'm all in for using AI wherever possible.

Answered By SkepticalDev007 On January 24, 2026

I think you might be overcomplicating things with AI. Point 3 kind of shows that you just need better alerting, not some fancy AI setup. A solid Prometheus and Grafana configuration can give you the insights you need without adding tech debt to your environment. Get proper monitoring in place first; worrying about what AI can do before then is likely to become a headache.

Answered By AI_Curious On January 23, 2026

Why not just stick to traditional monitoring? Adding AI might complicate things more than necessary here.
