I'm currently managing around 1,000 pods, and manual monitoring has become unsustainable. I'm looking to build an observability solution that runs K8sGPT as a CronJob to analyze cluster health and push insights to Slack. The goal is for the AI to identify issues without taking any action, send clear summaries to Slack, update Confluence with relevant runbooks, and keep costs down by running on a schedule rather than in real time. However, I'm facing some challenges:
1. How do I effectively monitor the 'state' in Kubernetes with all the dynamic scaling and restarting?
2. Are there any existing implementations of Model Context Protocol (MCP) servers for K8sGPT? I've heard it can host an MCP server, but I'm struggling to find good examples.
3. What are the best practices for AI-assisted monitoring that provides useful insights, like "15 pods OOMKilled in namespace-X," rather than just automating deployments?
I'm currently using Prometheus and Grafana, but I need more intelligent filtering rather than just more dashboards. Has anyone built something similar, and do you have any architectural advice for scaling this solution?
5 Answers
It sounds like your main issue is the manual aspect. Adding kube-state-metrics and Alertmanager to the mix will likely give you the insights you need without tacking on AI.
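To make that concrete, here is a minimal sketch of how a summary like "15 pods OOMKilled in namespace-X" could be produced without any AI. It assumes kube-state-metrics is already scraped by Prometheus. The function names and the Prometheus URL in the usage note are placeholders I made up for illustration. Note that the metric reports each container's last termination reason, so this is a snapshot of current state and not a count of events in a time window.

```python
import json
import urllib.parse
import urllib.request

# Assumption: kube-state-metrics is installed and scraped, so this metric exists.
# The query counts distinct pods per namespace that have at least one container
# whose most recent termination reason was OOMKilled.
OOM_QUERY = (
    "count by (namespace) ("
    "max by (namespace, pod) ("
    'kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1'
    "))"
)


def query_prometheus(base_url, promql):
    """Run an instant query against the Prometheus HTTP API and return the result vector."""
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]


def summarize_oomkills(result):
    """Turn an instant-vector result into human-readable lines, largest count first."""
    counts = []
    for sample in result:
        namespace = sample["metric"].get("namespace", "unknown")
        # Each sample value is [unix_timestamp, "<number as string>"].
        count = int(float(sample["value"][1]))
        if count > 0:
            counts.append((count, namespace))
    # Sort by count descending, then namespace name for a stable order.
    counts.sort(key=lambda item: (-item[0], item[1]))
    lines = []
    for count, namespace in counts:
        noun = "pod" if count == 1 else "pods"
        lines.append(f"{count} {noun} OOMKilled in {namespace}")
    return lines
```

Usage would look something like `summarize_oomkills(query_prometheus("http://prometheus.monitoring.svc:9090", OOM_QUERY))`, where the URL is whatever your in-cluster Prometheus service happens to be. The same query could also back an Alertmanager rule instead of a script.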
I've dealt with similar challenges, and here are my thoughts. First, let AI assist in reviewing your existing dashboards and alerts. Give it access to tools like Grafana and kubectl, and it can help optimize your setup. Second, implement a CronJob that sends you summaries whenever alerts fire or at regular intervals. Make sure this setup can also connect to Confluence for documentation purposes. I've used both approaches for different teams; they can be a bit noisy at first, but they streamline monitoring significantly.
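For the CronJob part, a rough sketch of the "send a summary to Slack" step might look like the following. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the `max_lines` cap, and the idea of passing findings in as a list of strings are illustrative choices, not anything K8sGPT prescribes. Capping the number of lines is one simple way to deal with the early noise.

```python
import json
import urllib.request


def build_slack_payload(title, findings, max_lines=10):
    """Build an incoming-webhook payload, truncating long finding lists to limit noise."""
    shown = findings[:max_lines]
    hidden = len(findings) - len(shown)
    if not findings:
        body = "No findings in this run."
    else:
        body = "\n".join(f"- {line}" for line in shown)
        if hidden > 0:
            body += f"\n... and {hidden} more"
    # Slack renders *text* as bold in webhook messages.
    return {"text": f"*{title}*\n{body}"}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status
```

In the CronJob container you would read the webhook URL from a Secret-backed environment variable and call `post_to_slack(url, build_slack_payload("Cluster summary", findings))`. Keeping the payload builder separate from the HTTP call makes it easy to test the formatting without hitting Slack.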
You really don’t need to throw AI at every issue, especially not for infrastructure management. There's a lot of foundational work to do before jumping to complex solutions like AI.
I think AI has potential! It can bring new capabilities and insights. Don’t dismiss it just yet. I'm all in for using AI wherever possible.
I think you might be overcomplicating things with AI. Point 3 suggests that what you really need is better alerting, not a fancy AI setup. A solid Grafana and Prometheus configuration can give you those insights without adding more tech debt to your environment. Focus on getting proper monitoring in place first; worrying about what AI can do before that tends to become a headache.
Why not just stick to traditional monitoring? Adding AI might complicate things more than necessary here.

This feels a bit excessive for the need at hand, though.