Hey folks! I've been working as a DevOps engineer for around 5-6 years across two companies, and my main job has been developing auto-remediation (self-healing) scripts. These scripts kick in when monitoring tools such as Grafana or Datadog detect issues like CPU spikes or low disk space. Up to now, I've been using Python, Go, and Shell for scripting, paired with tools like Rundeck, Jenkins, and n8n for orchestration. My question: does anyone know of a more specialized tool, something dedicated to automatically acting when a monitoring metric exceeds a threshold?
4 Answers
I’d suggest using webhooks with an orchestrator. It's often better to stick with general-purpose solutions than with something built just for this specific purpose.
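To make the webhook idea concrete, here is a minimal sketch of a receiver that maps an incoming alert to a remediation script, using only the Python standard library. The payload shape (`alert` and `status` fields), the alert names, and the script paths are all assumptions made up for illustration; the real payloads from Grafana or Datadog look different, so the parsing would need adjusting.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from alert name to a remediation command.
# Both the alert names and the script paths are placeholders.
REMEDIATIONS = {
    "disk_space_low": ["/opt/remediation/clean_tmp.sh"],
    "cpu_spike": ["/opt/remediation/restart_worker.sh"],
}


def pick_remediation(payload):
    """Return the command for a firing alert, or None if nothing should run."""
    # Assumed payload shape: {"alert": "<name>", "status": "firing" | "resolved"}
    if not isinstance(payload, dict) or payload.get("status") != "firing":
        return None
    return REMEDIATIONS.get(payload.get("alert"))


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        command = pick_remediation(payload)
        if command is not None:
            # Fire and forget; an orchestrator would queue, log, and audit this.
            subprocess.Popen(command)
        # 202 = remediation started, 204 = alert received but nothing to do.
        self.send_response(202 if command else 204)
        self.end_headers()


def serve(host="127.0.0.1", port=8080):
    """Block forever, handling alert webhooks on the given address."""
    HTTPServer((host, port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener; you would then point the monitoring tool's webhook notification at that address. In practice the orchestrator (Rundeck, Jenkins, n8n) already plays this role, which is the argument for staying general-purpose.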
You might want to check out Monit. It's an older tool, but it does the job well. Just a heads up, though; it can sometimes mask serious issues instead of fixing them. It’s more about managing symptoms.
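One way to keep automatic fixes from hiding a real problem is to give each check a remediation budget: fix automatically a few times, then stop and escalate to a human. The sketch below is generic Python, not Monit configuration, and the class name, method name, and default limits are all hypothetical.

```python
import time
from collections import deque


class RemediationBudget:
    """Allow at most max_runs automatic fixes per window, then escalate."""

    def __init__(self, max_runs=3, window_seconds=3600.0):
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self._runs = deque()  # timestamps of recent automatic fixes

    def decide(self, now=None):
        """Return "remediate" or "escalate" for a newly detected failure."""
        now = time.monotonic() if now is None else now
        # Forget fixes that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            # The same fix keeps being needed: likely a symptom, not a cure.
            return "escalate"
        self._runs.append(now)
        return "remediate"
```

A self-healing script would call `decide()` before acting and page someone instead of restarting when it gets `"escalate"` back, so a service that needs restarting every ten minutes shows up as an incident rather than staying quietly patched.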
That's a valid point! I mainly want it for maintenance tasks, and it’s nice to have that safety net. I’ve used Monit before, but only on a smaller scale—never figured out how to scale it without chaos.
I've been looking into Tekton (tekton.dev) and Argo Workflows paired with Argo Events for triggering tasks. But it really depends on the kind of infrastructure you're managing with your scripts.
Isn’t Tekton mostly for building and deploying apps? I haven’t used it, so I’m not sure. It kind of feels like a Jenkins alternative to me. My infrastructure involves around 1300 virtual hosts, mostly CentOS variants (including some outdated CentOS 6 hosts), plus Windows servers.
I've heard great things about StackStorm, but honestly, I’m reluctant to install yet another job runner. It’s supposed to be solid, but maintenance can be a hassle.
StackStorm is impressive but definitely has its headaches. It’s rock solid for automating tasks, but when I last checked, the Kubernetes deployment wasn’t production-ready, so we just ran it on a powerful VM.
Never knew about it before. It sounds more like an add-on for monitoring tools than a standalone solution, but I'm interested in giving it a try!
So, would that mean having a library of pre-made scripts that are tried and tested?