I've just started a new job in healthcare IT, and I noticed the team is manually monitoring over five servers every 30 minutes. They send emails to management with screenshots from one or two of these servers, and they even log in manually to two of them to check if they're working. It's really exhausting! They also monitor two other servers via Grafana but still have to email reports. I'm looking to streamline this process by automating the Grafana aspect first. I'm curious about the best ways to automate checking login status, load status, and URL status to send emails only when there's an issue. Any suggestions?
5 Answers
Before automating, it's a good idea to find out why the team is sending reports every 30 minutes. There might be a good reason behind it, like a past issue that caused them to be cautious. Understanding the 'why' can help you build a more effective automation system.
If you're on a budget, you might want to look into Zabbix or Checkmk. They're open source and can integrate well with Grafana. Just know that Zabbix has a bit of a learning curve, but it has great dashboard capabilities once you get the hang of it.
Honestly, you can automate pretty much everything you're doing manually right now. For instance, you could write a small script (Python, PowerShell, or whatever the team already knows) and schedule it to check the servers every 30 minutes. It's also worth looking into tools like PRTG or Zabbix, which can raise alerts and generate reports without all that manual input.
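To make the scripting idea concrete, here is a minimal sketch using only the Python standard library. It checks a list of URLs and sends one email only when something fails. The URLs, SMTP host, and addresses are placeholders I made up, and it assumes an internal mail relay that accepts unauthenticated mail, so adjust it to your environment.

```python
import smtplib
import urllib.error
import urllib.request
from email.message import EmailMessage

# Hypothetical values: replace with your real endpoints and mail settings.
URLS = [
    "https://server1.example.internal/health",
    "https://server2.example.internal/login",
]
SMTP_HOST = "smtp.example.internal"
SENDER = "monitor@example.internal"
RECIPIENT = "oncall@example.internal"


def check_url(url, timeout=10.0):
    """Return None if the URL responds successfully, else a short error string."""
    try:
        # urlopen raises HTTPError for 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout):
            return None
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}"
    except OSError as exc:  # covers URLError, timeouts, refused connections
        return f"unreachable: {exc}"


def collect_failures(urls, checker=check_url):
    """Run the checker on every URL and keep only the ones that failed."""
    failures = {}
    for url in urls:
        error = checker(url)
        if error is not None:
            failures[url] = error
    return failures


def build_alert(failures, sender, recipient):
    """Build a plain-text alert email listing every failing check."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {len(failures)} check(s) failing"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{url}: {err}" for url, err in failures.items()))
    return msg


def main():
    failures = collect_failures(URLS)
    if not failures:
        return  # everything is healthy, so no email is sent
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(build_alert(failures, SENDER, RECIPIENT))
```

Call `main()` from cron or Task Scheduler on whatever interval you need. Passing the checker in as a parameter lets you swap in a login or load check later without touching the email logic.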
If you still need to send screenshots, consider driving a headless browser such as Chrome or Edge to capture them automatically. A tool like Playwright or Puppeteer can log in and take the screenshot for you, which can really save you time. Just be careful not to send empty or incorrect data to management.
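A rough sketch of the headless-browser approach, assuming Playwright for Python is installed (`pip install playwright`, then `playwright install chromium`). The form selectors and the size threshold are guesses on my part, since every login page is different, so treat them as placeholders.

```python
from pathlib import Path


def capture_dashboard(url, out_path, username, password):
    """Log in and save a full-page screenshot. Selectors are hypothetical."""
    # Imported here so the validity check below works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            page.fill("#username", username)      # assumed selector
            page.fill("#password", password)      # assumed selector
            page.click("button[type=submit]")     # assumed selector
            page.wait_for_load_state("networkidle")
            page.screenshot(path=out_path, full_page=True)
        finally:
            browser.close()


def screenshot_looks_valid(path, min_bytes=10_000):
    """Crude guard against sending a missing or nearly blank capture."""
    file = Path(path)
    return file.is_file() and file.stat().st_size >= min_bytes
```

The size check is only a heuristic: a page that failed to render usually produces a very small image, but a large file does not prove the data shown is correct.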
Have you thought about just using Grafana's alerting features? It seems odd that you're still doing manual checks since you're already set up with Grafana. Just add all your servers there and set up alerts to notify you when there are issues instead of emailing screenshots.
Yeah, the servers are already added, but we're having backend issues that prevent Grafana from creating tickets when thresholds are crossed, so we still have to watch them closely.