I'm part of a small mechanical engineering start-up with just five people, and we handle a lot of our IT ourselves. We've got a mix of in-house servers, colocated servers, rented VPS, dedicated servers, and some cloud stuff at AWS as well. Right now, we're using Prometheus and Grafana on one of our servers for monitoring, but honestly, it's starting to feel a bit chaotic, especially with everything else we have going on: storage servers, database servers, internal applications, and two critical web servers we need to keep online 24/7. We're also currently using Uptime Kuma but we're not entirely happy with it. We know we need to clean things up eventually and hire a sysadmin when the budget allows. For now, we're looking for your recommendations on: 1) Should we stick with Prometheus, Grafana, and Uptime Kuma, or is there a better solution for us? 2) Any good resources or courses on infrastructure monitoring? 3) Best practices we should follow?
5 Answers
Check out acumenlogs.com! They offer 10 free uptime monitors, synthetic monitoring, heartbeat checks, and more. It might take some of the pain out of your current monitoring setup.
Sticking with Prometheus is a solid move, but if you're open to exploring, Zabbix is an excellent open-source alternative. I've been diving into it recently and it's quite powerful for monitoring needs. Just be prepared for a bit of a learning curve.
You might want to consider dropping Uptime Kuma and focusing on Prometheus alone. It covers more ground and will likely serve you better. Your main issue seems to be the complexity of managing all those servers. Have you thought about automating the process? Cleaning up the current system will help more than switching tools. Also, check out these reads on best practices: 1) Monitoring Distributed Systems, 2) Practical Alerting, and 3) the RED Method (Rate, Errors, Duration).
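To make the RED Method a bit more concrete, here is a minimal Python sketch of what it measures for one service over one time window: request rate, error ratio, and request duration. This is only an illustration, not how Prometheus computes anything. The `Request` record, the `red_summary` function, and the choice of "status 500 and above counts as an error" are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record, for illustration only)."""
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code returned to the client


def red_summary(requests, window_seconds):
    """Summarise one time window as Rate, Errors, Duration.

    Assumption: a status code of 500 or above counts as an error.
    Duration is reported as the nearest-rank 95th percentile.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_duration_s": None}

    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)

    # Nearest-rank percentile, using integer math to avoid float rounding.
    n = len(durations)
    rank = max(1, (95 * n + 99) // 100)  # ceil(0.95 * n)

    return {
        "rate": n / window_seconds,            # requests per second
        "error_ratio": errors / n,             # fraction of failed requests
        "p95_duration_s": durations[rank - 1], # 95% of requests were this fast or faster
    }
```

Calling `red_summary(window_requests, 60)` on the requests seen in the last minute gives three numbers per service, which is usually enough for a first dashboard for the two critical web servers.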
It really depends on what you need. Zabbix is a fantastic, free, open-source option that does a great job, but it does take some time to learn it well. Definitely worth considering for your monitoring setup!
Prometheus and Grafana are solid choices for your needs, but keeping the setup organized takes effort. Collect metrics for everything you have storage space for, but avoid alerting on every little thing. Alert on actionable metrics tied to user experience, not just CPU usage or uptime. Latency monitoring in particular can help you pinpoint capacity issues as they arise. Also consider security and audit logging, depending on your niche; that alone can be a full-time job.
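As a rough illustration of "don't alert on every little thing", the Python sketch below fires only when latency stays above a threshold for several checks in a row, so a single slow request never pages anyone. Prometheus alerting rules have a `for` clause that serves a similar purpose. The function name `should_alert` and the example numbers are assumptions for this sketch, not recommended values.

```python
def should_alert(latencies_ms, threshold_ms, sustained_checks):
    """Return True only if the most recent checks ALL exceed the threshold.

    latencies_ms: latency samples in milliseconds, oldest first.
    threshold_ms: latency above which users are assumed to notice.
    sustained_checks: how many consecutive bad samples are required.
    """
    if sustained_checks < 1:
        raise ValueError("sustained_checks must be at least 1")
    if len(latencies_ms) < sustained_checks:
        # Not enough data yet to call the problem sustained.
        return False
    recent = latencies_ms[-sustained_checks:]
    return all(sample > threshold_ms for sample in recent)
```

For example, `should_alert([100, 900, 120], 500, 3)` stays quiet because the spike did not last, while `should_alert([600, 700, 800], 500, 3)` fires because all three recent checks were slow.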
Thanks for those links! And yes, we know we need to clean up the mess!