With so many potential metrics to monitor (uptime, response times, CPU usage, memory, error rates), I'm curious what you all actually track on a daily basis for website and server monitoring. If you had to narrow it down to a few key metrics that have genuinely helped you catch issues early, what would they be? I'd also love to hear about any metrics you've stopped tracking because they felt irrelevant, or any that sounded crucial but didn't help in practice. I'm trying to keep things straightforward and focus on what truly matters in a production environment.
3 Answers
It's smart to think in terms of alerts and actions rather than metrics alone: a metric is only worth tracking if a change in it tells you to do something. For example, if you see high CPU usage for a few minutes every night, instead of panicking you might recognize that it's just a scheduled task and nothing to worry about. For a simple health check, a quick request to a static page or a specific endpoint can give you a solid picture of overall system health without complicating things too much.
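As a rough illustration of that kind of endpoint check, here is a minimal Python sketch using only the standard library. The function name `check_health`, the 5-second timeout, and the `/health` URL in the usage note are assumptions made for the example, not anything from a specific monitoring tool.

```python
import time
import urllib.error
import urllib.request


def check_health(url, timeout=5.0, opener=urllib.request.urlopen):
    """Request `url` once and report status code and elapsed time.

    `opener` is injectable so the check can be exercised without a network;
    by default it is urllib.request.urlopen.
    """
    start = time.monotonic()
    status = None
    error = None
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx; the code is still useful.
        status = exc.code
        error = str(exc)
    except OSError as exc:
        # Covers URLError, timeouts, refused connections, DNS failures.
        error = str(exc)
    elapsed = time.monotonic() - start
    ok = status is not None and 200 <= status < 300
    return {"ok": ok, "status": status, "seconds": elapsed, "error": error}
```

Something like `check_health("https://example.com/health")` run from a scheduler would then give one ok/not-ok signal plus a response time to alert on. What counts as "too slow" is a judgment call for your own service.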
For web servers, I usually focus on average response time and 99th percentile response time, alongside requests per minute and Apdex score. For databases, metrics like page life expectancy (on SQL Server) and wait times broken down by wait type are important. Together these cover both performance and reliability.
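To show why the 99th percentile and Apdex are worth tracking next to the average, here is a small Python sketch. It uses the nearest-rank percentile method and the standard Apdex formula (satisfied requests plus half of the tolerating ones, divided by the total, where "tolerating" means between T and 4T). The sample data and the 0.5-second threshold are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def apdex(values, threshold):
    """Apdex score in [0, 1] for response times, given target threshold T."""
    if not values:
        raise ValueError("apdex() needs at least one sample")
    satisfied = sum(1 for v in values if v <= threshold)
    tolerating = sum(1 for v in values if threshold < v <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(values)


# Hypothetical response times in seconds: 98 fast requests and 2 slow ones.
samples = [0.1] * 98 + [0.9, 3.0]
mean = sum(samples) / len(samples)
p99 = percentile(samples, 99)
score = apdex(samples, threshold=0.5)
```

With this data the mean stays low because the fast requests dominate, while the 99th percentile lands on one of the slow requests. That is the argument for watching both.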
From a Site Reliability Engineering standpoint, I keep an eye on the four golden signals: Latency, Errors, Traffic, and Saturation. Depending on your architecture, for example if you're using Redis for caching or Kafka for event streaming, you may need to add metrics specific to those components.
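As a sketch of how those four signals might be derived from one window of request data, assuming you can get a list of (duration, status code) records and some measure of capacity in use: the `Request` type, the field names, and the choice of worker slots as the saturation measure are all hypothetical, and median latency is used only to keep the example short (a tail percentile is usually more telling).

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, in_use, capacity):
    """Summarize one time window as latency, errors, traffic, and saturation."""
    total = len(requests)
    failed = sum(1 for r in requests if r.status >= 500)
    # Latency is taken over successful requests so fast failures don't flatter it.
    ok_durations = [r.duration_s for r in requests if r.status < 500]
    return {
        "latency_s": median(ok_durations) if ok_durations else None,
        "error_rate": failed / total if total else 0.0,
        "traffic_rps": total / window_s,
        "saturation": in_use / capacity,
    }


# Hypothetical 2-second window with 8 of 10 worker slots busy.
window = [Request(0.1, 200), Request(0.2, 200), Request(0.3, 500), Request(0.4, 200)]
signals = golden_signals(window, window_s=2.0, in_use=8, capacity=10)
```

For Redis or Kafka the same shape would apply, but with saturation measured as something like memory used or consumer lag rather than worker slots.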
