Hey folks! I recently had a bit of a scare with a data processing script on one of my servers. It hung for hours because of a slow external API, but it never actually errored out, so it just kept consuming resources until someone noticed it was stuck. A basic OK/FAIL check from a tool like Healthchecks.io wouldn't have picked this up, since the script didn't fail outright. How do you monitor for situations like this? A few options I've thought of: Do you write custom wrapper scripts to limit execution time? Is there a built-in feature in the tools you use (like Cronitor) that helps with this? Or do you send metrics to Prometheus and set up alerts based on those?
6 Answers
Just prefix your command with the `timeout` command, and you'll be set! This way, if it hangs, it’ll get killed after the specified time.
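For anyone who wants a bit more than the bare prefix, here is a minimal sketch of a wrapper around GNU coreutils `timeout`. The helper name `run_with_limit` is made up for illustration; the 124 exit status and the `-k` option are standard `timeout` behavior.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_limit is a hypothetical helper, not part of any tool.
# Usage: run_with_limit DURATION COMMAND [ARGS...]
run_with_limit() {
    local limit="$1"
    shift
    local status=0
    # -k 10: if the command ignores SIGTERM, send SIGKILL 10 seconds later.
    timeout -k 10 "$limit" "$@" || status=$?
    # timeout exits with 124 when the limit was reached
    # (137 if it had to fall back to SIGKILL).
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}: $*" >&2
    fi
    return "$status"
}
```

In a crontab the same idea fits on one line, e.g. `0 * * * * timeout -k 10 30m /path/to/your-script.sh` (the path and the 30-minute limit are placeholders). Because a timed-out job now exits non-zero, a plain OK/FAIL check can see it.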
Honestly, the best way to handle this is by fixing your script and making sure it includes timeouts.
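As a rough illustration of what "include timeouts" could look like, assuming the script talks to the API with `curl` (the original post doesn't say what it uses): the helper name `fetch_with_timeout` and the 10 s / 60 s limits are placeholders, while the flags themselves are standard curl options.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_with_timeout is a hypothetical helper; limits are examples.
fetch_with_timeout() {
    local url="$1"
    # --connect-timeout: give up if the connection phase takes longer than 10 s.
    # --max-time: cap the whole transfer at 60 s, however slowly data trickles in.
    # --fail: exit non-zero on HTTP 4xx/5xx instead of printing the error page.
    curl --fail --silent --show-error \
         --connect-timeout 10 --max-time 60 \
         "$url"
}
```

With both limits set, a stalled API call ends in a non-zero exit status after at most a minute instead of hanging for hours, so the script fails loudly and ordinary failure alerts can fire.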
I keep an eye on the build queue in Jenkins. If it isn’t empty for more than 5 minutes, I get an alert. When that happens, I check the average runtime of the last 50 runs, double it, and run my command with that value as the limit, e.g. `timeout -v 300 $command`. This ensures the command gets killed if it exceeds the limit, which then triggers another notification. Depending on how often this happens, we either ignore it, extend the timeout, or fix the underlying issue.
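The "double the recent average" rule can be sketched in a few lines, assuming the runtimes are available as a plain text file with one duration in seconds per line. That file, the helper name `compute_limit`, and the 300-second fallback for an empty history are all assumptions for illustration; pulling the numbers out of Jenkins is not shown.

```shell
#!/usr/bin/env bash
# Sketch only: compute_limit is a hypothetical helper.
# Input: a file with one runtime in seconds per line, oldest first.
# Output: twice the average of the last 50 entries, truncated to whole seconds.
compute_limit() {
    local log="$1"
    tail -n 50 "$log" | awk '
        { sum += $1 }
        END {
            if (NR > 0) printf "%d\n", 2 * sum / NR
            else print 300   # assumed default when there is no history yet
        }'
}
```

It could then feed the limit straight into the command from the answer, along the lines of `timeout -v "$(compute_limit durations.log)" $command`.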
You might want to consider switching from cron to a more capable tool like Apache Airflow or Control-M. These options can provide you with better monitoring features, although they can be expensive.
Healthchecks can actually catch this situation if you send a start signal. If your script starts running and the grace period expires without a success signal being sent, Healthchecks will alert you. Check out their documentation for more!
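A minimal sketch of the start-signal pattern, using the ping endpoints that Healthchecks documents (`/start` to mark the beginning of a run, and `/<exit-status>` to report the result). The UUID in `HC_URL` is a placeholder and `run_monitored` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch only: replace HC_URL with the ping URL of your own check.
HC_URL="https://hc-ping.com/your-uuid-here"

# run_monitored is a hypothetical helper. Usage: run_monitored COMMAND [ARGS...]
run_monitored() {
    local status=0
    # Tell Healthchecks the job has started. -m 10 and --retry 5 keep the ping
    # itself from hanging; "|| true" lets the job run even if the ping fails.
    curl -fsS -m 10 --retry 5 -o /dev/null "$HC_URL/start" || true
    "$@" || status=$?
    # Appending the exit status reports success for 0 and failure otherwise.
    curl -fsS -m 10 --retry 5 -o /dev/null "$HC_URL/$status" || true
    return "$status"
}
```

If the wrapped command hangs, the second ping never arrives, and the check alerts once the grace period runs out. Combining this with `timeout` on the command covers both cases: the job gets killed, and the kill is reported as a failure.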
I personally use Healthchecks.io on my self-hosted setup. I rely on a 'Late' notification to keep track of my mirror scripts for a large public Linux distribution. It’s been a game-changer for maintaining consistency!