I run a series of small ECS tasks and I'm having trouble monitoring whether they start up at all. I have CloudWatch alarms for errors, but nothing that fires when a service fails to start. I enabled Container Insights to watch RunningTaskCount, but the cost is high given the large number of small Fargate tasks I run. I can't find a way to filter those metrics down to reduce cost, and ECS health checks also seem to need Container Insights to be useful. I'm looking for a way to be notified when my tasks aren't running properly without incurring heavy costs. Any suggestions?
6 Answers
Every task sends a datapoint for metrics like CPUUtilization at the service level. You can use the SampleCount of those metrics as a simple check for how many tasks are running. It's not perfect, but it can be a quick proxy.
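As a rough sketch of that idea, the function below builds the arguments for a CloudWatch alarm on the SampleCount of the service-level CPUUtilization metric. It assumes each running task contributes about one datapoint per 60-second period; the function name, alarm name, and the cluster, service, and topic values are made up for illustration.

```python
def task_count_alarm_params(cluster, service, min_tasks, topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Assumption: SampleCount over a 60-second period roughly equals
    the number of tasks that reported CPUUtilization in that minute.
    """
    return {
        "AlarmName": f"{service}-too-few-tasks",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "SampleCount",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": min_tasks,
        "ComparisonOperator": "LessThanThreshold",
        # With zero running tasks there are no datapoints at all,
        # so missing data has to count as a breach.
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


params = task_count_alarm_params(
    "my-cluster", "my-service", 2,
    "arn:aws:sns:us-east-1:123456789012:alerts",  # placeholder ARN
)
```

With boto3 installed and credentials configured, the dictionary can be passed along as `boto3.client("cloudwatch").put_metric_alarm(**params)`. These are the standard AWS/ECS metrics, so this should not need Container Insights.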
I set up an EventBridge rule that listens for ECS task state changes and routes those events to a Lambda function. The Lambda checks the exit codes of the stopped tasks and sends me an email if something's not right. It's been working well for monitoring unexpected shutdowns!
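A minimal sketch of what such a Lambda might look like, assuming the rule matches "ECS Task State Change" events. The helper name `failed_containers` is hypothetical, and the notification step is left as a `print` that you would swap for SNS or SES; this is not the poster's actual code.

```python
def failed_containers(event):
    """Return the containers of a stopped task that did not exit with 0."""
    if event.get("detail-type") != "ECS Task State Change":
        return []
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return []
    failures = []
    for container in detail.get("containers", []):
        exit_code = container.get("exitCode")
        # A missing exit code is treated as a failure here, on the
        # assumption that the container never got as far as running.
        if exit_code != 0:
            failures.append({
                "name": container.get("name"),
                "exitCode": exit_code,
                "stoppedReason": detail.get("stoppedReason"),
            })
    return failures


def handler(event, context):
    """Lambda entry point: report any failed containers."""
    failures = failed_containers(event)
    if failures:
        # Placeholder: replace with an SNS publish or email of your choice.
        print(f"ECS task failure: {failures}")
    return {"failures": failures}
```

The rule itself would filter on source `aws.ecs` and detail-type `ECS Task State Change`, so the function only runs when a task changes state rather than on a polling schedule.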
If your service has a target group, you might want to monitor the healthy host count. It could give you insights into whether tasks are starting correctly or not.
We actually created a separate cluster and service with only one task. When we deploy changes, we first test it there before pushing it to our main production cluster. This way, we can spot startup issues in isolation and deal with them accordingly.
You should make sure your health checks are effective, but also remember that autoscaling can help manage this. If you're going to alarm, consider how you want to act on those alarms. If you're just going to replace a failing task, you might not need to get too bogged down in the details.
Don't underestimate the power of logging! You could log when your tasks wake up, and then create an alarm based on the expected number of logs. This can help you identify when tasks fail to start or behave unexpectedly without needing to rely solely on metrics.
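One way this could look, as a sketch only: have each task log a fixed marker line on startup, then turn matching log lines into a metric with a CloudWatch Logs metric filter. The marker string, filter name, metric name, and namespace below are all invented for the example.

```python
def startup_metric_filter_params(log_group, marker="TASK_STARTED"):
    """Build keyword arguments for CloudWatch Logs put_metric_filter.

    Assumption: every task writes a line containing `marker` once
    when it starts up.
    """
    return {
        "logGroupName": log_group,
        "filterName": "task-startup-count",   # hypothetical name
        "filterPattern": f'"{marker}"',       # match the quoted term
        "metricTransformations": [
            {
                "metricName": "TaskStartups",      # hypothetical metric
                "metricNamespace": "Custom/ECS",   # hypothetical namespace
                "metricValue": "1",                # count 1 per matching line
            }
        ],
    }


filter_params = startup_metric_filter_params("/ecs/my-service")
```

With boto3 this would go to `boto3.client("logs").put_metric_filter(**filter_params)`. An alarm on the Sum of the resulting metric, with missing data treated as breaching, would then fire when fewer startups are logged than expected.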
