I'm using AWS ECS Fargate to run my Express (Node.js/TypeScript) web app, with 2 tasks running on 1 vCPU each. I've set a scaling alarm to trigger when CPU utilization goes above 40%. However, I've noticed that when there's a spike in traffic, it takes about 3 minutes for the alarm to reach the ALARM state, even though multiple data points exceed the threshold. Can anyone explain why this delay happens and what I might do to speed up the scaling process?
2 Answers
The delay you're experiencing comes from two things stacking up. First, ECS/Fargate publishes CPU and memory metrics at one-minute granularity, and CloudWatch can take another minute or two to make those data points available. Second, the alarm only transitions to ALARM after the configured number of evaluation periods has breached the threshold, so with a one-minute period and something like 3-out-of-3 datapoints to alarm, roughly three minutes is expected even when every datapoint is above 40%. Here are some tips to make your scaling faster:
- Consider setting the alarm period to 30 seconds for metrics that support it, such as ECS CPU via Container Insights; CloudWatch only allows sub-minute alarm periods on high-resolution metrics.
- Implement Step Scaling with multiple thresholds so bigger spikes trigger bigger scale-outs (see the first sketch after this list).
- Or try Target Tracking Scaling, which keeps CPU utilization near your 40% target and reacts to load changes without you having to manage alarms yourself (second sketch below).
- Enable Container Insights for quicker and more detailed data, although it may slightly increase your CloudWatch costs.
- If you anticipate traffic surges at predictable times, such as a morning login rush, pre-warm capacity manually or through scheduled scaling (third sketch below).
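For the step scaling idea, here is a minimal sketch, assuming you define the infrastructure with the AWS CDK (v2) and that `service` is your existing `FargateService` construct; the thresholds and capacities are placeholders to adapt:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as appscaling from 'aws-cdk-lib/aws-applicationautoscaling';

// `service` is assumed to be your existing ecs.FargateService.
const scaling = service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });

scaling.scaleOnMetric('CpuStepScaling', {
  metric: service.metricCpuUtilization({ period: cdk.Duration.minutes(1) }),
  adjustmentType: appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
  scalingSteps: [
    { upper: 30, change: -1 }, // scale in when average CPU drops below 30%
    { lower: 40, change: +1 }, // add one task above your 40% threshold
    { lower: 70, change: +3 }, // add three tasks at once during sharp spikes
  ],
  // Needs a recent aws-cdk-lib; reacts after a single breaching datapoint.
  evaluationPeriods: 1,
});
```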
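Target tracking is even less to maintain, since the alarms are created and tuned for you. Reusing the `scaling` handle from the first sketch (values are illustrative):

```typescript
scaling.scaleOnCpuUtilization('CpuTargetTracking', {
  targetUtilizationPercent: 40,               // keep average CPU near 40%
  scaleOutCooldown: cdk.Duration.seconds(60), // add capacity quickly
  scaleInCooldown: cdk.Duration.seconds(120), // remove capacity more conservatively
});
```

In general you want either the step policy or the target tracking policy on CPU, not both, otherwise the two can work against each other.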
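And for pre-warming ahead of a predictable rush, a scheduled action on the same `scaling` object could look like this (times are UTC and purely illustrative):

```typescript
scaling.scaleOnSchedule('PreWarmMorningRush', {
  schedule: appscaling.Schedule.cron({ hour: '8', minute: '30' }),
  minCapacity: 4, // hold at least 4 tasks through the login rush
});

scaling.scaleOnSchedule('RelaxAfterRush', {
  schedule: appscaling.Schedule.cron({ hour: '10', minute: '0' }),
  minCapacity: 2, // back to the normal floor afterwards
});
```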
Also check the health check settings on your containers and target group: the interval, timeout, and healthy-threshold count determine how quickly new tasks are registered as healthy behind your Application Load Balancer (ALB) and start receiving traffic.
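If the load balancer is in the same CDK app, tightening the target group health check and the deregistration delay might look like this (the `/health` path and all numbers are placeholders, not your actual settings):

```typescript
import * as cdk from 'aws-cdk-lib';

// `targetGroup` is assumed to be the ApplicationTargetGroup in front of the service.
targetGroup.configureHealthCheck({
  path: '/health',          // hypothetical health endpoint in the Express app
  interval: cdk.Duration.seconds(10),
  timeout: cdk.Duration.seconds(5),
  healthyThresholdCount: 2, // ~20s until a new task starts taking traffic
  unhealthyThresholdCount: 3,
});

// Drain replaced tasks faster than the 300-second default.
targetGroup.setAttribute('deregistration_delay.timeout_seconds', '30');
```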
Can you share your health check settings? Knowing the timeout, interval, and threshold will help us understand any delays better.
Thanks for the insights! I do have Container Insights enabled but wasn't using those metrics in my alarm. Is it possible I set it up incorrectly? Your action plan sounds good: switching the alarm to the Container Insights CPU metric and changing to a 30-second period.