I'm currently working with a client who makes extensive use of EC2 spot instances for their ECS clusters to minimize operational costs. We've noticed that high-load applications, processing around 100 HTTP requests per second, are not being drained from the target group quickly enough when a spot instance is terminated. This delay results in HTTP 502 Bad Gateway errors from the Application Load Balancer (ALB). The instances are set up to listen for termination notices to prompt the target group to drain the affected host, yet it seems the timing isn't quite working out.
I've come across a feature called "EC2 Instance Rebalance Recommendation," which I believe serves as an early warning that a spot instance might soon be interrupted due to increasing demand. However, after subscribing to these events in EventBridge, I've noticed that they typically arrive just at or immediately before the termination notice.
Has anyone else experienced this issue? Can someone clarify the connection between the rebalance recommendation and the termination notice? Additionally, I'm curious if there are other AWS tools that might help us manage this situation, as the client is looking to keep costs down and avoid using on-demand or reserved instances.
4 Answers
One thing you might want to try is tweaking the timeout settings for your applications and adjusting the deregistration delay for your target group. The default deregistration delay is 300 seconds, while the spot termination notice arrives only about 120 seconds before the instance is cut off, so with the default the instance can disappear before draining finishes. It's also a good idea to look at your longest-running requests to make sure your app timeouts are suitable. Just to clarify, the rebalance recommendation is a signal from AWS that your spot instance is at elevated risk of interruption, which can let you manage your workloads more proactively. You can find more details about this in the AWS docs [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/rebalance-recommendations.html).
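For reference, here is a rough sketch of changing the deregistration delay from Python. It assumes boto3 is installed and credentials are configured; the function names are made up for illustration, and the ARN and value you pass in are up to you. The attribute key `deregistration_delay.timeout_seconds` and the `modify_target_group_attributes` call come from the elbv2 API.

```python
def deregistration_delay_request(target_group_arn: str, seconds: int) -> dict:
    """Build the keyword arguments for elbv2.modify_target_group_attributes."""
    # The ELB API accepts 0-3600 seconds for this attribute (default 300).
    if not 0 <= seconds <= 3600:
        raise ValueError("deregistration delay must be between 0 and 3600 seconds")
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            # Attribute values are passed as strings.
            {"Key": "deregistration_delay.timeout_seconds", "Value": str(seconds)}
        ],
    }


def apply_deregistration_delay(target_group_arn: str, seconds: int) -> None:
    """Apply the setting; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without boto3

    elbv2 = boto3.client("elbv2")
    elbv2.modify_target_group_attributes(
        **deregistration_delay_request(target_group_arn, seconds)
    )
```

Keeping the request builder separate from the API call makes it easy to check the payload before touching a real target group.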
We rely heavily on termination event queues, which give us ample time to drain connections. I've almost always received the two-minute warning on those. Our setup uses the AWS Node Termination Handler for EKS, which builds on EventBridge messages, similar to what you're doing. Just a thought: are you sure your events are strictly termination warnings and not hibernation ones? By the way, are you using ECS on EC2 or Fargate?
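If it helps to tell the events apart, here is a small sketch of classifying what EventBridge delivers. It relies on the documented `detail-type` strings and on the `instance-action` field of the interruption warning (`terminate`, `stop`, or `hibernate`); the function name and the returned tuple shape are my own choices, not anything AWS defines.

```python
from typing import Optional, Tuple

REBALANCE = "EC2 Instance Rebalance Recommendation"
INTERRUPTION = "EC2 Spot Instance Interruption Warning"


def classify_spot_event(event: dict) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (kind, instance_id, action) for an EventBridge event dict.

    kind is "rebalance", "interruption", or "other"; action is only set
    for interruption warnings.
    """
    if event.get("source") != "aws.ec2":
        return ("other", None, None)
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    detail_type = event.get("detail-type")
    if detail_type == REBALANCE:
        return ("rebalance", instance_id, None)
    if detail_type == INTERRUPTION:
        return ("interruption", instance_id, detail.get("instance-action"))
    return ("other", None, None)
```

Logging the result of this for every event, with a timestamp, would also show how far apart the rebalance recommendation and the interruption warning actually arrive for a given instance.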
From what I've seen with spot instances, you seldom get guaranteed notifications, but typically, I receive rebalance recommendations anywhere from 5 to 30 minutes in advance, plus the regular two-minute interruption warnings. However, the last couple of days have been weird due to AWS difficulties, where I didn't always get the warnings prior to some terminations. I'm keen to hear how your experience with SpotInst compares to the native AWS tools.
Honestly, I haven't used the native AWS tools extensively for this, but SpotInst has really streamlined getting terminated instances replaced quickly for us. The rebalance recommendations arrive at pretty much the same time as the termination notifications, so we don't find them very useful.
Yes, these are indeed EC2 spot instances! The best approach is to monitor the instance metadata since you typically get that two-minute warning. The rebalance recommendation is more of an autoscaling event, so utilizing lifecycle hooks can be super beneficial here.
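A minimal sketch of polling the instance metadata, assuming IMDSv2 is reachable from the host. The paths `spot/instance-action` and `events/recommendations/rebalance` are the documented metadata items (they return 404 until a notice exists); the function names and the injectable `fetch` parameter are just for illustration and testing.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def imds_get(path: str, timeout: float = 1.0) -> Optional[str]:
    """Fetch a metadata path via IMDSv2; return None if it does not exist (404)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no notice has been issued yet
            return None
        raise


def check_spot_signals(fetch: Callable[[str], Optional[str]] = imds_get) -> dict:
    """Return any pending interruption notice and rebalance recommendation."""
    interruption = fetch("spot/instance-action")
    rebalance = fetch("events/recommendations/rebalance")
    return {
        "interruption": json.loads(interruption) if interruption else None,
        "rebalance": json.loads(rebalance) if rebalance else None,
    }
```

Calling `check_spot_signals()` in a loop every few seconds on the instance is the usual pattern; the shorter the polling interval, the more of the two minutes is left for draining.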
Exactly! We're already tracking the termination notice two minutes before any actual cutoff happens, and we use it to tell the target group and load balancer to drain the host and to start spinning up replacements.

We definitely found it tricky to act on those recommendations since they often come too close to the termination notice. We learned the hard way to adjust our deregistration delay from 30 seconds to 70 seconds, which gives us a bit more leeway to shut things down properly.
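To make the trade-off concrete, here is a tiny sanity check for a chosen deregistration delay. It assumes the usual 120-second notice; `reaction_seconds` (how long it takes from the notice to the deregistration call) and the issue labels are hypothetical names for this sketch.

```python
SPOT_NOTICE_SECONDS = 120  # typical spot interruption notice


def drain_budget_issues(
    deregistration_delay: int,
    longest_request_seconds: int,
    reaction_seconds: int = 0,
    notice_seconds: int = SPOT_NOTICE_SECONDS,
) -> list:
    """Return a list of problems with the chosen deregistration delay."""
    issues = []
    # Too short: in-flight requests can be cut off when draining ends.
    if deregistration_delay < longest_request_seconds:
        issues.append("delay_too_short")
    # Too long: the instance can be terminated before draining finishes.
    if reaction_seconds + deregistration_delay > notice_seconds:
        issues.append("exceeds_notice")
    return issues
```

For example, `drain_budget_issues(300, 60)` flags `exceeds_notice` for the default delay, `drain_budget_issues(30, 60, 5)` flags `delay_too_short`, and `drain_budget_issues(70, 60, 10)` returns an empty list.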