How should I handle a situation where an entire Availability Zone (AZ) experiences a network outage? I'm using an Application Load Balancer (ALB), and my Route 53 alias record points to this ALB, so DNS returns ALB IP addresses in multiple AZs. If my client doesn't implement circuit breaking or retries, will it keep failing against the ALB address in the dead AZ until the client's cached DNS TTL expires? Even then, there's a chance it could receive the same broken address on the next lookup, since the ALB won't update Route 53 dynamically. Are there any strategies to address this issue? Also, I believe the 'Evaluate Target Health' option on an alias won't help here, given that it checks backend target health and not the ALB itself.
3 Answers
Yes, you're correct. Health checks take some time to propagate failure information downstream, so if you're looking for fault tolerance, clients should definitely have a retry mechanism and a fallback plan in place.
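As a rough illustration of client-side retry with fallback, here is a minimal Python sketch that resolves every IPv4 address behind the ALB's DNS name and tries each one in turn with a short connect timeout, so a dead AZ costs one timeout instead of failing every request until the TTL expires. The function names (`resolve_ipv4`, `connect_with_failover`) and the 2-second timeout are illustrative assumptions, not anything AWS prescribes.

```python
import socket
from typing import Callable, List


def resolve_ipv4(host: str, port: int) -> List[str]:
    """Return the distinct IPv4 addresses DNS currently gives for host."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    ips: List[str] = []
    for info in infos:
        ip = info[4][0]  # sockaddr is (ip, port) for AF_INET
        if ip not in ips:
            ips.append(ip)
    return ips


def connect_with_failover(
    host: str,
    port: int,
    timeout: float = 2.0,  # assumed value; tune for your latency budget
    resolve: Callable[[str, int], List[str]] = resolve_ipv4,
    connect: Callable = socket.create_connection,
):
    """Try each resolved address in order; return the first successful connection."""
    last_error = None
    for ip in resolve(host, port):
        try:
            # A short timeout keeps an unreachable AZ from stalling the caller.
            return connect((ip, port), timeout)
        except OSError as exc:
            last_error = exc  # remember the failure and fall through to the next IP
    raise ConnectionError(f"all addresses for {host} failed") from last_error
```

Calling `connect_with_failover("my-alb.example.com", 443)` (a placeholder hostname) returns a plain TCP socket. For HTTPS you would still need to wrap it with TLS and pass the original hostname for SNI. Many HTTP client libraries already iterate over resolved addresses on connect failure, so check what yours does before rolling your own.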
Definitely check out Chaos Engineering along with the AWS Fault Injection Simulator. There's a workshop that dives into AZ disruptions and other failure scenarios. It could be helpful for your case!
Good news! Zonal shift now supports ALBs with cross-zone load balancing enabled. It might be worth checking out for your situation!
I feel you there, but most examples focus on backend services. My concern is more about the ALB connection failing. I haven't really found a way to simulate that without triggering a DNS update, which is precisely what I want to avoid.