Rethinking AWS Disaster Recovery Strategy After the Recent Outage

Asked By TechBard42 On

I've been working with our multi-region deployment, and during the recent outage I realized that our automated, health-check-driven failovers didn't work as intended. Our EventBridge global endpoint did switch to the secondary region, but the health endpoint for our Fargate service did not trigger a switch. We received alerts about increased error rates, which prompted us to fail over manually.

My current disaster recovery (DR) approach is to switch all services to the secondary region if any one service fails, which I thought was the safer strategy. However, I'm only actively monitoring Fargate, not the other services such as DynamoDB (DDB). Now I'm reconsidering whether I should monitor each service proactively rather than waiting for reactive alerts. I don't need an active-active setup, just a pilot-light or warm-standby setup.

What do you all think about this strategy? Should I be monitoring every service, or is my current plan sufficient?

4 Answers

Answered By CloudyDays88 On

It's usually simpler to switch the entire solution to a backup rather than analyze which specific service might be down during an outage. However, high availability (HA) and disaster recovery (DR) can be tricky, so it's important to find a strategy that works for your unique setup.
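To make the "switch everything if any one service fails" idea concrete, here is a minimal sketch of the decision logic in Python. It is only an illustration under stated assumptions: the `FailoverDecider` class, the probe names, and the consecutive-failure threshold are all hypothetical and are not part of any AWS SDK. Each probe is just a callable that returns `True` when a dependency looks healthy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FailoverDecider:
    """Hypothetical helper: trips a whole-stack failover when ANY probe
    fails `threshold` times in a row."""
    probes: Dict[str, Callable[[], bool]]
    threshold: int = 3
    failures: Dict[str, int] = field(default_factory=dict)

    def evaluate(self) -> bool:
        """Run every probe once; return True if the stack should fail over."""
        trip = False
        for name, probe in self.probes.items():
            try:
                healthy = probe()
            except Exception:
                # A probe that raises is treated the same as an unhealthy result.
                healthy = False
            # Reset the counter on success, otherwise count the consecutive failure.
            self.failures[name] = 0 if healthy else self.failures.get(name, 0) + 1
            if self.failures[name] >= self.threshold:
                trip = True
        return trip


# Example wiring with stand-in probes (real ones might call a health URL
# or attempt a small read against the data store).
decider = FailoverDecider(
    probes={"fargate": lambda: True, "dynamodb": lambda: False},
    threshold=2,
)
first = decider.evaluate()   # one DynamoDB failure so far, below threshold
second = decider.evaluate()  # second consecutive failure reaches threshold
```

Requiring several consecutive failures before tripping is one way to avoid failing over on a single transient error. The right threshold and probe interval depend on your recovery time objective.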

Answered By DevOpsFanatic22 On

Make sure to follow best practices like relying on regional service endpoints. Just a heads up that the control planes for some global services, like IAM and CloudFront, run in a single region (us-east-1), so plan your DR strategy so that failover doesn't depend on making control-plane changes during an outage.
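As a rough sketch of what "relying on regional service endpoints" can look like in code, the helper below builds region-scoped endpoint URLs for whichever region is currently active. The function names and the region choices are assumptions made for illustration. The `https://<service>.<region>.amazonaws.com` pattern holds for many services in the standard commercial partition (for example STS and DynamoDB), but not for all of them, so check the endpoint list for each service you use.

```python
from typing import Dict, Iterable

# Hypothetical region choices for this example.
PRIMARY_REGION = "us-east-1"
SECONDARY_REGION = "us-west-2"


def regional_endpoint(service: str, region: str) -> str:
    """Build a region-scoped endpoint URL.

    Assumes the common <service>.<region>.amazonaws.com pattern, which
    does not apply to every service or partition.
    """
    if not service or not region:
        raise ValueError("service and region are required")
    return f"https://{service}.{region}.amazonaws.com"


def endpoints_for(active_region: str,
                  services: Iterable[str] = ("sts", "dynamodb")) -> Dict[str, str]:
    """Return the endpoint map to use while `active_region` is serving traffic."""
    return {svc: regional_endpoint(svc, active_region) for svc in services}


# After a failover, clients are pointed at the secondary region's endpoints.
# Each URL could be passed to an SDK client as its endpoint override.
active = endpoints_for(SECONDARY_REGION)
```

The point of the sketch is that nothing in the request path refers to a global endpoint, so a problem in one region doesn't affect calls made against the other.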

Answered By DataNinja77 On

Your approach is reasonable, but were you actually impacted during the outage? A lot of outages happen in the backplane (the control plane) rather than in the data plane that serves your traffic. If that was the case here, you might want to reconsider which services you include in your failover plan. It's all about understanding the level of risk you're willing to take.

Answered By CrisisManager99 On

Thinking ahead is key! DR should be part of your design from the start, especially for critical applications. Many companies only think about it after a disaster happens, which isn't ideal. Regular DR exercises can really help you perfect your strategy as part of your overall system management.
