I'm curious about how AWS handles incidents, particularly when something like DynamoDB has an outage. Specifically, I'm wondering how IAM operates in regions like us-east-1 and why it seems to only run there. Is it a known issue that IAM doesn't have regional backups? Also, how common are these types of outages across different cloud platforms?
4 Answers
For more insight, check out this video by Fireship about the recent outages. It's got great coverage on how these things unfold in AWS!
DynamoDB's issues were indeed related to DNS problems, which triggered a chain reaction impacting other services. Once AWS identifies the root cause, they work on resolving it, but full recovery can take a while, sometimes several hours. My team switched over to us-east-2 as soon as we learned about the outage; we have a system that automates that failover!
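To make the automated failover idea concrete, here's a minimal sketch of the region-selection step, not the setup my team or anyone else actually runs. The function `pick_region` and the probe `fake_probe` are hypothetical names made up for illustration; a real probe would hit whatever health endpoint your own service exposes in each region.

```python
from typing import Callable, Iterable, Optional


def pick_region(preferred: Iterable[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose health probe passes, or None."""
    for region in preferred:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that raises (timeout, DNS failure) counts as unhealthy.
            continue
    return None


# Hypothetical probe: simulate us-east-1 failing DNS resolution.
def fake_probe(region: str) -> bool:
    if region == "us-east-1":
        raise OSError("simulated DNS failure")
    return True


active = pick_region(["us-east-1", "us-east-2"], fake_probe)
print(active)  # us-east-2
```

The list order encodes the preference, so traffic only moves to us-east-2 when the probe for us-east-1 fails or errors out. Treating an exception as "unhealthy" matters here because a DNS problem usually shows up as a failed lookup rather than a clean "down" response.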
Outages like this aren't uncommon. When a service this central fails, it can lead to widespread issues. AWS's recovery process involves on-call engineers jumping in to handle incidents quickly and effectively. Think of them like "first responders" for AWS failures!
us-east-1 is a unique region in AWS's infrastructure. Basically, the IAM control plane (the part that handles creating and changing users, roles, and policies) runs in only one region per partition, and for the standard commercial partition that region is us-east-1. If it goes down, IAM changes are affected across the whole partition. That's what happened recently when DynamoDB had DNS issues, causing major disruptions across several services, including IAM. It's a big deal when that happens!
That's interesting! So, does that mean that even if IAM fails, existing resources can still function normally as long as they don't need to change anything in IAM?
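Yeah, that's right! Each region has its own read-only copy of IAM data (the IAM data plane), so authentication and authorization for existing identities can keep working during an outage of the IAM control plane, as long as what you're doing doesn't require changing IAM settings.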

I saw that! Great resource if you want to dive deeper into AWS's response process.