I want to know whether Route 53 actually fell below its guaranteed 100% SLA, and who is responsible or at fault. Specifically, if a service had been designed with a multi-region architecture, would it have kept functioning properly during the outage?
4 Answers
The main issue wasn't the availability of Route 53 as a service. Instead, it looks like a DNS record was deleted, likely by some automated scaling or availability process. In this case the customer (which here was AWS itself) is responsible for maintaining its own DNS records, so those changes wouldn’t fall under Route 53's SLA.
There are two aspects to consider. First, Route 53 hosted zones are distributed across multiple servers, so records typically don’t fail to resolve. Second, the control plane for Route 53 is primarily based in us-east-1. If there's an issue in that region, nobody can make changes to Route 53, and that affects all regions. AWS did introduce a service called Route 53 ARC (Application Recovery Controller), but it feels like a complicated and not fully fleshed-out solution. They might get around to providing multi-region high availability for the control plane one day, but who knows when that’ll happen?
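One way to sidestep the control-plane dependency is to not need any DNS change during an incident: provision every regional endpoint ahead of time and let the client fall back on its own. A rough Python sketch of that idea follows. The endpoint names and the `probe` callable are hypothetical placeholders for illustration, not anything AWS provides.

```python
from typing import Callable, Optional, Sequence

# Hypothetical, pre-provisioned regional endpoints. Nothing is created or
# changed at failover time, so no Route 53 control-plane call is required.
REGIONAL_ENDPOINTS = (
    "api.us-east-1.example.com",
    "api.eu-west-1.example.com",
)


def pick_endpoint(
    probe: Callable[[str], bool],
    endpoints: Sequence[str] = REGIONAL_ENDPOINTS,
) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, or None if all fail."""
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except OSError:
            # Treat network-level errors as "unhealthy" and try the next one.
            continue
    return None
```

Here `probe` would be whatever health check suits the service (an HTTP GET against a status path, a TCP connect, and so on). The point is only that the failover decision is made with records that already exist.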
Actually, the data plane, which serves DNS queries, kept operating normally. That's the part Route 53's 100% SLA applies to. The query-serving architecture is pretty robust, with numerous independent Points of Presence (POPs) worldwide that answer DNS queries.
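If you want to see the data-plane side for yourself, a check along these lines only exercises name resolution and never touches the Route 53 API. This is a minimal sketch using the standard library; the `resolver` parameter is my own addition so the function can be tested without network access.

```python
import socket
from typing import Callable, List


def resolve_addresses(
    hostname: str,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] on failure."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The name did not resolve (missing record, resolver problem, etc.).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address for both IPv4 and IPv6.
    return sorted({entry[4][0] for entry in results})
```

Note that an empty list only says the name did not resolve. It can't tell a deleted record apart from an unreachable resolver, so treat it as a symptom rather than a diagnosis.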
Did anyone check the AWS Health Dashboard? This issue wasn’t related to Route 53 at all.

Just a heads up: it's not entirely true that the control plane is only in us-east-1. There's actually an internal-only endpoint available in eu-west-1.