I recently ran a disaster recovery (DR) test and discovered a major oversight in our recovery plan: it relies entirely on Entra ID being available. We failed over to our backup data center successfully, but we couldn't sign in to any of our applications, because Entra, which handles our authentication, was still in the primary region.

Our DR runbook didn't account for identity at all because we never expected it to fail. Entra doesn't fail over along with the rest of our DR process, and if there's an outage on Microsoft's end, we can't authenticate to our backup apps, which defeats the purpose of the plan.

We even locked ourselves out: our backup admin credentials live in a vault that requires Entra sign-in, and while break-glass accounts exist, they only get us into the tenant, not the actual applications.

The dilemma is that Entra has no DR mode we can control ourselves, and going multi-region would apparently require separate tenants, which complicates everything. How do others manage identity when your IdP is a cloud service you can't fail over yourself?
5 Answers
It sounds like you're mixing up a few concepts here. Entra ID is geo-redundant by design, so it shouldn't be thought of as a single point of failure. The real issue sounds more like your apps in the DR region couldn't reach Entra due to network connectivity, not a failure of Entra itself. Cached tokens can stay valid for a while without reaching Entra, so users often won't notice a brief outage. The more pressing concern is that the vault holding your DR admin credentials relies on Entra; that's definitely a design flaw. Keep an offline copy, like a USB KeePass database or printed credentials in a physical safe, with no cloud dependencies at all.
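On the cached-token point: you can check how long an existing access token remains usable by decoding its `exp` claim. A minimal Python sketch below, using a fabricated sample token for illustration; it only decodes the payload and does not verify the signature, so treat it as an inspection aid, not validation.

```python
import base64
import json

def token_expiry(jwt_token):
    """Return the 'exp' claim (Unix time) from a JWT access token.
    Decodes the payload only; does NOT verify the signature."""
    payload_b64 = jwt_token.split(".")[1]
    # JWT base64url segments omit padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"]

# Fabricated token (header.payload.signature) purely for demonstration
sample_payload = base64.urlsafe_b64encode(
    json.dumps({"exp": 1700003600, "iat": 1700000000}).encode()
).decode().rstrip("=")
sample_token = f"e30.{sample_payload}.sig"

print(token_expiry(sample_token))  # 1700003600
```

Comparing `exp` against the current time tells you how much runway your cached sessions have before the apps must reach Entra again.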
You already have the solution in mind: Local Active Directory (AD). If you don’t trust it, you might want to take steps to make it reliable. It’s a cost-effective solution since it’s already in place.
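The fallback idea can be sketched as an authentication chain: try the cloud IdP first and drop to local AD when it's unreachable. The provider functions below are hypothetical stand-ins, not real Entra or LDAP client calls; the point is the ordering logic, under the assumption that your apps can be configured with more than one identity source.

```python
class IdPUnreachable(Exception):
    """Raised when an identity provider cannot be contacted."""
    pass

def authenticate_entra(user, password):
    # Stand-in for an OIDC call to Entra; here it simulates an outage
    raise IdPUnreachable("Entra endpoint unreachable")

def authenticate_local_ad(user, password):
    # Stand-in for an LDAP bind against on-prem AD
    return {"user": user, "source": "local-ad"}

def authenticate(user, password):
    # Try each provider in priority order; fall through on outage
    for provider in (authenticate_entra, authenticate_local_ad):
        try:
            return provider(user, password)
        except IdPUnreachable:
            continue
    raise RuntimeError("all identity providers unavailable")

result = authenticate("dr-admin", "example-password")
print(result["source"])  # local-ad
```

Whether this is practical depends entirely on whether each application supports a second auth backend; for apps that only speak to Entra, the fallback has to happen at the federation layer instead.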
Our DR plan assumes that any scenario requiring a complete rebuild of our M365 tenant also means Microsoft still has a platform for us to rebuild into. If they don't, everyone else is facing the same problem, so it's not solely on us.
Remember that incident with Facebook? Their engineers couldn't get into their own servers because internal tooling and even badge access depended on DNS, which they themselves had broken. They needed physical access to recover, so planning for that kind of out-of-band physical access is crucial.
The likelihood of Entra failing at the same time as your local infrastructure is so slim that it isn't worth overhauling everything for. Conduct a thorough risk analysis for each environment separately. Also plan for risks like ransomware that can hit everything at once; you have to accept those realities, and realistically you may end up paying attackers at some point.
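To make "so slim" concrete, a back-of-envelope annualized loss expectancy (ALE) calculation can justify the decision. All the probabilities and costs below are illustrative assumptions, not real outage statistics; plug in numbers from your own incident history.

```python
def ale(annual_probability, cost_per_incident):
    """Annualized loss expectancy: expected cost per year of a risk."""
    return annual_probability * cost_per_incident

# Assumed inputs (illustrative only)
p_entra_outage = 0.02       # chance per year of a multi-hour Entra outage
p_local_dr_event = 0.05     # chance per year of needing a local failover
cost_combined = 2_000_000   # cost if both happen at once

# If the two failures are independent, the joint probability is tiny
p_both = p_entra_outage * p_local_dr_event
print(f"joint probability: {p_both:.4f}")                              # 0.0010
print(f"ALE of combined failure: ${ale(p_both, cost_combined):,.0f}")  # $2,000
```

If the expected annual loss of the combined failure is a few thousand dollars while the re-architecture costs far more, the numbers support treating the two risks separately. Note the independence assumption breaks down for correlated events like ransomware, which is exactly why those deserve their own analysis.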
If Entra fails completely, IT teams everywhere will have much bigger problems on their hands.

It will be an interesting day when Entra goes down globally even for just a few minutes.