Hi everyone,
I currently have a functioning stretch cluster with three nodes (two on the primary site and one on the secondary site) using a file share witness for quorum. It works well under normal conditions and during simulated outages; I can move VMs and access the CSV volume without issues. However, when the primary site fails completely, all services on the primary stop and the secondary node stays up, but nothing fails over automatically.
I find myself having to manually restart the cluster service and perform other operations like using `Set-SRPartnership` to restore functionality. The process has been quite inconsistent, which leads me to believe there's something I'm missing. I've also looked into Microsoft documentation but haven't found clear guidance on recovering from such a crash at the primary site.
My understanding is that the cluster should ideally handle this failover automatically when replication is in synchronous mode, but it doesn't. Has anyone experienced similar issues and found a reliable way to get the cluster back up after a total crash at the primary site? Any insights would be appreciated! Thanks!
2 Answers
From what you're describing, the issue seems tied to how votes are counted during a failure. When both primary nodes go down at once, the secondary node is left without quorum, even though it is still running and you have a witness configured. Consider testing with a fourth node, as suggested in the other answer; it may resolve the vote-counting problem when a primary site outage happens.
It sounds like you might need to rethink your node setup. For a three-node cluster, at least two nodes must be operational for quorum, so when both nodes at your primary site fail, you effectively lose quorum. One option is to add a fourth node at your secondary site so that you keep a majority if one site goes down. Using a cloud witness can also help maintain quorum stability between sites.
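To make the vote counting concrete, here is a rough sketch of the arithmetic in Python. It assumes a plain static majority rule (strictly more than half of all votes must be online) and that the witness is still reachable from the surviving site. It ignores dynamic quorum and dynamic witness adjustments, so treat it as an illustration of the reasoning rather than a model of what the cluster service actually does. The function names are made up for this example.

```python
def has_quorum(votes_online, total_votes):
    # Static majority rule: strictly more than half of all votes must be online.
    return 2 * votes_online > total_votes


def quorum_after_primary_loss(primary_nodes, secondary_nodes, witness=True):
    # Assumption: each node has one vote and the witness (if any) has one vote.
    # Assumption: the witness stays reachable from the secondary site.
    witness_vote = 1 if witness else 0
    total_votes = primary_nodes + secondary_nodes + witness_vote
    # After the primary site is lost, only the secondary nodes and witness remain.
    votes_online = secondary_nodes + witness_vote
    return has_quorum(votes_online, total_votes)


# Layout from the question: 2 primary + 1 secondary + witness = 4 votes, 2 survive.
current_layout = quorum_after_primary_loss(2, 1)

# Suggested layout: 2 primary + 2 secondary + witness = 5 votes, 3 survive.
proposed_layout = quorum_after_primary_loss(2, 2)

print("current layout keeps quorum:", current_layout)
print("proposed layout keeps quorum:", proposed_layout)
```

With the current layout the survivors hold 2 of 4 votes, which is not a majority; with a fourth node at the secondary site they hold 3 of 5. Note that this only works if the witness lives somewhere the secondary site can still reach during the outage, which is one argument for a cloud witness.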
Adding a fourth node makes sense. I had a similar setup, and once I added another node and a cloud witness, everything stabilized. The key is that more than half of the votes (nodes plus witness) must stay up for the cluster to maintain quorum.

I'll definitely test with a fourth node! I think that might be just the ticket for consistency. Thanks for the advice!