I'm in a bit of a bind with my Hyper-V failover cluster: three hosts connected to a PowerStore appliance via iSCSI. The PowerStore presents two LUNs, one for shared VM storage and a 50 GB disk witness. Everything appears to be configured correctly, with redundant MPIO paths and redundant switches.

However, we recently hit a situation where both switches went down for 30 minutes. During the outage the VMs lost storage access, which is expected. But once the connections were restored, things didn't go back to normal: the LUNs were visible to the hosts but remained offline. I attempted to partially start the cluster, but the Cluster Name Object (CNO) was unreachable, so I couldn't manage the cluster effectively. This isn't the first time it has happened; a similar failure previously forced us to rebuild the cluster manually.

I'm trying to understand whether Hyper-V is known to be this sensitive, or whether something in our cluster setup is keeping it from recovering automatically once iSCSI is restored. Should we also consider switching from the disk witness to a file share witness? I'm also wondering whether moving to Hyper-Converged Infrastructure (HCI) would be a better option, since the ongoing iSCSI trouble is becoming a concern. Budget is tight, though, and that would leave our PowerStore appliance underutilized.
3 Answers
To get things back on track, have you tried testing the paths from each host to the PowerStore? Pinging each controller from every host might reveal issues. Also, are you using jumbo frames? If the storage connection drops for too long, VMs can freeze and may need a power reset.
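Just a side note: you mentioned three hosts plus a disk witness? That's four votes in total. The usual guidance is an odd number of votes, so a clean majority always exists and you avoid split-brain scenarios. Windows Server 2012 R2 and later adjust the witness vote dynamically to keep the total odd, so this may already be handled, but it's worth checking in your overall setup.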
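To make the vote arithmetic concrete, here is a minimal sketch of plain majority counting. It assumes static votes only (one per node, one for the witness) and deliberately ignores the dynamic quorum adjustments Windows may apply; the function names are made up for illustration.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest strict majority of the configured votes."""
    return total_votes // 2 + 1


def failures_tolerated(total_votes: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return total_votes - votes_needed(total_votes)


# Compare 3 votes (nodes only), 4 votes (nodes + witness), and 5 votes.
for total in (3, 4, 5):
    print(f"{total} votes: need {votes_needed(total)}, "
          f"can lose {failures_tolerated(total)}")
```

Under this simple model, four votes tolerate no more losses than three, which is why an even total buys nothing.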
Good point! I’m still figuring Hyper-V out, so I'm not sure what the ideal configuration should be. Appreciate the insight!
It sounds like your DC might be running on the cluster itself, which can slow down recovery. Having a separate domain controller outside the cluster helps with outages like the one you experienced. Make sure DNS and DHCP are working properly; the cluster depends on them to recover after storage access comes back. If the cluster role isn't being restored, it might be because the CNO can't be reached, so check that the cluster name resolves and that a domain controller is reachable from each node.
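A quick way to rule out DNS is to check from each node whether the cluster name resolves at all. The sketch below uses Python's standard `socket.getaddrinfo`; the `resolve_name` helper and the cluster name in the comment are hypothetical, and the resolver is injectable only so the function can be exercised without a live DNS server.

```python
import socket


def resolve_name(name, resolver=socket.getaddrinfo):
    """Return the sorted unique IP addresses for name, or [] if lookup fails."""
    try:
        results = resolver(name, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in results})


# Example (hypothetical CNO name, substitute your own):
#   resolve_name("HVCLUSTER01")
# An empty list means the name did not resolve from this machine.
```

If the name resolves but the cluster still can't bring the CNO online, DNS is probably not the culprit and the next place to look is Active Directory.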
Yeah, I have a physical DC and some VMs on the VMware cluster that can help. Even with that, it seems like the cluster isn't able to reach its own CNO, which is frustrating.

I can ping all controllers from each host without issues. Jumbo frames are on, but there’s some MTU size variance between the PowerStore and the hosts that has me worried. Could that mismatch really cause issues?
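One way to find out whether the mismatch matters is a don't-fragment ping sized to the MTU you expect the path to carry. An ordinary ping uses a tiny payload and succeeds even when jumbo frames are broken somewhere along the path. Here is a rough sketch of the arithmetic, assuming IPv4 without options; the MTU values and the target address are placeholders, not taken from this setup.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def path_mtu(mtus: dict) -> int:
    """The usable MTU is the smallest value configured along the path."""
    return min(mtus.values())


def windows_ping_command(target: str, mtu: int) -> str:
    """Build a Windows ping with don't-fragment (-f) and payload size (-l)."""
    return f"ping -f -l {max_ping_payload(mtu)} {target}"


# Placeholder values for illustration only.
configured = {"host_nic": 9000, "switch": 9216, "storage_port": 1500}
print("usable path MTU:", path_mtu(configured))
print(windows_ping_command("10.0.0.10", 9000))
```

If the 8972-byte ping fails while a 1472-byte one succeeds, something on the path is not passing jumbo frames. Also note that some NIC drivers quote the jumbo size including the 14-byte Ethernet header (9014, for example), so settings that look different can actually match.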