I'm currently testing a 2-node cluster with Pacemaker, Corosync, and DRBD (Active/Passive). Node 1 is set as the Primary and Node 2 as the Secondary, with Node 1 having a location preference score of 50. Here's the situation:
1. I simulated a failure on Node 1 (commands after this list), and the resources successfully transferred to Node 2.
2. While resources were running on Node 2, I started a large file transfer to the DRBD mount point.
3. After a while, I brought Node 1 back online.
4. Pacemaker immediately switched resources back to Node 1, killing the file transfer on Node 2 and leaving me with a corrupted file.
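For reference, a failure like the one in step 1 can be simulated with standby mode; this is just one way to do it, and a hard power-off exercises more of the failure path (pcs 0.10+ syntax; older releases spell it `pcs cluster standby`):

```
# Step 1: take node1 out of service so resources fail over to node2
pcs node standby node1

# Step 3: bring node1 back into the cluster
pcs node unstandby node1
```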
I thought Pacemaker or DRBD would hold off on resource switching until ongoing write operations or synchronization completed. But clearly, that's not the case.
1. Is this behavior typical? Does Pacemaker not consider active user jobs?
2. How can I change the cluster's configuration to keep it on Node 2 until all transfers and syncs are complete? I do need Node 1 to always be the master when it's available.
3. Should I be worried about filesystem corruption from this, or is it just interrupted transactions?
Here's a glimpse of my configuration (pcs equivalents below):
- stonith-enabled=false (I know this isn't safe; I'm just testing)
- default-resource-stickiness=0
- Location Constraint: Resource prefers node1=50
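In pcs terms, the relevant parts look roughly like this (`ha_group` is a placeholder for my actual resource group name):

```
# Cluster-wide properties (test setup only; fencing intentionally off)
pcs property set stonith-enabled=false

# Stickiness 0: a running resource gains nothing from staying put
pcs resource defaults resource-stickiness=0

# Location preference: the group prefers node1 with score 50
pcs constraint location ha_group prefers node1=50
```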
Any advice would be appreciated!
3 Answers
Yep, that's expected behavior. Pacemaker has no knowledge of active I/O or user sessions; it only evaluates scores and constraints. With stickiness at 0, the location preference of 50 wins the moment Node 1 rejoins, so resources move back immediately. DRBD likewise won't delay a promotion just because the Secondary has writes in flight.

To keep resources on Node 2 until you're ready (see the sketch below):

- Raise resource-stickiness above the location score, e.g. to 100 or more, so a running resource outweighs Node 1's preference.
- Alternatively, ban the resource from Node 1 and only lift the ban manually, or after DRBD reports a full sync.
- Don't rely on a prefers=50 constraint alone for master selection; use DRBD Master/Slave (promotable clone) constraints or manual promotion.
- And definitely enable STONITH. Without fencing, a failed node can cause split-brain, and that's where real corruption happens.

As for corruption: as long as DRBD replication stays healthy, you're looking at interrupted writes, i.e. a truncated or partial file (data loss for that transfer), not a damaged filesystem.
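A minimal sketch of those changes in pcs syntax (`ha_group` and the DRBD resource `r0` are placeholders; adapt the names to your configuration):

```
# Stickiness must beat the location score of 50, or the preference
# for node1 keeps forcing a move
pcs resource defaults resource-stickiness=100

# During a long transfer, pin resources away from node1 ...
pcs resource ban ha_group node1

# ... check replication and wait until both peers show UpToDate ...
drbdadm status r0

# ... then lift the ban once the transfer and sync are done
pcs resource clear ha_group node1

# Before anything resembling production, turn fencing back on
pcs property set stonith-enabled=true
```

`pcs resource ban` creates exactly the kind of temporary -INFINITY location constraint mentioned above, and `pcs resource clear` removes it.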
Thanks for the confirmation! I’ll implement these changes.
If possible, use three nodes for your Pacemaker setup. With an odd number of nodes, the surviving majority keeps quorum when one node drops: the two remaining nodes can tell that they, not the lost peer, are still the cluster, and continue operating safely. Everything in the previous answer still holds; the extra node just adds resilience.
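For a 2-node cluster, corosync relies on the two_node workaround instead of a real majority; with a third voter you can drop it. A hedged example of the relevant corosync.conf section:

```
# /etc/corosync/corosync.conf (excerpt)
quorum {
    provider: corosync_votequorum
    # Two-node clusters need "two_node: 1", which bypasses real
    # majority voting; with a third node you can remove it
    # two_node: 1
}
```

If a full third node is too much, a corosync-qdevice arbiter provides the tie-breaking vote without having to run resources on it.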
Great suggestion! I often recommend adding a node so the cluster can keep quorum after a failure.
You might want to check out Linstor as well. It provides better orchestration for DRBD volumes, but be aware that if your underlying DRBD setup is misbehaving, Linstor won't fix that by itself.
True! It's good for orchestration, but not a fix if your base DRBD setup is flawed.

I can confirm this! I've set my resource stickiness to 1000 to avoid the issues you're facing.