I'm working with systems that handle irreversible actions like charging cards or confirming bookings, and I've run into issues with retries triggering double commits due to race conditions or other failures. Even when using idempotency keys, I'm facing challenges in situations with concurrent execution attempts, retry storms, process restarts, and partial failures between proposal and commit stages. I'm curious about how others enforce exactly-once semantics at the commit boundary. Are most people relying just on database constraints and idempotency keys, or are there other patterns being used? I'm especially interested in methods that can survive restarts and replay without leaning entirely on application logic. Any concrete solutions or examples from real-world scenarios would be greatly appreciated!
2 Answers
Using idempotency keys alongside atomic database transactions generally provides a solid foundation for preventing double commits, even after restarts. The trick is to store the idempotency key in the same transaction as the business write, so that either both are committed or neither is. But here's a thought: when dealing with external systems whose side effects may not stay consistent with your commits, do you treat the database as your ultimate source of truth and reconcile from there, or do you defer to the external system? Understanding how to manage that boundary is key.
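To make the "same transaction" point concrete, here's a minimal sketch using Python's built-in sqlite3. The table names, columns, and the confirm_booking function are made up for illustration, and a real setup would use whatever database and schema you already have.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: the PRIMARY KEY on idempotency_keys.key is what rejects duplicates.
    conn.executescript("""
        CREATE TABLE bookings (id INTEGER PRIMARY KEY AUTOINCREMENT, customer TEXT NOT NULL);
        CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, booking_id INTEGER NOT NULL);
    """)


def confirm_booking(conn, key, customer):
    """Return the booking id for this key, creating the booking at most once."""
    try:
        # "with conn" wraps both inserts in one transaction:
        # commit if the block succeeds, rollback if it raises.
        with conn:
            cur = conn.execute("INSERT INTO bookings (customer) VALUES (?)", (customer,))
            booking_id = cur.lastrowid
            conn.execute(
                "INSERT INTO idempotency_keys (key, booking_id) VALUES (?, ?)",
                (key, booking_id),
            )
            return booking_id
    except sqlite3.IntegrityError:
        # The key was already used, so the booking insert above was rolled back too.
        # Return the result recorded by the first successful attempt.
        row = conn.execute(
            "SELECT booking_id FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        return row[0]


conn = sqlite3.connect(":memory:")
create_schema(conn)
first = confirm_booking(conn, "req-123", "alice")
replay = confirm_booking(conn, "req-123", "alice")  # e.g. a retry after a restart
```

The replay gets back the original booking id and no second row is written. Note this only covers writes inside your own database; it says nothing about an external side effect that happens outside the transaction.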
There are definitely ways to tackle these issues, though folks often complicate things unnecessarily. The main challenge here is the 'dead zone.' That gap between the final save and the confirmation is tricky because if something goes wrong, you can't tell if the transaction went through. Databases use Write-Ahead Logging (WAL) to shrink this dead zone on their side: once the commit record is durable, crash recovery knows the transaction went through. A client that never received the acknowledgment is still left guessing, though. One key approach is to avoid retries unless you receive a confirmed failure response. If you aren't sure whether it succeeded, you're still in that dead zone. So, I suggest:
- Implement recovery by checking the source to see whether the transaction went through; if it did, treat the operation as successful.
- If you can't query the source, generate an alert or report; someone will need to follow up manually.
Those who try to automate the recovery can end up creating complex systems that fail anyway!
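A rough sketch of that policy in Python, in case it helps. FakeProvider, its charge and lookup methods, and the exception types are all invented stand-ins for whatever your external API actually exposes.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    NEEDS_MANUAL_REVIEW = "needs_manual_review"


class FakeProvider:
    """Hypothetical stand-in for an external payment API."""

    def __init__(self, queryable=True, drop_response=False):
        self.charges = {}
        self.queryable = queryable
        self.drop_response = drop_response

    def charge(self, ref, amount):
        if amount <= 0:
            raise ValueError("declined")  # confirmed failure, nothing was charged
        self.charges[ref] = amount  # the irreversible side effect
        if self.drop_response:
            raise TimeoutError("response lost")  # caller never sees the result

    def lookup(self, ref):
        if not self.queryable:
            raise NotImplementedError("no query API")
        return ref in self.charges


def charge_with_recovery(provider, ref, amount, alerts):
    try:
        provider.charge(ref, amount)
        return Outcome.SUCCEEDED
    except ValueError:
        # Confirmed failure: the only case where a plain retry would be safe.
        return Outcome.FAILED
    except TimeoutError:
        pass  # dead zone: outcome unknown, so do NOT retry blindly

    try:
        charged = provider.lookup(ref)
    except NotImplementedError:
        alerts.append(ref)  # cannot query the source, hand off to a human
        return Outcome.NEEDS_MANUAL_REVIEW
    # Assumption: lookup is authoritative. A real provider may still have
    # the request in flight, so "not found" might need a delayed re-check.
    return Outcome.SUCCEEDED if charged else Outcome.FAILED


alerts = []
result = charge_with_recovery(FakeProvider(drop_response=True), "order-1", 500, alerts)
```

Here the response is lost after the charge went through, and the lookup resolves it to a success without a second charge attempt.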

Totally agree: uncertainty about whether the external effect happened is central to the issue. If you get a timeout or crash and the outcome is unknown, that's where duplicates can sneak in. My take is:
- If the external system supports idempotency, retry using a stable key.
- If it allows for state querying, treat unknown outcomes as pending and follow up.
- If neither is an option, then you can't guarantee exactly-once; it then comes down to manual checks or finding a new provider.
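For the first case, the part that tends to go wrong is the "stable" bit: the key has to be created once and reused on every attempt, not regenerated per retry. A small sketch, with IdempotentProvider as a hypothetical provider that deduplicates on the key, and a plain dict standing in for an operation record you would persist durably before the first call:

```python
import uuid


class IdempotentProvider:
    """Hypothetical provider that deduplicates on an idempotency key."""

    def __init__(self, dropped_responses=0):
        self.results = {}
        self.executions = 0
        self._drops = dropped_responses

    def charge(self, key, amount):
        if key not in self.results:
            self.executions += 1  # the real side effect happens only here
            self.results[key] = f"charge-{self.executions}"
        if self._drops > 0:
            self._drops -= 1
            raise TimeoutError("response lost after the charge was applied")
        return self.results[key]


def charge_with_stable_key(provider, operation, max_attempts=5):
    # Create the key once and keep it on the operation record. In a real
    # system this record would be saved durably BEFORE the first attempt,
    # so a restarted process picks up the same key.
    operation.setdefault("idempotency_key", str(uuid.uuid4()))
    for _ in range(max_attempts):
        try:
            return provider.charge(operation["idempotency_key"], operation["amount"])
        except TimeoutError:
            continue  # safe to retry only because the key is unchanged
    return None  # still unknown after all attempts


provider = IdempotentProvider(dropped_responses=2)
operation = {"amount": 500}
charge_id = charge_with_stable_key(provider, operation)
```

Two responses are lost, the third attempt returns the original result, and the provider executed the charge only once.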
I think even with reconciliation, it's crucial to have a single authority for handling recovery to avoid recreating race conditions in that process.
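One way to get that single authority, sketched with sqlite3: recovery workers have to claim an operation with a conditional UPDATE before touching it, so only one of them wins. The pending_ops table, the lease columns, and try_claim are my own illustrative names, not anything standard.

```python
import sqlite3


def create_ops_table(conn):
    # Hypothetical table of operations whose external outcome is still unknown.
    conn.execute("""
        CREATE TABLE pending_ops (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'unknown',
            owner TEXT,
            lease_until REAL
        )
    """)


def try_claim(conn, op_id, worker, now, lease_seconds=30.0):
    """Return True if this worker now owns recovery for op_id."""
    with conn:  # single transaction
        cur = conn.execute(
            """UPDATE pending_ops
               SET owner = ?, lease_until = ?
               WHERE id = ? AND status = 'unknown'
                 AND (owner IS NULL OR lease_until < ?)""",
            (worker, now + lease_seconds, op_id, now),
        )
        # The WHERE clause matches for at most one claimant at a time.
        return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
create_ops_table(conn)
with conn:
    conn.execute("INSERT INTO pending_ops (id) VALUES ('op-1')")

a_wins = try_claim(conn, "op-1", "worker-a", now=0.0)
b_blocked = try_claim(conn, "op-1", "worker-b", now=10.0)  # lease still held
b_takes_over = try_claim(conn, "op-1", "worker-b", now=31.0)  # lease expired
```

The lease is there so a crashed worker doesn't block recovery forever. It is not airtight on its own: a slow worker can outlive its lease, so the final status update should also check that the owner is still the same.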