Last Friday, our payment gateway experienced a brief timeout lasting about two minutes, which prevented customers from completing checkouts. Engineering classified the incident as a P3, considering it a known issue with our third-party provider that self-resolved without code changes. However, finance completely flipped out, labeling it a P1 because we lost significant revenue during the peak shopping period of Black Friday weekend. Customers who encountered errors abandoned their carts and didn't return to purchase.
Support sided with finance as they were inundated with tickets, and many customers threatened chargebacks on social media. On the other hand, the product team leaned towards engineering's perspective, arguing that the system performed as intended: the timeout and retry logic functioned correctly.
During the postmortem, the team spent more time debating the severity level than discussing ways to improve our payment processing reliability. Finance insists that anything related to payments must be classified as a P1 by default, while engineering argues that this approach undermines severity scales. It's a tricky situation where both sides have valid points: technically, it was a minor incident, but financially, it was a major loss. I'm reaching out to the fintech and eCommerce community for advice on handling scenarios like this.
4 Answers
Maybe you need a better system for classifying these incidents. In my experience, any outage that affects revenue generation should automatically be rated higher, regardless of how quickly it resolves. It’s not just about technical perfection.
For sure, anything impacting revenue directly should be treated as a P1. If your payment gateway goes down, even for a couple of minutes, that means lost sales and a bad experience for customers. Who cares if it was a known issue? Every second counts during peak shopping times!
Exactly! It's not just about tech; it's about business impact. We need to keep customers happy and returning.
Totally agree! Loss of revenue should always drive our priorities, especially in critical times like Black Friday.
The distinction between severity and priority can be murky. Engineers may view it as a P3 since it resolved without immediate action, but from a business standpoint, any time we can't process payments, it should be a P1. Customers don't care about our backend logic when they can't check out!
Right! It’s about user experience and trust. We need to safeguard against this happening again.
I think if you can quantify the financial impact, it justifies considering this a P1, even if it was resolved quickly. Those numbers add weight to your argument. Maybe also look into switching to a more reliable payment provider, or at least having a backup option.
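To make the "quantify the financial impact" point concrete, here is a rough back-of-the-envelope sketch in Python. The function name, the parameters, and all of the numbers are made up for illustration; plug in your own order rate and average order value from your analytics.

```python
def estimate_lost_revenue(outage_minutes, orders_per_minute, avg_order_value,
                          recovery_rate=0.0):
    """Rough revenue lost during a checkout outage.

    recovery_rate is the fraction of affected customers who come back and
    buy later (0.0 means nobody returns, as in the original post).
    """
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be between 0 and 1")
    missed_orders = outage_minutes * orders_per_minute
    return missed_orders * avg_order_value * (1.0 - recovery_rate)


# Hypothetical inputs: 2-minute outage, 150 orders/min at peak, $80 average order.
loss = estimate_lost_revenue(2, 150, 80.0)
print(f"Estimated loss: ${loss:,.2f}")  # Estimated loss: $24,000.00
```

Even a crude figure like this gives the postmortem something to argue about other than the label.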
Exactly! A secondary payment gateway could prevent this from happening again, and it might be worth the investment given the losses.
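For anyone wondering what a backup gateway looks like in practice, here is a minimal sketch of the failover idea. Everything in it is hypothetical: `GatewayTimeout`, `charge_with_failover`, and the two stub gateways stand in for whatever client libraries your providers actually ship.

```python
class GatewayTimeout(Exception):
    """Raised by a gateway client when the provider does not respond in time."""


def charge_with_failover(gateways, amount_cents, order_id):
    """Try each (name, charge_fn) pair in order; return the first success."""
    errors = []
    for name, charge in gateways:
        try:
            return name, charge(amount_cents, order_id)
        except GatewayTimeout as exc:
            # Remember why this gateway failed, then fall through to the next one.
            errors.append(f"{name}: {exc}")
    raise GatewayTimeout("all gateways timed out: " + "; ".join(errors))


# Stub gateways for illustration only.
def primary(amount_cents, order_id):
    raise GatewayTimeout("no response after 5s")


def secondary(amount_cents, order_id):
    return {"order_id": order_id, "amount_cents": amount_cents, "status": "approved"}


used, result = charge_with_failover(
    [("primary", primary), ("secondary", secondary)], 4999, "order-123"
)
print(used, result["status"])  # secondary approved
```

One caveat: a timeout does not prove the first charge failed, so a real setup would need some way to reconcile or deduplicate by order ID before retrying elsewhere, or you risk double-charging.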

Definitely! Clear definitions based on business impact would help settle these debates before they start.
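One way to write those definitions down so nobody has to debate them mid-incident is a small rule like the sketch below. The field names and the dollar thresholds are invented for illustration; finance and engineering would need to agree on the real ones.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    affects_checkout: bool
    estimated_loss: float  # in your reporting currency
    peak_period: bool      # e.g. a major sale weekend


def classify(incident, p1_loss=10_000.0, p2_loss=1_000.0):
    """Return 'P1', 'P2', or 'P3' from business impact alone (hypothetical thresholds)."""
    if incident.affects_checkout and incident.estimated_loss >= p1_loss:
        return "P1"
    if incident.affects_checkout and (
        incident.estimated_loss >= p2_loss or incident.peak_period
    ):
        return "P2"
    return "P3"


print(classify(Incident(affects_checkout=True, estimated_loss=24_000.0, peak_period=True)))  # P1
print(classify(Incident(affects_checkout=True, estimated_loss=200.0, peak_period=False)))    # P3
```

This way a payment incident is not a P1 by default, but it becomes one as soon as the measured loss crosses the agreed line, which covers both the finance and the engineering concern.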