I'm trying to get a clearer idea of what happens during on-call operations, especially regarding deployments, rollbacks, and handling incidents. I'm looking for insights from anyone involved with deployments, monitoring uptime, or on-call duties. Specifically, I'd like to know:

1. What occurs step-by-step when a deployment fails?
2. Who typically decides to roll back a deployment, and how quickly does it happen?
3. What tools do you rely on during an incident?
4. What parts of this process tend to be the most stressful or prone to errors?
5. What if the main on-call person can't be reached?
6. Is there anything you wish could be automated but isn't, and why?
7. What tasks would you never trust automation to handle?
8. How often do bad deployments impact customers?

Thank you for sharing your experiences!
4 Answers
Great question! It varies by organization, but typically if a big issue arises post-deployment, you need to roll back to a stable version immediately, which can usually be done quickly if you’re well set up. We conduct post-mortems and fill out Root Cause Analysis reports to track what went wrong, learn from it, and avoid repeat issues. As for customer impact, when incidents happen, they’re usually caused by unseen data inconsistencies that pop up from time to time, not by the deployment process itself.
When a deployment fails, the order of the steps matters: we validate before moving to production, start small with a limited rollout (maybe just one region), and monitor performance closely so that alerts catch issues early. If a rollback is needed, the service owner generally makes that call, and quickly, especially if the service is stateless. During incidents, we rely on monitoring tools and logs to keep track of what's happening, and manual processes can be a real pain when things go awry.
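To make the "start small, monitor, roll back" flow concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `staged_rollout` function, the stage names, the `deploy` / `error_rate` / `rollback` callbacks, and the 1% error threshold are hypothetical stand-ins, not any particular tool's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    completed_stages: List[str]   # stages that passed their health check
    rolled_back: bool             # True if the rollout was aborted
    failed_stage: Optional[str] = None


def staged_rollout(
    stages: List[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    rollback: Callable[[str], None],
    max_error_rate: float = 0.01,  # hypothetical threshold, tune per service
) -> RolloutResult:
    """Deploy stage by stage; undo everything if a stage looks unhealthy."""
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        if error_rate(stage) > max_error_rate:
            # Roll back the failing stage first, then earlier ones, newest first.
            for done in reversed(completed + [stage]):
                rollback(done)
            return RolloutResult(completed, True, stage)
        completed.append(stage)
    return RolloutResult(completed, False)


if __name__ == "__main__":
    # Fake metrics: the second stage exceeds the error threshold.
    rates = {"one-region": 0.0, "half-fleet": 0.05, "everything": 0.0}
    result = staged_rollout(
        list(rates),
        deploy=lambda s: print(f"deploying to {s}"),
        error_rate=lambda s: rates[s],
        rollback=lambda s: print(f"rolling back {s}"),
    )
    print(result)
```

In a real setup the `error_rate` check would wait for metrics to settle before deciding, and a stateful service would need more than a simple version rollback.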
Totally agree! Manual processes can spiral out of control. Automation is a must, especially with ever-changing systems.
Happy New Year! To give you some context from my experience at a bank, when a deployment goes south, we usually have a Post Implementation Verification process that requires sign-off from the business owner if it affects customers. The tech teams can call off or roll back the deployment if there's a technical issue, provided they're within the change window. For non-customer-facing systems, the decision is often up to the implementers, but again, we need approval if we’re breaching our window. Most of our deployments are on OpenShift, which lets us roll back container versions pretty smoothly with automation, unless there’s a database update involved.
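The decision rules described above can be sketched as a small Python function. This is only one reading of that description: the `Change` fields, the `rollback_decision` name, and the returned labels are hypothetical, and the post does not say who grants approval outside the change window, so that case is returned simply as "extra approval needed".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass
class Change:
    customer_facing: bool
    has_database_update: bool
    window_start: datetime
    window_end: datetime


def rollback_decision(
    change: Change, now: datetime, technical_issue: bool
) -> Tuple[str, bool]:
    """Return (who decides, whether an automated container rollback is enough)."""
    # Assumption: a database update rules out a purely automated rollback.
    automated = not change.has_database_update
    in_window = change.window_start <= now <= change.window_end

    if not in_window:
        # Breaching the change window needs approval; approver is unspecified.
        return ("extra approval needed", automated)
    if not change.customer_facing:
        return ("implementers", automated)
    if technical_issue:
        return ("tech team", automated)
    # Customer-affecting and not a purely technical failure:
    # treated here as a business-owner sign-off decision.
    return ("business owner", automated)


if __name__ == "__main__":
    change = Change(
        customer_facing=True,
        has_database_update=True,
        window_start=datetime(2024, 1, 6, 22, 0),
        window_end=datetime(2024, 1, 7, 2, 0),
    )
    print(rollback_decision(change, datetime(2024, 1, 6, 23, 30), True))
```

Writing the rules down like this mostly helps spot the gaps, such as what happens to a database change once the window has closed.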
That's interesting! I assume during a significant change, you have a system in place to monitor everything closely, right?
I get why late-night deployments are a thing, but I’d much rather deploy during business hours. That way, if something goes wrong, I’m already at my desk and not getting woken up in the middle of the night. Plus, tired people make mistakes—deploying at ungodly hours just increases the chances of errors.
For sure! I can’t count the number of times I’ve been dragged out of bed to fix something that could have waited until morning.

I’ve noticed that too — the real issues often come from data nuances rather than the automation failing.