We've been experiencing a lot of incidents at my company, mostly caused by developer changes that didn't look like significant errors at the time. I'm interested in hearing what strategies or practices your companies have implemented to reduce incidents, particularly those that are tricky to pinpoint or diagnose.
7 Answers
Some companies reduce incidents by limiting the number of changes released at once, on the theory that fewer changes mean fewer chances for problems. It's like saying there can't be a multi-car pileup if only one car is on the road at a time. In many cases this creates a false sense of safety, because the same issues are just spread over a longer period.
It's all about investing in automation, testing, continuous integration (CI), and continuous deployment (CD). These tools help catch issues before they cause problems in production.
1. Foster a strong postmortem culture to prevent repeating mistakes.
2. Prioritize and track action items from these postmortems.
3. Enhance observability for quicker detection, focusing on symptom-based alerting and SLO monitoring.
4. Refine the release process using canaries or blue/green deployments, ideally coupled with effective observability.
5. Ensure any risky changes are flagged before rollout, come with rollback instructions, and have proper observability in place.
It's crucial to build a team that values reliability and keeps prioritizing it over time; after a big change, teams often slip back into old habits.
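To make the symptom-based alerting and SLO monitoring in point 3 a bit more concrete, here is a minimal sketch of a burn-rate check in Python. It isn't tied to any particular monitoring tool: the function names, the 99.9% target, and the 14.4 threshold are all illustrative assumptions, and the request and error counts would come from whatever metrics system you already have.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(short_window: tuple[int, int],
                long_window: tuple[int, int],
                slo_target: float = 0.999,   # assumed target
                threshold: float = 14.4) -> bool:  # assumed threshold
    """Page only if both a short and a long window are burning fast.

    Each window is an (errors, requests) pair. Requiring both windows
    filters out brief blips while still reacting to sustained problems.
    """
    short_rate = burn_rate(*short_window, slo_target)
    long_rate = burn_rate(*long_window, slo_target)
    return short_rate >= threshold and long_rate >= threshold


if __name__ == "__main__":
    # Hypothetical counts: last 5 minutes and last hour.
    print(should_page((20, 1000), (150, 10000)))
```

The point of alerting on user-visible symptoms (failed requests against a target) rather than on individual causes is that it still fires when the cause is a change nobody thought was risky.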
We find that more management and formal procedures, like exclusively communicating through tickets, help streamline processes. We're still on the path to seeing improvements, though.
Addressing the root cause is key. Conducting retrospectives after incidents can reveal permanent fixes, whether that means creating a new testing environment, adopting new practices, or performing load testing. However, without management support for necessary resources, progress can stall.
Keep it simple! KISS—"Keep It Simple, Stupid"—is a great philosophy. Remember that while infrastructure might be smooth, sometimes the software isn't.
It's essential for software engineers to own their code in production and be on call. This creates accountability and can lead to a stronger focus on reliability.
