I've run into various production issues stemming from environment variables: missing keys, incorrect formats, and even production running with development values. Everything often looks fine until deployment, and that's exactly when things go wrong. I'm curious how other teams prevent these environment and configuration failures. Do you validate in CI, or do you rely on conventions and documentation?
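For context, the most I do today is a small check script in CI along these lines; the variable names and patterns are just placeholders from my setup, not a recommendation:

```python
# ci_check_env.py - fail the pipeline if required variables are missing or malformed.
# The variable names and patterns below are placeholders; adapt them to your config.
import os
import re
import sys

REQUIRED = {
    "DATABASE_URL": re.compile(r"^postgres(ql)?://"),  # must look like a Postgres URL
    "API_KEY": re.compile(r"^\S{20,}$"),                # non-empty and reasonably long
    "APP_ENV": re.compile(r"^(production|staging)$"),   # catch development values leaking in
}

errors = []
for name, pattern in REQUIRED.items():
    value = os.environ.get(name)
    if value is None:
        errors.append(f"missing: {name}")
    elif not pattern.match(value):
        errors.append(f"bad format: {name}")

if errors:
    print("Environment check failed:\n  " + "\n  ".join(errors))
    sys.exit(1)
print("Environment check passed.")
```

It catches the missing-key and wrong-format cases, but it can't tell whether a value is actually a leftover development value, which is the part I'm most unsure how to automate.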
5 Answers
I once tried changing the network settings in my Ceph cluster without following the proper steps. It completely locked the cluster, which caused the Proxmox cluster depending on it to go offline. We ended up spending two days and calling support to get everything running again!
Back when NetWare 4.0 first launched, it was really fragile. I made the mistake of dragging the icon for the main drive array, which disconnected the entire array. The whole network crashed, and when rebooting didn't bring it back, I had to rebuild the server from scratch. They actually fixed that in later versions.
I once caused a significant outage in the UK environment because of a simple configuration mistake. A fat finger literally took down about 200 devices—it was a real wake-up call for me.
A colleague botched a sync of the ArgoCD gateway application by using the "force" and "replace" options. That left the gateway broken, and I ended up uninstalling everything, including Karpenter. I think there was some desynchronization between the Karpenter nodes and the load balancer, which made it even more complicated, and in the end I had to reinstall the entire cluster from scratch.
In my experience, it's all about double-checking (or even triple-checking) configurations. Testing similar setups in different environments is also essential. Plus, it's important that all the people doing these checks truly understand the configurations and their potential impact.
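As a rough sketch of what I mean by testing setups across environments, something like this flags keys that exist in one environment's file but not another; the paths and the simple KEY=VALUE format are assumptions about the project layout:

```python
# compare_env_keys.py - flag keys defined in one .env file but missing from another.
# The file paths below are placeholders; point them at your per-environment files.
import sys

def read_keys(path):
    """Collect KEY names from a simple KEY=VALUE .env file, skipping comments and blanks."""
    keys = set()
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            keys.add(line.split("=", 1)[0].strip())
    return keys

staging = read_keys("envs/staging.env")
production = read_keys("envs/production.env")

only_staging = staging - production
only_production = production - staging

if only_staging or only_production:
    if only_staging:
        print("Only in staging:", ", ".join(sorted(only_staging)))
    if only_production:
        print("Only in production:", ", ".join(sorted(only_production)))
    sys.exit(1)
print("Both environments define the same keys.")
```

Running it as a CI step (or a pre-deploy hook) turns "we forgot to add that key to production" into a failed pipeline instead of an outage.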

Oh man, I've also faced issues with Ceph. We had a cluster failure during rebalancing because someone mistakenly used bcache devices for the OSDs. They seemed fast at first, but when rebalancing kicked in, everything slowed to a crawl, which was a nightmare with 300 VMs relying on that data!