I'm currently juggling numerous Helm charts and Terraform configurations spread across various repositories. The inconsistency in validation is overwhelming: some repositories have pre-commit hooks while many do not, and some run validation in CI while others simply deploy directly to production. Recently, I discovered a manifest with an outdated container image that had gone unnoticed for months because that repository wasn't being checked. I tried using a spreadsheet to track everything, but that's quickly falling apart. I'm looking for effective strategies to validate Infrastructure as Code at scale without it becoming a full-time job. What are other teams doing to tackle this sustainably?
5 Answers
Spreadsheets for tracking Infrastructure as Code? That's pretty much the first sign that you've got a problem on your hands. Good luck with that!
Consider using admission controllers like OPA Gatekeeper. They can reject problematic manifests at admission time, when they are applied to the cluster, regardless of what did or didn't run in CI. It won't catch everything, but it does help keep images that violate your policy (unpinned tags or unapproved registries, for example) out of your clusters. You should still implement some kind of drift detection for better accountability.
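To make that concrete, here's a rough sketch of the kind of image rule an admission policy might enforce. Real Gatekeeper policies are written in Rego, not Python; this is only an illustration of the logic, and the registry name, the forbidden tag list, and the function names are all made up for the example.

```python
from typing import Iterator

# Hypothetical policy values; replace with your own registries and tag rules.
ALLOWED_REGISTRIES = ("registry.example.com/",)
FORBIDDEN_TAGS = {"latest"}


def iter_images(manifest: dict) -> Iterator[str]:
    """Yield container images from a Pod or a pod-template workload."""
    spec = manifest.get("spec", {})
    # Deployments, StatefulSets, DaemonSets and Jobs nest the pod spec under
    # spec.template.spec; a bare Pod keeps it directly under spec.
    # (CronJobs nest one level deeper and are not handled in this sketch.)
    pod_spec = spec.get("template", {}).get("spec", spec)
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key, []):
            yield container.get("image", "")


def split_tag(image: str) -> tuple[str, str]:
    """Return (repository, tag); an untagged image is treated as 'latest'."""
    if "@" in image:  # pinned by digest, e.g. repo@sha256:...
        repo, _, digest = image.partition("@")
        return repo, digest
    # Only look for ':' in the last path segment so that a registry port
    # (registry.example.com:5000/app) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" in last_segment:
        repo, _, tag = image.rpartition(":")
        return repo, tag
    return image, "latest"


def violations(manifest: dict) -> list[str]:
    """List policy violations for every image in the manifest."""
    problems = []
    for image in iter_images(manifest):
        _, tag = split_tag(image)
        if not image.startswith(ALLOWED_REGISTRIES):
            problems.append(f"{image}: registry not allowed")
        if tag in FORBIDDEN_TAGS:
            problems.append(f"{image}: tag '{tag}' is not pinned")
    return problems
```

Feed it a manifest parsed into a dict (from YAML or JSON) and reject the change if the returned list is non-empty. The same rule can run in CI and at admission, so a repo that skips CI still gets checked at the cluster.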
You might want to explore GitOps as a solution. If every change reaches the cluster only through Git, you can put the same validation gate in front of all of your scattered repositories instead of relying on each repo's own setup.
The scattered validation you're describing is exactly the problem many platforms aim to solve. Centralized policy enforcement applies the same rules across all your repositories. Tools like Checkmarx can scan Helm charts and Terraform for misconfigurations and vulnerabilities, aligning findings with what's deployed. This way, you can catch unpatched containers based on actual runtime conditions rather than static analysis alone. Far better than tracking everything manually with spreadsheets!
To tackle IaC sprawl, you need centralized scanning instead of relying on manual tracking. Policy-as-code tools can enforce validation regardless of your repo setups. Checkmarx can automate security scans, catching outdated images and misconfigurations before deployment, saving you from the headache of reactive fixes later on.
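If you want a quick replacement for the spreadsheet while you evaluate tools, a small audit script can at least tell you which repos have no validation wired up. This is a minimal sketch, assuming all your repos are cloned side by side under one directory; the marker files in `CHECKS` are my guess at what counts as evidence of validation, so adjust them to your own CI system.

```python
import sys
from pathlib import Path

# Marker files or directories treated as evidence of validation.
# This list is an assumption; extend it for your own tooling.
CHECKS = {
    "pre-commit": [".pre-commit-config.yaml"],
    "ci": [".github/workflows", ".gitlab-ci.yml", "Jenkinsfile"],
}


def has_iac(repo: Path) -> bool:
    """True if the repo contains a Helm chart or Terraform files."""
    return any(repo.rglob("Chart.yaml")) or any(repo.rglob("*.tf"))


def audit(root: Path) -> dict[str, dict[str, bool]]:
    """Map each IaC repo under root to the validation markers it has."""
    report = {}
    for repo in sorted(p for p in root.iterdir() if p.is_dir()):
        if not has_iac(repo):
            continue  # skip repos with no Helm or Terraform content
        report[repo.name] = {
            check: any((repo / marker).exists() for marker in markers)
            for check, markers in CHECKS.items()
        }
    return report


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python audit.py /path/to/checkouts
    for name, result in audit(Path(sys.argv[1])).items():
        missing = [check for check, ok in result.items() if not ok]
        status = "OK" if not missing else "missing " + ", ".join(missing)
        print(f"{name}: {status}")
```

Keep in mind that a CI file existing doesn't prove the pipeline actually validates anything, so treat the output as a list of repos to look at first, not as a clean bill of health.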

I've got an Ansible playbook set up to help you migrate that Excel spreadsheet over to BigQuery or Snowflake!