I just started a new job and inherited a Kubernetes cluster that's running more than 200 container images, but there's no documentation or source information for any of them. Some of these images are pulling from Docker Hub without version tags and others are linked to personal registries that are no longer active. I've discovered that many of the base images haven't been updated in over a year. When I asked the previous team how they managed to vet these images before deployment, their answer was essentially that they didn't. I've spent the last two weeks trying to catalog what's running and trace where these images come from, but half of them are a mystery. I'm looking for advice on how to tackle this chaotic situation. Thanks in advance!
7 Answers
Consider pairing automated scanning with automatic remediation. Image scanners can flag secrets baked into container images; we integrated ours with HashiCorp Vault, so when a leaked secret is found it gets revoked and a new one is issued automatically. It took quite a bit of setup, but we haven't had a secret-related incident since.
Let's be real: how much do you actually need to know about what each application does? If it's not part of your job, I'd focus on keeping the cluster running and report the situation to IT security. You shouldn't be bearing all this responsibility on your own!
A solid first step is to implement monitoring and logging: seeing which pods actually receive network traffic helps you trace them back to their consumers. Pods with high CPU usage but no inbound traffic may be background tasks. Another trick is the classic scream test: disable deployments one by one and see who complains!
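If you go the scream-test route, scaling a deployment to zero is safer than deleting it, because it can be restored the moment someone complains. A minimal sketch (the deployment name and namespace here are hypothetical, and `kubectl top` assumes metrics-server is installed):

```shell
# Sort pods by CPU across all namespaces to spot busy workloads
# that receive no traffic (requires metrics-server).
kubectl top pods -A --sort-by=cpu

# Scream test: scale a suspect deployment to zero instead of deleting it.
kubectl scale deployment suspect-app --replicas=0 -n legacy

# Restore instantly if someone complains:
kubectl scale deployment suspect-app --replicas=1 -n legacy
```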
Document your findings and escalate the issue appropriately. Depending on what you discover, there may be significant risks concerning business continuity. Outline the application endpoints that are active, make a report, and send it up the chain. You might also want to prepare a plan for a new cluster to migrate these workloads to.
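For the endpoint inventory in that report, a quick sketch (the report filename is just an example) is to dump every Service and Ingress into a dated file you can send up the chain:

```shell
# Dump every Service and Ingress in the cluster into a dated text file
# that can be attached to the report for management.
report="endpoint-report-$(date +%F).txt"
kubectl get services -A -o wide > "$report"
kubectl get ingress -A >> "$report"
```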
It's pretty common to stumble into situations like this, especially if the organization isn't highly regulated. Reach out to all the developers and ask them to identify their containers so you can label them properly. Make it clear that you'll be removing any unclaimed workloads by the end of the month. It's a good way to start cleaning up the mess!
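Kubernetes labels are a natural place to record who claims what; the deployment, label key, and team names below are made up for illustration:

```shell
# When a team claims a workload, record ownership as a label.
kubectl label deployment billing-api owner=payments-team -n prod

# Later, list deployments that still have no owner label --
# these are the candidates for removal at the end of the month.
kubectl get deployments -A -l '!owner'
```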
Start with an automated vulnerability scanner like Trivy, which can help you prioritize security issues. That said, it may be wise to make sure the cluster is stable before diving into security updates. Also evaluate which workloads are actually in use; deleting unused things is often far simpler than updating them.
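As a sketch of that triage, you could build a deduplicated image inventory with kubectl and feed it straight into Trivy, keeping only the worst findings (assumes kubectl access to the cluster and Trivy installed):

```shell
# Build a deduplicated inventory of every image the cluster is running.
kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' \
  | tr ' ' '\n' | sort -u > images.txt

# Scan each image, reporting only HIGH/CRITICAL vulnerabilities so you
# can triage worst-first. Dead registries will simply fail the scan.
while read -r img; do
  echo "== $img"
  trivy image --severity HIGH,CRITICAL --quiet "$img" || echo "scan failed: $img"
done < images.txt
```

Images that fail to pull are a useful signal in themselves: those are the ones tied to registries that no longer exist.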
What’s your responsibility scope like? If you're just managing Kubernetes itself, it's probably a good idea to let the app owners tackle their parts. But if you’re also overseeing the deployments, then putting a system in place for organization is crucial. If you’re not empowered to implement such processes, it might not strictly fall on you to resolve these issues. Just keep everything clear about roles and responsibilities!

I wouldn't say it's entirely normal even outside regulated industries, but you're right that it's a frequent issue. Even in regulated environments, audits can be surprisingly shallow on the technical side. Just make sure management backs your actions, in writing if possible, in case things get messy later.