I'm facing an issue where pods in my K8s cluster can't start because my AWS ECR lifecycle policy is expiring images. Even though I'm using Pull Through Cache for public images, pods are failing with `ImagePullBackOff`. This is especially problematic because my setup uses an Istio service mesh and multiple Helm charts that rely on these cached images. When an image like `istio/proxyv2` expires, the upstream public image still exists, but ECR isn't re-pulling it from upstream as expected. Manually pulling images has been my temporary fix, but it's not scalable or reliable. What are the best practices in the K8s community for handling this issue while keeping pod startup times fast?
3 Answers
One approach you could take is to modify your ECR lifecycle policy to keep a certain number of images instead of just expiring them based on their age. This way, you'd always have at least one version available for pulling. It could help mitigate downtime when pods try to start up.
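As a rough sketch of what that could look like, ECR lifecycle policies support a count-based rule (`countType: imageCountMoreThan`) that expires only images beyond the N most recent ones. The snippet below applies such a rule with boto3; the repository name and retention count are placeholders you'd adapt to your pull-through-cache repositories.

```python
import json
import boto3

# Placeholder repository name and retention count -- adjust to your setup.
REPOSITORY = "docker-hub/istio/proxyv2"
KEEP_LAST = 5

# Keep the KEEP_LAST most recent images and expire everything older,
# instead of expiring purely by image age.
lifecycle_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": f"Keep the last {KEEP_LAST} images",
            "selection": {
                "tagStatus": "any",
                "countType": "imageCountMoreThan",
                "countNumber": KEEP_LAST,
            },
            "action": {"type": "expire"},
        }
    ]
}

ecr = boto3.client("ecr")
ecr.put_lifecycle_policy(
    repositoryName=REPOSITORY,
    lifecyclePolicyText=json.dumps(lifecycle_policy),
)
```

You would need to apply this per repository (or via a template in your IaC tooling), since lifecycle policies are set at the repository level.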
If your infrastructure remains static, consider running scheduled jobs to pull all required images to each node. Alternatively, you could adjust your policy to keep images based on usage or relevant metrics tailored to your application needs.
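If you try the scheduled-pull route, a minimal sketch could look like the script below, run periodically on each node (via cron or a similar scheduler). The image list, registry URL, and tag are placeholders, it assumes the `docker` CLI is present on the node and already authenticated to ECR (e.g. via `aws ecr get-login-password`), and on containerd-based nodes you'd swap in `crictl pull` instead.

```python
import subprocess
import sys

# Hypothetical list of images your workloads depend on -- adjust to your charts.
REQUIRED_IMAGES = [
    "<account-id>.dkr.ecr.<region>.amazonaws.com/docker-hub/istio/proxyv2:<tag>",
]

def prepull(images):
    """Pull each image so it is present in the node's local image cache."""
    failures = []
    for image in images:
        # Assumes the docker CLI is installed and logged in to the registry;
        # use `crictl pull` on containerd-based nodes instead.
        result = subprocess.run(["docker", "pull", image])
        if result.returncode != 0:
            failures.append(image)
    return failures

if __name__ == "__main__":
    failed = prepull(REQUIRED_IMAGES)
    if failed:
        print("Failed to pull:", ", ".join(failed), file=sys.stderr)
        sys.exit(1)
```

Note that this only keeps the node-local cache warm; it doesn't stop ECR from expiring the cached copy, so it works best combined with a saner lifecycle policy.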
That sounds interesting! I suppose the metric-based approach could involve using resource tags. I'll definitely do some more research on that.
It sounds like this issue is more about your ECR lifecycle policies than Kubernetes itself. You could look into fixing the policies, since they're the trigger for the pod failures. Relaxing the expiry rules so fewer of the cached images get removed would be a good start.
Absolutely, but I'd still prefer to keep the ECR lifecycle policy as is to manage costs.
Does AWS ECR allow that option? I'm using the `latest` tag for my images.