How can I prevent Karpenter from killing my pod during consolidation?

Asked By TechWhiz42 On

I have a long-running deployment, Service X, that runs in the evenings for a scheduled event. During off-hours the cluster load drops significantly, so Karpenter consolidates aggressively, removing nodes and packing pods onto fewer instances. When Service X gets rescheduled during consolidation, it takes about 2 to 3 minutes to become ready again. If another service tries to fetch data from Service X during that window, the result is a noticeable outage.

I'm considering options such as running Service X on a dedicated node or marking the pod as non-disruptable to prevent eviction. However, both feel heavy-handed or could drive up costs. Is there a more cost-effective way to handle this, given the long startup time, intermittent traffic, and Karpenter's aggressive consolidation, without locking capacity or disabling consolidation entirely?
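For reference, a minimal sketch of the "non-disruptable" option mentioned in the question. It assumes a recent Karpenter release where the pod annotation is `karpenter.sh/do-not-disrupt` (older releases used `karpenter.sh/do-not-evict`), and it shows only the relevant fragment of a Deployment's pod template:

```yaml
# Fragment of a Deployment spec; only the annotation is shown.
spec:
  template:
    metadata:
      annotations:
        # Asks Karpenter not to voluntarily disrupt the node while this pod runs on it.
        karpenter.sh/do-not-disrupt: "true"
```

While the annotated pod is running, its node is excluded from consolidation, which is why this option can keep an underused node alive.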

3 Answers

Answered By CloudGuru89 On

Have you thought about using a PodDisruptionBudget (PDB)? It lets you limit how many pods can be voluntarily disrupted at once during events like consolidation. Karpenter's documentation has more detail. It could help ensure that at least one instance of your service stays available while pods are being rescheduled.
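A minimal PDB sketch for this suggestion. The object name and the `app: service-x` label are assumptions; the selector has to match whatever labels the Service X pods actually carry:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: service-x-pdb          # hypothetical name
spec:
  minAvailable: 1              # keep at least one matching pod running
  selector:
    matchLabels:
      app: service-x           # assumed pod label
```

Note that with a single replica, `minAvailable: 1` leaves no room for any voluntary eviction, so in practice it behaves much like marking the pod non-disruptable. It only allows graceful rescheduling once there are at least two replicas.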

TechWhiz42 -

Yeah, I did consider that option!

Answered By DevOpsDude88 On

Why not run your service as two pods that are always scheduled on separate nodes? Combined with a PDB, this should keep one pod serving while the other is rescheduled. It seems like a straightforward fix to me.
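A rough sketch of what that could look like, using pod anti-affinity on the hostname topology key. The names, label, and image below are placeholders, not taken from the actual Service X manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-x                # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: service-x             # assumed label
  template:
    metadata:
      labels:
        app: service-x
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never place two service-x pods on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: service-x
              topologyKey: kubernetes.io/hostname
      containers:
        - name: service-x
          image: registry.example.com/service-x:latest   # placeholder image
```

Because the rule is "required", a replica stays Pending if no second node is available. A `preferredDuringSchedulingIgnoredDuringExecution` rule is the softer alternative if that trade-off is not acceptable.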

TechWhiz42 -

I had the same thought and pushed the developer for it. But due to tight deadlines, they have implemented neither the separate-node placement nor the extra replicas. Now they're asking me to find another solution, with a dedicated instance as the last resort until they can sort it out.

Answered By K8sMaster101 On

If it's feasible, consider scaling out your deployment to multiple replicas. For example, run two replicas and add a PDB that requires at least 1 available pod. This could help maintain service availability during consolidation.

TechWhiz42 -

Currently, that's not possible since the service needs to fetch data from an external vendor first and store it in our database. The dev team opted for a simple architecture due to time constraints.
