How to Optimize Node Resource Usage for Imbalanced Workloads in GKE?

Asked By CaptainRandom42 On

I've been working on optimizing resource usage for workloads in Google Kubernetes Engine (GKE), mainly with the Vertical Pod Autoscaler (VPA) and Pod Disruption Budgets (PDBs). However, our setup is complicated: five different deployments subscribe to a single topic, and each pod is pinned to specific partitions. Because data volume varies widely across partitions, CPU and memory usage ends up imbalanced across the pods.

The VPA calculates its resource recommendations from the average usage across all pods in a deployment, which isn't ideal for us because it doesn't account for the pods that run hot. As a result, several CPU-heavy pods can end up scheduled onto the same node, which causes CPU throttling.

I've also tried setting CPU requests based on the highest observed usage, but that just adds nodes and raises costs. For now we manage it with cron jobs that raise the minimum CPU request in the VPA policy ahead of peak times, but it's not a great solution, especially since it ends up over-allocating resources for some pods.
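For context, the cron job basically patches the VPA's resourcePolicy before the peak window, roughly like this (the VPA name, container name, schedule, and values here are just placeholders for what we actually run):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: raise-vpa-min-cpu              # placeholder name
spec:
  schedule: "0 7 * * *"                # shortly before the daily peak
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vpa-patcher    # needs RBAC to patch VPA objects
          restartPolicy: OnFailure
          containers:
          - name: patch
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            # Raise the minimum CPU the VPA may recommend for the consumer container
            - >
              kubectl patch vpa consumer-vpa --type merge -p
              '{"spec":{"resourcePolicy":{"containerPolicies":[{"containerName":"consumer","minAllowed":{"cpu":"1500m"}}]}}}'

A second cron job lowers the minimum again once the peak is over.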

I'm curious if anyone else has dealt with this situation. Is there a way for the VPA to use peak resource usage instead of the mean?

3 Answers

Answered By DevSquadLeader On

I get that finding the balance is tricky. Setting CPU requests based on peak usage might seem like a good idea, but it can wreak havoc on the scheduler: if usage is constantly fluctuating, your nodes end up poorly packed. Ideally you want stable requests. Have you considered keeping lower requests for your lightweight pods while letting the heavier pods keep higher requests? That way you can fit more of the small pods per node without creating node pressure.
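Concretely, that could mean putting the hot partitions in their own deployment with larger, stable requests. A rough sketch, with all names, images, and numbers purely illustrative:

# Light partitions: small, stable requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: consumer-light
spec:
  replicas: 3
  selector:
    matchLabels: {app: consumer-light}
  template:
    metadata:
      labels: {app: consumer-light}
    spec:
      containers:
      - name: consumer
        image: registry.example.com/consumer:latest   # placeholder image
        resources:
          requests: {cpu: 250m, memory: 512Mi}
          limits: {memory: 512Mi}    # no CPU limit, so bursts aren't throttled
---
# Hot partitions: fewer replicas, larger stable requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: consumer-heavy
spec:
  replicas: 2
  selector:
    matchLabels: {app: consumer-heavy}
  template:
    metadata:
      labels: {app: consumer-heavy}
    spec:
      containers:
      - name: consumer
        image: registry.example.com/consumer:latest
        resources:
          requests: {cpu: "2", memory: 2Gi}
          limits: {memory: 2Gi}

Keeping a memory limit but no CPU limit is also a common way to bound memory while avoiding the CPU throttling you mentioned.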

Answered By TechGuru77 On

The latest Kubernetes version (1.33) promotes in-place pod resizing to beta, which lets you change a pod's resource requests on the fly without recreating it. You could leverage historical data to adjust requests dynamically based on past usage. I'm not sure there's an operator for that yet, but I built a proof of concept with GPT's help that got me about 80% of the way there pretty quickly. Once EKS, GKE, and AKS ship 1.33, this could save a lot of cost, and it might suit your situation nicely.
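If it helps, the rough shape of it: each container declares a resizePolicy, and then (as far as I understand the beta) the new requests are applied through the pod's resize subresource, e.g. kubectl patch pod <pod> --subresource resize, instead of recreating the pod. The names, image, and values below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: consumer-0                     # placeholder
spec:
  containers:
  - name: consumer
    image: registry.example.com/consumer:latest   # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired       # CPU can be resized in place
    - resourceName: memory
      restartPolicy: RestartContainer  # memory changes restart the container
    resources:
      requests: {cpu: 500m, memory: 1Gi}
      limits: {memory: 1Gi}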

DataDrivenDude -

Oh, that sounds interesting! What did you set up with GPT? Was it related to the new feature or something else?

Answered By CodeWiz89 On

Have you thought about using KEDA? It can autoscale on metrics other than CPU or memory, for example the backlog or consumer lag of your message queue exposed through something like Prometheus. That way your pods scale on actual load rather than on resource usage alone.
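A minimal sketch of what that could look like, assuming you already export some per-deployment lag metric to Prometheus; the names, query, and threshold here are all placeholders:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-heavy-scaler                  # placeholder
spec:
  scaleTargetRef:
    name: consumer-heavy                       # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090      # placeholder address
      query: 'sum(kafka_consumergroup_lag{group="consumer-heavy"})'   # placeholder metric
      threshold: "5000"                        # roughly one replica per 5000 lag

KEDA then drives an HPA under the hood, so replicas are added and removed based on the lag rather than CPU.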
