Hey everyone! I'm managing a Kubernetes setup on DigitalOcean, where Karpenter isn't available, so I'm handling capacity planning, node rightsizing, and topology design manually on top of the Cluster Autoscaler. My current workflow: analyze workload behavior, compare CPU and memory requests against actual usage, categorize workloads and create dedicated node pools for each category (rough example of the pinning below), add buffer capacity for peak loads, and track it all in a Google Sheet. It works to a point, but it's manual, time-consuming, and error-prone. I'm looking for tools or workflows that can help automate or improve node rightsizing, binpacking strategy, and overall cluster topology planning. Any recommendations would be appreciated!
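In case it helps to see the topology side concretely, here's roughly how one of the categorized workloads ends up on its pool. This is a simplified sketch of our manifests; the pool name, taint, and image are placeholders, and the node selector relies on the `doks.digitalocean.com/node-pool` label that DOKS sets on each node (use whatever label your pools actually carry).

```yaml
# Simplified Deployment for a "memory-heavy" workload pinned to its own pool.
# Pool name, taint key/value, and image are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: analytics-worker
  template:
    metadata:
      labels:
        app: analytics-worker
    spec:
      nodeSelector:
        doks.digitalocean.com/node-pool: mem-optimized   # label set by DOKS per pool
      tolerations:
        - key: workload-class          # matches the taint on that pool
          operator: Equal
          value: memory-heavy
          effect: NoSchedule
      containers:
        - name: worker
          image: registry.example.com/analytics-worker:1.4
          resources:
            requests:
              cpu: 500m
              memory: 2Gi
            limits:
              memory: 2Gi
```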
3 Answers
It seems like you’ve got a solid approach already! Here are a few tips:
- Consider using message queues instead of relying solely on synchronous HTTP calls between services. That helps a lot with scaling, because you can scale workers on queue depth rather than on request-level metrics.
- Limit the number of node groups; having too many can slow down the Cluster Autoscaler significantly.
- Try to keep at least a few nodes in each availability zone to mitigate issues like noisy neighbors; the exact number depends on factors like your pod anti-affinity rules.
- Larger nodes can improve binpacking and reduce per-node overhead, but they make autoscaling less granular, since every scale-up or scale-down moves a bigger chunk of capacity at once.
- If you're able to incorporate KEDA for scaling based on queue lengths, that works great!
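If you do go that route, the queue-based trigger is only a few lines of YAML. A rough sketch for a RabbitMQ-backed worker; the Deployment name, queue name, and per-replica threshold are just placeholders, and `RABBITMQ_HOST` is assumed to be an env var on the workload holding the AMQP connection string:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: email-worker-scaler
  namespace: jobs
spec:
  scaleTargetRef:
    name: email-worker             # Deployment to scale (example name)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: email-jobs      # example queue
        mode: QueueLength
        value: "50"                # target messages per replica
        hostFromEnv: RABBITMQ_HOST # AMQP connection string from the workload's env
```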
Just a thought — are you sure your scaling metrics are spot-on? Relying only on CPU and memory might not be sufficient, depending on your applications. Have you considered other metrics?
You're right! CPU and memory alone often miss the full story.
We actually use tailored scaling strategies:
- For single-threaded applications like Node.js, we use HPA based on CPU (rough manifest below, after this list).
- Databases run with a fixed replica count and are sized vertically based on VPA recommendations.
- For web servers like Apache, scaling is based on HTTP worker processes.
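For the CPU-based case, the HPA itself is nothing exotic. Roughly this, with the Deployment name and the 70% utilization target as illustrative values rather than a recommendation:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-node
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-node               # example single-threaded Node.js service
  minReplicas: 2
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out before a single core saturates
```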
It’s not perfect, but it seems to work pretty well so far. Thanks for the reminder to always reassess our scaling methods!
Have you checked out Cast AI? They specialize in Kubernetes autoscaling; DigitalOcean isn't supported yet, but they could still help with optimizing HPA and VPA for your workloads and with monitoring cost efficiency across your cluster.
Thanks for the great checklist; it's super helpful!
We're already using RabbitMQ for background jobs and KEDA for scaling, which definitely helps stabilize things. Also, I had no idea that too many node groups could lead to slowdowns — I’ll work on simplifying our pool structure. Your feedback has been really enlightening!