I'm trying to deploy a text classification model using a BentoML image in a Kubernetes pod on an m5.large instance. I've set up 2 workers in the image and the pod consumes about 2.7Gi of memory. Despite configuring resource requests and limits to ensure QoS, the pod won't use more than around 50% of the CPU, even when I tested on a larger instance type. Interestingly, if I deploy another pod on the same node, it will utilize the leftover CPU resources. Can someone explain why my single pod isn't able to fully use the node's CPU resources?
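For reference, the container's resources section looks roughly like this (the numbers here are illustrative rather than copied from my manifest; equal requests and limits are what give the pod Guaranteed QoS):

```yaml
# Illustrative resources block for the BentoML container (example values only).
# requests == limits places the pod in the Guaranteed QoS class.
resources:
  requests:
    cpu: "1500m"
    memory: 4Gi
  limits:
    cpu: "1500m"
    memory: 4Gi
```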
3 Answers
Just a heads up: CPU limits are enforced through the CFS quota, so a pod can be throttled against its limit even while the node has plenty of idle CPU. Try removing the CPU limits and keeping only the requests; the request still reserves capacity for scheduling but doesn't cap usage. A sketch of what that looks like is below.
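Something along these lines, as a sketch (adjust the numbers to your workload):

```yaml
# Keep the CPU request so the scheduler still reserves capacity and the pod
# keeps a high cpu.shares weight, but drop the CPU limit so it can burst
# into idle CPU on the node. Keeping a memory limit avoids node memory pressure.
resources:
  requests:
    cpu: "1500m"
    memory: 4Gi
  limits:
    memory: 4Gi
```

Note that dropping the CPU limit moves the pod from Guaranteed to Burstable QoS, so weigh that against your original reason for setting both.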
This is a good case for metrics and observability tooling: check whether the pod is actually being CPU-throttled (e.g. the cAdvisor metric `container_cpu_cfs_throttled_periods_total` climbing) or simply not generating enough parallel work to use more CPU. Here's a useful resource on querying this with Prometheus: [Prometheus queries for CPU and memory](https://signoz.io/guides/prometheus-queries-to-get-cpu-and-memory-usage-in-kubernetes-pods/#how-to-query-cpu-usage-in-kubernetes-pods-with-prometheus).
If your BentoML service ends up effectively single-threaded, that would explain the cap: one busy thread can only saturate one of the m5.large's two vCPUs, which shows up as roughly 50% of node CPU (about 1000m). Check whether you can raise the number of workers; a configuration sketch is below.
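If you can scale workers, in BentoML 1.x that's usually done through a BentoML configuration file baked into (or mounted into) the image and referenced via the `BENTOML_CONFIG` environment variable. A sketch along these lines, with key names from the 1.x config schema, so verify against the version you're actually running:

```yaml
# bentoml_configuration.yaml (sketch, assuming the BentoML 1.x schema)
# Match the worker count to the vCPUs you want the API server to use,
# e.g. 2 on an m5.large (2 vCPU) or 4 on an m5.xlarge (4 vCPU).
api_server:
  workers: 2
```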
Yeah, BentoML does seem to allow you to set up more workers. How many do you have configured right now?
When I used an m5.large, it maxed out at 1100m CPU and didn't budge. Moving to an m5.xlarge only got me to 2100m. I'm curious if there's something in the setup I should be looking at.
Right! CPU requests only set the relative weight (cpu.shares) used to divide CPU between pods under contention, while limits impose a hard cap via the CFS quota. So removing the limits could definitely help if throttling is what you're seeing.