I'm trying to deploy a text classification model using a BentoML image in a Kubernetes pod on an m5.large instance. The image is configured with 2 workers, and the pod consumes about 2.7Gi of memory. Even though I've set resource requests and limits to get the QoS class I want, the pod never uses more than roughly 50% of the node's CPU, and the same cap showed up when I tested on a larger instance type. Interestingly, if I deploy another pod on the same node, it will happily use the leftover CPU. Can someone explain why a single pod isn't able to fully use the node's CPU resources?
2 Answers
This sounds like a good case for digging into metrics and observability tools. Check whether your pod is hitting its CPU limit and being throttled (CFS throttling), or whether the cap is coming from the application itself. Here's a useful resource on that: [Prometheus queries for CPU and memory](https://signoz.io/guides/prometheus-queries-to-get-cpu-and-memory-usage-in-kubernetes-pods/#how-to-query-cpu-usage-in-kubernetes-pods-with-prometheus).
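As a rough sketch of how you might check that programmatically, the snippet below runs two instant queries against the Prometheus HTTP API; the Prometheus URL and pod name are placeholders for your setup, and it assumes cAdvisor metrics are being scraped:

```python
import requests

PROM_URL = "http://prometheus:9090"   # placeholder: your Prometheus endpoint
POD = "bento-classifier"              # placeholder: your pod's name

def prom_query(query: str):
    """Run an instant query against the Prometheus HTTP API and return the result series."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Fraction of CFS periods in which the container was throttled over the last 5 minutes.
# Values near 1 mean the container keeps hitting its CPU limit.
throttle_query = (
    f'rate(container_cpu_cfs_throttled_periods_total{{pod="{POD}"}}[5m])'
    f' / rate(container_cpu_cfs_periods_total{{pod="{POD}"}}[5m])'
)

# Actual CPU usage in cores, to compare against the pod's requests and limits.
usage_query = f'rate(container_cpu_usage_seconds_total{{pod="{POD}"}}[5m])'

for label, query in [("throttled fraction", throttle_query), ("cores used", usage_query)]:
    for series in prom_query(query):
        print(label, series["metric"].get("container", ""), series["value"][1])
```

Roughly speaking, if the throttled fraction stays near zero while usage flatlines around one core, the ceiling is coming from the application rather than from the Kubernetes CPU limit.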
If your BentoML serving process is effectively single-threaded (for example, a single worker bound by the Python GIL), it can only saturate one core at a time, and on a 2-vCPU m5.large that shows up as roughly 50% of the node's CPU. Check whether you can increase the number of workers; that could help.
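For reference, here's a minimal sketch of raising the worker count with the newer BentoML 1.2+ service API; the class name, API, and worker count are made up for illustration, and on older 1.x releases I believe the equivalent knob is `api_server.workers` in the BentoML configuration file:

```python
import bentoml

# Minimal sketch of a BentoML 1.2+ service that starts multiple API worker
# processes, so CPU-bound inference can spread across cores instead of
# being pinned to a single one.

@bentoml.service(workers=4)  # assumption: the node has at least 4 vCPUs to back this
class TextClassifier:

    @bentoml.api
    def classify(self, text: str) -> str:
        # Placeholder for the real model's predict call.
        return "positive" if "good" in text else "negative"
```

With N workers, a CPU-bound model can use up to roughly N cores, provided the pod's CPU limit (and the node itself) actually allows it.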
When I used an m5.large, it maxed out at 1100m CPU and didn't budge. Moving to an m5.xlarge only got me to 2100m. I'm curious if there's something in the setup I should be looking at.
Yeah, BentoML does seem to allow you to set up more workers. How many do you have configured right now?