I've been diving into a project that explores how Large Language Model (LLM) workloads behave in a Kubernetes environment. My current setup is relatively simple: a local Minikube cluster, Ollama, and a Flask API running inside Kubernetes, exposed through a NodePort Service, with liveness and readiness probes, basic resource limits, and metrics gathered via kubectl top. The goal is to simulate a mini production workflow: deploying, accessing, breaking, debugging, and observing.

However, I'm wondering if this setup is too simplified for real-world applications. My key concerns:

- Is Minikube too basic for this type of project? Should I explore alternatives like EKS or kind for more accurate simulations?
- What meaningful failure scenarios could I test?
- Am I missing any crucial elements related to reliability or observability?

I'm not looking to complicate things unnecessarily; I just want to make sure it's practical beyond a demo. I'd love to hear your insights!
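For reference, here's a trimmed-down sketch of the kind of manifests I'm describing (the image name, ports, and probe thresholds are placeholders, not my exact config):

```yaml
# Sketch of the setup above; image, paths, and numbers are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
      - name: flask-api
        image: llm-api:dev              # placeholder image
        ports:
        - containerPort: 5000           # Flask's default port
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 5000
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: llm-api
spec:
  type: NodePort
  selector:
    app: llm-api
  ports:
  - port: 80
    targetPort: 5000
    nodePort: 30080                     # any free port in 30000-32767
```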
5 Answers
For a more production-oriented environment, you'd want to incorporate authentication, potentially through OIDC. Ollama is nice, but it's somewhat of a toy compared to more serious engines like vLLM. You'll also want to think about model storage, likely using S3, and figure out how to track token consumption for billing. Scalability is key here since these serving layers scale horizontally. Don't forget the NVIDIA GPU Operator when working with GPUs on Kubernetes! Tuning and benchmarking will also come in handy.
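If you do get to the GPU part: once the GPU Operator (or the bare device plugin) is installed, GPUs show up as a schedulable resource and a pod just requests one. A minimal sketch, assuming a GPU node, using vLLM's public serving image with a small model as a smoke test:

```yaml
# Assumes the NVIDIA GPU Operator or device plugin is installed, which
# exposes GPUs as the nvidia.com/gpu extended resource on each node.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-gpu-test
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest         # vLLM's OpenAI-compatible server
    args: ["--model", "facebook/opt-125m"] # tiny model, just a smoke test
    resources:
      limits:
        nvidia.com/gpu: 1                  # request one whole GPU
```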
Your learning setup seems solid, but Minikube might limit you for real scenarios. It's great for practice, but you'll hit walls around things like multi-node scheduling and real LoadBalancer networking once you start diving deeper.
Consider adding authentication to your setup: proxy the API behind something like LiteLLM, or explore vLLM, which supports API-key authentication. If you're looking to debug LLM prompts, Langfuse could be useful. Also, make sure you monitor GPU usage for optimization. Keep in mind that local setups tend to lack proper GPU scheduling optimizations, for example using MIG to run multiple models on a single GPU. Those are some of my thoughts to help enhance your project. Enjoy!
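On the auth point, vLLM's OpenAI-compatible server takes an API key on the command line. A rough sketch of wiring that key in from a Secret (the --api-key flag is from vLLM's docs, but double-check it against your version; the Secret name here is hypothetical):

```yaml
# Sketch: vLLM serving with a required API key pulled from a Secret.
# The --api-key flag and the vllm-api-key Secret are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - facebook/opt-125m
        - --api-key
        - $(VLLM_API_KEY)               # Kubernetes expands $(VAR) from env
        env:
        - name: VLLM_API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key        # hypothetical Secret holding the key
              key: token
```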
Ok sure, I'll keep this in mind while building it out, thanks!
To really simulate failures, try randomly killing pods or introducing network faults. This will show how robust your application is under unpredictable conditions like restarts or brief disconnects. Just be cautious with liveness probes: if startup takes too long, the probe can kill the container before it finishes loading and trap it in an endless restart loop.
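The standard fix for that restart loop is a startupProbe, which holds off the liveness probe until the app reports it has started. A minimal sketch (this fragment goes under the container spec; the path and thresholds are illustrative):

```yaml
# Sketch: a startupProbe delays liveness checks during a slow model load.
# Tune failureThreshold * periodSeconds to your real startup time.
startupProbe:
  httpGet:
    path: /healthz
    port: 5000
  failureThreshold: 30      # allow up to 30 * 10s = 5 minutes to start
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 5000
  periodSeconds: 10
```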
I'd suggest swapping Minikube for either k3d or kind; the networking layer in Minikube can create misconceptions about how things really work in a true cluster. Rather than Ollama on your host, look into running vLLM or KServe inside the cluster, coupled with a Horizontal Pod Autoscaler driven by KEDA on queue depth, which is closer to standard production patterns for LLM serving. Adding OpenTelemetry with a collector sidecar will also help you gather real inference-latency metrics.
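For the KEDA piece, the pattern is a ScaledObject with a Prometheus trigger on a queue-depth metric; KEDA then manages the HPA for you. A rough sketch, where the Prometheus address and the query (vLLM's waiting-requests metric) are placeholders for whatever your stack actually exposes:

```yaml
# Sketch: scale a vLLM Deployment on queue depth via KEDA + Prometheus.
# serverAddress and the query are placeholders for your monitoring setup.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm                # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(vllm:num_requests_waiting)   # queue-depth metric
      threshold: "10"         # scale out when >10 requests are waiting
```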
Ok sure, thanks for the feedback!

Hey, this is my first project related to K8s deployment. For the next one, I plan to try EKS. Thanks for the feedback!