I'm working at a mid-sized company that runs most of our infrastructure on Kubernetes (mainly EKS). Our security team has given the thumbs up for us to use an AI coding assistant, but only if we can host it ourselves, ensuring no data leaves our network. I've been looking into what it takes to set this up, and it's turning out to be more complicated than I thought. The assistant requires GPU nodes for inference, so we need to deal with the NVIDIA device plugin, establish resource quotas for GPU usage, and likely set up dedicated node pools to prevent interference with production services. I'm really curious if anyone has experience with this kind of setup, specifically regarding:
- How you managed GPU scheduling and resource allocation
- Whether you created a dedicated namespace or opted for a separate cluster entirely
- What the real resource requirements are (like how many GPUs for about 200 developers)
- How you handle model updates and versioning
- Any latency issues that impacted developer experience
I know there are cloud-hosted options available, but we can't consider that route. I'd appreciate any insights or experiences about the operational overhead involved in on-prem deployment.
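For context, the kind of per-namespace GPU cap I was planning to start from looks like this (namespace name and the GPU count are placeholders, not a recommendation):

```yaml
# Hypothetical starting point: cap total GPU requests
# in the assistant's namespace so it can't starve other workloads.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-assistant   # placeholder namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"
```

The dedicated node pool would then be tainted so only pods with a matching toleration schedule onto the GPU nodes.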
6 Answers
Just a heads up: if you're looking at spending close to a million bucks on eight GB200s, you still might not get the performance you want for a cutting-edge model. Smaller models that fit on one GPU just don’t stack up against the full commercial AI coding tools. And I find it kind of funny that you're running on EKS while insisting no code leaves the network. We have 12 H100 GPUs running vLLM, and operationally it’s pretty straightforward once everything is configured. We use Slurm for management, and our nodes are allocated statically, so there's no on-the-fly scaling.
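For reference, a setup like this usually boils down to a single vLLM launch pinned to static nodes. Something along these lines (the model name and flag values are illustrative, not our exact config):

```shell
# Illustrative only: serve an open-weights model across 4 GPUs
# with tensor parallelism; Slurm pins the job to a static node.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --port 8000
```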
It's true; every developer wants their own A100 card! Or you could go for the budget-friendly option and hire more developers instead.
Have you considered whether the operational overhead really justifies self-hosting? We looked into it, and the total cost of ownership (TCO) for self-hosting was way higher than just going for an enterprise cloud plan that had proper contractual data protections. Unless you're in a highly sensitive field where air-gaps are a must, having professional legal agreements with a cloud option might be the more practical route.
We set this up around eight months ago! We created a dedicated node pool with 4 A100 GPUs in a separate namespace. Instead of managing the NVIDIA device plugin manually, we used the NVIDIA GPU Operator, which made things way easier. The resource needs really depend on the size of the model and how many users are accessing it at once. For about 150 developers, four GPUs worked well, since not everyone is hitting inference simultaneously; peak concurrent usage is typically around 20-30% of our developer count.
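To put rough numbers on that, here's the back-of-envelope sizing we used. The per-GPU concurrency figure is an assumption drawn from our own load tests, not a general rule, so treat this as a sketch:

```python
import math

def gpus_needed(developers: int, peak_fraction: float, concurrent_per_gpu: int) -> int:
    """Rough sizing: peak concurrent users divided by how many
    concurrent requests one GPU can serve at acceptable latency."""
    peak_users = developers * peak_fraction
    return math.ceil(peak_users / concurrent_per_gpu)

# ~150 devs, 25% peak concurrency, ~10 concurrent requests per GPU (assumed)
print(gpus_needed(150, 0.25, 10))  # -> 4

# For the ~200 developers in the question, the same assumptions give:
print(gpus_needed(200, 0.25, 10))  # -> 5
```

The per-GPU concurrency number is the variable worth measuring yourself; it swings widely with model size and context length.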
For managing model updates and versioning, we handle model files like any other container image. They’re uploaded to our internal registry with semantic versioning, and we perform rolling updates using Kubernetes deployments. Since the models can be several gigabytes, make sure your registry and nodes have ample storage. We hit disk pressure alerts the first week because we didn’t account for keeping earlier versions around during rollouts.
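One thing that would have saved us that first week: budgeting disk for both image versions coexisting during a rollout. A trivial sketch of the math (the sizes are made-up illustrative numbers):

```python
def rollout_disk_gb(model_image_gb: float, versions_on_node: int = 2) -> float:
    """During a rolling update the old and new model images coexist
    on each node, so budget disk for at least two full copies."""
    return model_image_gb * versions_on_node

def registry_disk_gb(model_image_gb: float, versions_retained: int) -> float:
    """The registry holds every retained version at close to full size;
    multi-gigabyte model layers rarely dedupe across versions."""
    return model_image_gb * versions_retained

# e.g. a 30 GB model image, 3 retained versions (illustrative numbers)
print(rollout_disk_gb(30))      # -> 60
print(registry_disk_gb(30, 3))  # -> 90
```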
Are you thinking about multitenancy? Like, would different teams want to control their GPU loads, or are you planning to offer this as a service/API within your company?
