Has anyone successfully deployed AI coding tools on-prem in their Kubernetes clusters?

Asked By CreativeCoder99 On

I'm working at a mid-sized company that runs most of our infrastructure on Kubernetes (mainly EKS). Our security team has given the thumbs up for us to use an AI coding assistant, but only if we can host it ourselves, ensuring no data leaves our network. I've been looking into what it takes to set this up, and it's turning out to be more complicated than I thought. The assistant requires GPU nodes for inference, so we need to deal with the NVIDIA device plugin, establish resource quotas for GPU usage, and likely set up dedicated node pools to prevent interference with production services. I'm really curious if anyone has experience with this kind of setup, specifically regarding:
- How you managed GPU scheduling and resource allocation
- Whether you created a dedicated namespace or opted for a separate cluster entirely
- What the real resource requirements are (like how many GPUs for about 200 developers)
- How you handle model updates and versioning
- Any latency issues that impacted developer experience
I know there are cloud-hosted options available, but we can't consider that route. I'd appreciate any insights or experiences about the operational overhead involved in on-prem deployment.
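To make the quota and node-pool pieces of the question concrete, here is a minimal sketch of the two Kubernetes objects involved: a ResourceQuota capping GPU requests in a namespace, and the pod-spec fragment that pins inference pods onto a tainted GPU node pool. The namespace, labels, and image name are placeholders, not anything from a real deployment.

```python
import json

# Hypothetical names throughout (namespace, pool label, image); adjust to your cluster.
NAMESPACE = "ai-assistant"

# ResourceQuota capping total GPU requests in the assistant's namespace.
# "requests.nvidia.com/gpu" is the extended-resource key exposed by the
# NVIDIA device plugin.
gpu_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "gpu-quota", "namespace": NAMESPACE},
    "spec": {"hard": {"requests.nvidia.com/gpu": "4"}},
}

# Pod-spec fragment: a nodeSelector plus a toleration for the GPU taint keeps
# inference pods on the dedicated pool and off production nodes.
inference_pod_spec = {
    "nodeSelector": {"node-pool": "gpu-inference"},
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ],
    "containers": [
        {
            "name": "inference",
            "image": "registry.internal/ai-assistant:latest",  # placeholder
            "resources": {
                # For extended resources like GPUs, requests must equal limits.
                "requests": {"nvidia.com/gpu": "1"},
                "limits": {"nvidia.com/gpu": "1"},
            },
        }
    ],
}

print(json.dumps(gpu_quota, indent=2))
```

Rendering the objects as dicts and dumping them is just for illustration; in practice these would be YAML manifests applied with kubectl or templated through Helm.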

6 Answers

Answered By SeriousTechGiant On

Just a heads up: if you're looking at spending close to a million bucks for eight GB200s, you might still not get the performance you want out of a cutting-edge model. Smaller models that fit on one GPU just don't stack up against the full commercial AI coding tools. And I find it kind of funny that you're running on EKS while insisting no code leaves the network. We have 12 H100 GPUs serving with vLLM, and operationally it's pretty straightforward once everything is configured. We use Slurm for management, and our nodes are allocated statically, so there's no on-the-fly scaling.

Answered By DevGuru007 On

It's true; every developer wants their own A100 card! Or you could go for the budget-friendly option and hire more developers instead.

Answered By PracticalITPro On

Have you considered whether the operational overhead really justifies self-hosting? We looked into it, and the total cost of ownership (TCO) for self-hosting was way higher than just going for an enterprise cloud plan that had proper contractual data protections. Unless you're in a highly sensitive field where air-gaps are a must, having professional legal agreements with a cloud option might be the more practical route.

Answered By TechieTribe42 On

We set this up around eight months ago! We created a dedicated node pool with 4 A100 GPUs and run the workload in its own namespace. Instead of managing the NVIDIA device plugin manually, we used the NVIDIA GPU Operator, which made things way easier. The resource needs really depend on the size of the model and how many users are hitting it at once. For about 150 developers, four GPUs worked well, since not everyone is running inference simultaneously; peak concurrent usage is typically around 20-30% of our dev count.
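The peak-concurrency heuristic in this answer turns into a simple back-of-envelope calculation. The concurrency ratio comes from the answer itself; the per-user request rate and per-GPU throughput below are illustrative assumptions, so measure your own workload before buying hardware.

```python
import math

def gpus_needed(developers: int,
                peak_concurrency: float = 0.3,   # fraction active at peak (from the answer)
                reqs_per_user_min: float = 2.0,  # assumed completions/min per active user
                reqs_per_gpu_min: float = 25.0   # assumed sustained throughput per GPU
                ) -> int:
    """Rough GPU count for an on-prem coding-assistant deployment."""
    active_users = developers * peak_concurrency
    demand = active_users * reqs_per_user_min
    return max(1, math.ceil(demand / reqs_per_gpu_min))

print(gpus_needed(150))  # -> 4, consistent with the answer's four GPUs for ~150 devs
```

Under these assumptions the asker's ~200 developers would land around 5 GPUs, but the throughput number is the sensitive variable: it varies a lot with model size, quantization, and batch settings.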

Answered By ModelMaster99 On

For managing model updates and versioning, we handle model files like any other container image. They’re uploaded to our internal registry with semantic versioning, and we perform rolling updates using Kubernetes deployments. Since the models can be several gigabytes, ensure your registry and nodes have ample storage. We faced disk pressure alerts the first week because we didn’t consider that we needed to keep earlier versions around during rollouts.
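The disk-pressure problem described above comes down to sizing node and registry storage for every retained version plus the extra copy pulled during a rollout. A quick sketch, with assumed model sizes and an assumed headroom factor; substitute your real numbers.

```python
def node_storage_gb(model_gb: float,
                    versions_retained: int,
                    rollout_overlap: int = 1,  # extra copy present mid-rollout
                    headroom: float = 1.5      # assumed slack for image layers/filesystem
                    ) -> float:
    """Rough per-node storage needed to hold model images through rolling updates."""
    return model_gb * (versions_retained + rollout_overlap) * headroom

# e.g. a 15 GB model, keeping 2 prior versions, needs ~67.5 GB per node
print(node_storage_gb(model_gb=15, versions_retained=2))
```

The rollout_overlap term is the part that bit this answer's authors: during a rolling update both the old and new versions exist on the node at once, so sizing for only the retained versions undercounts.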

Answered By InfrastructureNinja On

Are you thinking about multitenancy? Like, would different teams want to control their GPU loads, or are you planning to offer this as a service/API within your company?
