Has anyone deployed AI coding tools on-prem in Kubernetes clusters?

Asked By CuriousCoder24

I'm working at a mid-sized company that runs most of our infrastructure on Kubernetes (EKS), and our security team has approved an AI coding assistant on the condition that it's self-hosted, with no code leaving our network. I've started digging into the setup, and it's more involved than I anticipated. The tool requires GPU nodes for inference, which raises challenges like configuring the NVIDIA device plugin, managing resource quotas for GPU time, and potentially carving out dedicated node pools so inference workloads don't interfere with our production services.
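For the quota piece, here's roughly the shape I'm sketching out; the namespace name and the 4-GPU cap are placeholders I haven't validated, not a recommendation:

```yaml
# Namespace-scoped cap on GPU requests for the assistant's workloads.
# "ai-assistant" and the cap of 4 are placeholder values.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-assistant
spec:
  hard:
    # Total nvidia.com/gpu that pods in this namespace may request.
    requests.nvidia.com/gpu: "4"
```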

I'm curious to hear if anyone else has gone through this process. Specifically, I'd love to know:
- How did you manage GPU scheduling and resource allocation?
- Did you opt for a dedicated namespace or a separate cluster altogether?
- What are the actual resource requirements, say for around 200 developers?
- How do you handle model updates and versioning?
- Did you face any latency issues that impacted the developer experience?

I know that some of these tools offer cloud-hosted solutions, but that's not an option for us. Any insights on the operational overhead involved in this on-prem deployment would be greatly appreciated.

4 Answers

Answered By DataNinja13

You might want to consider whether the operational overhead is really worth it versus using a tool that offers a SaaS option with a BAA or DPA. We evaluated self-hosting and found the total cost of ownership (TCO) was significantly higher than going with a cloud plan that has legal protections in place. Unless you're in the defense sector or have strict air-gap requirements, the cloud route may be more practical!

Answered By ModelMaven42

For handling model updates and versioning, we treat model artifacts like standard container images. They get pushed to our internal registry using semantic versioning, with rolling updates managed through Kubernetes deployments. Just be aware that the models tend to be large (several GB), so ensure your registry and nodes have enough storage available. We encountered disk pressure alerts early on because we didn’t plan for keeping the previous model version cached during rollouts.
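To make that concrete, the Deployment looks roughly like the sketch below; the registry host, image name, and version tag are illustrative stand-ins, not our real values:

```yaml
# Illustrative inference Deployment; a model release is just a tag bump.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: code-llm-inference
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # new pod comes up before an old one is killed...
      maxUnavailable: 0  # ...so developers never see a dead endpoint
  selector:
    matchLabels:
      app: code-llm-inference
  template:
    metadata:
      labels:
        app: code-llm-inference
    spec:
      containers:
        - name: inference
          # Model weights baked into the image, tagged with semver.
          image: registry.internal.example/models/code-llm:1.4.2
          resources:
            limits:
              nvidia.com/gpu: 1
```

The surge setting is also where the disk-pressure lesson bit us: during a rollout a node can end up caching both the old and new multi-GB images, so size node disks for at least two model images.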

Answered By TechGuru88

We did this around 8 months ago with a dedicated node pool of 4x A100 GPUs in a separate namespace. Instead of managing the NVIDIA device plugin manually, we used the NVIDIA GPU Operator, which made lifecycle management a lot easier. Resource requirements really depend on the model size and concurrent usage. For about 150 developers, 4 GPUs turned out to be enough because not everyone hits inference at the same time; concurrent usage usually peaks at about 20-30% of our dev count during busy periods.
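If it helps, the scheduling side ends up looking roughly like this; the node label and taint key below are examples, since yours will come from however your node pool is configured:

```yaml
# Minimal pod sketch pinned to the dedicated GPU pool.
# The node-pool label, taint key, and image are example values.
apiVersion: v1
kind: Pod
metadata:
  name: inference-smoke-test
spec:
  nodeSelector:
    node-pool: gpu-inference   # label applied to the dedicated pool
  tolerations:
    - key: nvidia.com/gpu      # tolerate the taint keeping other pods off
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: our-registry.example/code-llm:1.0.0
      resources:
        limits:
          nvidia.com/gpu: 1    # advertised by the GPU Operator's device plugin
```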

Answered By DevWhisperer

Isn’t it a bit ambitious to think every dev will get their own A100 card? You might need to consider building a "budget-friendly" version if you're planning to hire more developers.
