I'm looking to set up a job scheduling system that allows multiple users to train their machine learning models in isolated environments. I want to use Kubernetes to dynamically scale my EC2 GPU instances based on demand. Has anyone implemented something like this? Any advice or experiences to share?
4 Answers
We've implemented a similar setup using Karpenter, which made node provisioning pretty easy. Karpenter watches for pods that can't be scheduled and scales the instances up and down for you. It's great because you can constrain instance types per workload, for example with a node selector on the pod spec. We supported both CPU and GPU workloads, and it stayed efficient even with several hundred instances being spun up and down. If your job fits on an existing instance, it starts almost instantly; otherwise, provisioning a new one takes just a minute or two.
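To make the instance-type constraint concrete, here is a minimal sketch of a Job manifest built in Python. The function name, job name, image, and the `g5.xlarge` default are all made up for illustration, and it assumes the NVIDIA device plugin is exposing `nvidia.com/gpu` on your GPU nodes. It is not necessarily how the setup above was configured.

```python
import json


def gpu_training_job(name, image, instance_type="g5.xlarge", gpus=1):
    """Build a batch/v1 Job manifest pinned to one EC2 instance type.

    All argument values are placeholders; swap in your own.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Well-known node label; a node autoscaler such as
                    # Karpenter can use it to pick a matching instance.
                    "nodeSelector": {
                        "node.kubernetes.io/instance-type": instance_type
                    },
                    "containers": [
                        {
                            "name": "train",
                            "image": image,
                            # GPUs are requested through resource limits.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = gpu_training_job("alice-run-1", "example.com/train:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. If no node of that type has a free GPU, the pod stays pending until the autoscaler brings one up.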
Yes, you can set it up to run machine learning jobs at specific times. Jobs can be managed per user or per team, and each person can submit multiple job requests, each of which runs as its own separate training session. That way, you keep jobs orderly and resource allocation under control.
You should definitely check out the Cluster Autoscaler for handling node scaling. Also, NVIDIA's GPU Operator is a game changer. We initially set up a PostgreSQL server as a makeshift queue to manage demand, but later transitioned to Deadline for our needs. Queueing jobs like this lets users submit their requests without having to manage the underlying infrastructure themselves, so you end up with a mini platform that gives you better control. Don't forget to use tools like Argo or Flux for continuous delivery to keep everything running smoothly!
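For anyone wondering what a database-backed queue like that might look like, here is a rough sketch. It uses SQLite from the Python standard library as a stand-in for PostgreSQL so it runs anywhere. The table layout and function names are invented for illustration and are not the schema from the answer above. On PostgreSQL with several workers you would typically claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` instead of the simple select-then-update shown here.

```python
import sqlite3


def init_queue(conn):
    """Create the (hypothetical) jobs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "owner TEXT NOT NULL, "
        "spec TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'queued')"
    )


def submit(conn, owner, spec):
    """Enqueue a job request and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO jobs (owner, spec) VALUES (?, ?)", (owner, spec)
        )
    return cur.lastrowid


def claim_next(conn):
    """Mark the oldest queued job as running and return it, or None if empty."""
    with conn:
        row = conn.execute(
            "SELECT id, owner, spec FROM jobs "
            "WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_queue(conn)
    submit(conn, "alice", "train resnet50")
    submit(conn, "bob", "finetune bert")
    print(claim_next(conn))
```

A small worker loop would call `claim_next`, turn the row into a Kubernetes Job, and let the node autoscaler react to the pending pod.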
I’ve also worked with Karpenter for scaling node pools as needed for GPU access, and it worked quite well. Recently, I discovered Kueue, which looks promising for managing job queues in Kubernetes. It might have what you need for your workloads, so check it out!
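As a rough idea of how a Job gets handed to Kueue: you label it with the name of a LocalQueue and create it suspended, and Kueue unsuspends it once quota is available. The helper below is a hypothetical sketch. The function name, queue name, and job contents are made up, and it assumes a LocalQueue with that name already exists in the namespace.

```python
import copy


def submit_via_kueue(job, local_queue):
    """Return a copy of a batch/v1 Job manifest prepared for Kueue admission."""
    queued = copy.deepcopy(job)  # leave the caller's manifest untouched
    labels = queued.setdefault("metadata", {}).setdefault("labels", {})
    # Label that tells Kueue which LocalQueue should admit this Job.
    labels["kueue.x-k8s.io/queue-name"] = local_queue
    # Create the Job suspended; Kueue flips this once it is admitted.
    queued.setdefault("spec", {})["suspend"] = True
    return queued


if __name__ == "__main__":
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "alice-run-1"},
        "spec": {"template": {"spec": {"restartPolicy": "Never"}}},
    }
    print(submit_via_kueue(job, "team-a-queue"))
```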
Nice! Just curious, when you have multiple trainings on the same instance, how do you maintain isolation to prevent data leakage between users?