I'm looking to implement a job scheduling system that allows multiple users to train their machine learning models in isolated environments. My plan is to use Kubernetes to dynamically scale my EC2 GPU instances based on demand. Has anyone tackled a similar setup before?
4 Answers
For per-user training jobs, Karpenter can scale node pools in response to unschedulable pods that need GPU access. That said, I recently found Kueue, which offers a more purpose-built approach to queueing and admitting GPU jobs in these scenarios.
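To sketch how Kueue changes the submission flow: users submit an ordinary batch Job pointed at a LocalQueue, and Kueue holds it suspended until quota is available. The job name, queue name, and image below are illustrative; the `kueue.x-k8s.io/queue-name` label is Kueue's standard one, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed.

```yaml
# Illustrative only: a training Job submitted through a Kueue LocalQueue.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet                           # hypothetical job name
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue    # LocalQueue to admit through
spec:
  suspend: true            # Kueue un-suspends the Job once it is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1   # exposed by the NVIDIA device plugin
```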
Yeah, absolutely! You might want to check out cluster-autoscaler for node scaling, and NVIDIA's gpu-operator worked wonders for us. Initially we used Postgres as a makeshift queue to manage workload demand, but later switched to Deadline since we were in a VFX environment. Setting up a queuing layer lets developers submit jobs without needing to handle the backend intricacies, giving them a user-friendly platform to work with.
We actually implemented something like this! Karpenter was a game changer for us—it simplified everything to just creating and destroying K8s deployments while automatically managing the underlying instances. You can even impose specific constraints to ensure the right instance types are used. In practice, it worked remarkably well, supporting both training and model deployment. We had no issue managing a couple of hundred instances at once.
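To illustrate the "specific constraints" point: with Karpenter you declare which instance types a pool may provision as NodePool requirements. A rough sketch, assuming Karpenter's v1 API on AWS (field names differ in older releases, and the pool name, instance types, and limits here are made up):

```yaml
# Illustrative Karpenter NodePool restricted to particular GPU instance types.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training                  # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                 # assumes an EC2NodeClass named "default"
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]   # example GPU instance types
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    nvidia.com/gpu: "16"              # cap total GPUs this pool may provision
  disruption:
    consolidationPolicy: WhenEmpty    # tear nodes down once jobs finish
```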
Don’t forget to check out Slurm, and Slinky (SchedMD's Slurm-on-Kubernetes project), if you're exploring options for job scheduling and resource management.