Hi folks! I'm working on boosting the development efficiency for a large ML engineering team tasked with data processing and reinforcement learning training on massive models (200B+ parameters). Their projects often take a few days to finish, and they also run numerous short processes to validate different training algorithms.
Here's the challenge: our research environments are incredibly diverse. Currently, we rely on Docker images provided by our infrastructure team, but these images have become unwieldy (around 40-80GB) and take about 40-60 minutes to spin up. Everyone is hesitant to take them apart and rebuild them with new libraries. Meanwhile, the teams are eager to adopt newer frameworks like Megatron and Transformer Engine, and they'd like a unified Docker image that updates nightly.
We've previously tried using conda on a shared CephFS, but it turned out to be problematic. Some core libraries couldn't be installed, leading to fragile installations with build errors and a polluted shared environment.
To tackle these issues, we've begun experimenting with **uv**, and the initial results seem promising.
1. Config-based environments: With a simple pyproject.toml, uv helps us define CUDA, custom repositories, and build dependencies in a much cleaner way than we've seen with conda.
2. Fast installs: Thanks to uv's cache system, we can install over 350 packages in under 10 seconds, and Docker images have shrunk tremendously.
3. Ray integration: Since many RL frameworks already run on Ray, uv fits in well and lets different jobs on the same cluster use distinct environments.
4. Stability: We've encountered some bugs related to Ray, but most have been easily fixed.
That said, we're still in the early phases and have concerns about long-term stability, cache management, and best practices for multi-user setups.
I'd love to hear from anyone who has experience using uv in similar environments or any other advice, warnings, or alternate strategies you might have!
3 Answers
Curious about how you set up the cache. Is it integrated into your CI environments?
Have you checked out pixi? It’s developed by the same team that created Mamba and could offer some alternative solutions!
This sounds really interesting! While I’m not in the exact same field, I’ve been using uv for package management in my own CI/CD Docker containers, and I can see how beneficial the `uv run --with` feature can be for quick iterations. It’s a game changer!

Exactly! Frameworks like vLLM change significantly between releases, and particular models often require a specific version. `uv run --with` has been essential for us, and it even let us define CUDA environments using pyproject.toml. We tried removing CUDA from our Docker image entirely and everything still ran smoothly!