I've been exploring FSx for Lustre and how it might fit into my machine learning workflows. From what I've gathered, FSx isn't always used efficiently, especially when data transfer is involved. With 2TB of image data stored in S3, transferring and decompressing it onto the filesystem could slow down my training jobs significantly. I'm looking for advice on a few points:
1. Where do people typically store their ML training data? If I'm working with JPEG images needing a high number of IOPS, what's a recommended approach?
2. FSx filesystems are created when a training job starts, so why not use EBS instead? If each of N nodes needs roughly 125 MB/s, isn't it easier to provision one EBS volume per node?
3. Do researchers use the same storage solutions for development and actual training jobs, or is there a divergence in approach?
Any insights or general trends in the industry would be super helpful!
3 Answers
I've found FSx for Lustre isn't ideal for ML. We benchmarked it at Microsoft, and while lazy file loading sounds good on paper, the actual time to hydrate a large dataset from S3 can be a deal breaker: 30-45 minutes just for 2TB is no joke!
Most teams I’ve seen create a hybrid solution: keeping frequently accessed datasets on EFS (even with the IOPS costs) while using S3 for storage. Some companies pre-stage their training data onto NVMe drives attached to GPUs for much faster access. It's not perfect, but definitely better than relying on network-attached storage.
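For the pre-staging approach, the usual pattern is a parallel copy from object storage onto the local NVMe mount before training starts. Here's a rough stdlib sketch of the fan-out; the paths are placeholders, and in practice the `shutil.copy` would be a boto3 `download_file` or an `aws s3 sync`:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def stage_dataset(src_dir, nvme_dir, workers=16):
    """Copy a dataset onto local NVMe with a thread pool.

    Parallelism matters here: a single-threaded copy rarely
    saturates either the network or the NVMe device.
    Flattens everything into one directory for simplicity.
    """
    dst = Path(nvme_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = [p for p in Path(src_dir).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # For S3 this call would be s3.download_file(bucket, key, local_path)
        list(pool.map(lambda p: shutil.copy(p, dst / p.name), files))
    return len(files)
```

The payoff is that once training starts, every epoch reads from local disk instead of going back over the network.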
As for separate systems for dev and prod, in practice many teams just use the same S3 buckets, which risks accidental overwrites. Some enforce versioning, though that's still not common practice.
After spending time at AWS on their EFS and FSx teams, I can tell you that many people store their ML training data in S3. It's commonly referred to as the "source of truth." For IOPS concerns, some combine small JPEG files into larger files, minimizing the number of S3 requests.
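To make that concrete, the "combine small files" trick can be as simple as packing images into fixed-size tar shards, which is the layout webdataset-style pipelines consume. A minimal stdlib sketch, with the shard size and paths as placeholder assumptions:

```python
import tarfile
from pathlib import Path

def pack_shards(image_dir, out_dir, images_per_shard=1000):
    """Pack small image files into sequential tar shards.

    Reading one large tar sequentially replaces thousands of
    per-file S3 GETs with a handful of large streaming reads.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    images = sorted(Path(image_dir).glob("*.jpg"))
    shards = []
    for i in range(0, len(images), images_per_shard):
        shard_path = out / f"shard-{i // images_per_shard:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for img in images[i : i + images_per_shard]:
                tar.add(img, arcname=img.name)
        shards.append(shard_path)
    return shards
```

You'd run this once as a preprocessing step and upload the shards back to S3, so the training job only ever sees a few hundred big objects instead of millions of tiny ones.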
As for EBS versus FSx, you've got a point about EBS being straightforward, but remember that each node pulls data separately, which leads to delays if you're not careful. FSx's S3 integration can help with that.
And regarding development versus production storage, it really varies. Some teams use the same setup, while others maintain separate environments to avoid mix-ups, which is always a risk with shared buckets!
I like the idea of sharding with something like webdataset, but if your data needs tweaking afterwards, it can be a real pain. Do you think FSx is still slow mainly because of the per-file requests to S3?
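For what it's worth, the read side of sharding is what makes the difference: you stream members out of the tar in storage order instead of issuing one request per image. Roughly what webdataset does under the hood, sketched with the stdlib (the shard path is hypothetical):

```python
import tarfile

def iter_shard(shard_path):
    """Yield (name, bytes) samples from a tar shard in storage order.

    Sequential reads like this turn many small random IOPS into
    one large streaming read, which both S3 and Lustre handle well.
    """
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            if member.isfile():
                f = tar.extractfile(member)
                yield member.name, f.read()
```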
I actually moved away from FSx for similar reasons: it was pricey to keep running, and spinning it up on demand took forever. That said, you can link S3 buckets to FSx, which streamlines things a bit and gives you direct access to files without constantly transferring data around.
Totally get that! And if you switch zones, you also have to deal with extra data transfer costs. What alternative solution did you go with?

I tried out EFS too, but it was so slow for loading data that I ditched it. Your plan to keep cold storage in S3 and spin up performance storage as needed sounds solid. Plus, local NVMe drives are a game-changer if you can set them up that way.