I've been diving into the capabilities of FSx for Lustre and I'm curious about its use in machine learning environments. Specifically, I'm wondering whether provisioning FSx on demand is too slow for large datasets, such as 2 TB of image data stored in S3, since copying and unzipping that data onto the filesystem before each training job seems inefficient. I have some questions I'd love insights on:
1. Where do people typically store their machine learning training data, especially image files that require high IOPS?
2. If FSx filesystems are provisioned at the start of training jobs, wouldn't it make more sense to use EBS? If I have multiple nodes running a job and each needs about 125 MB/s, couldn't I just provision EBS volumes instead?
3. Do researchers use the same storage services for development as they do for actual training jobs?

Any guidance on these would be greatly appreciated!
1 Answer
I faced similar frustrations with FSx. It felt too pricey to keep running all the time, and provisioning it on demand was too slow. However, you can link an S3 bucket to the FSx for Lustre filesystem as a data repository, which exposes your S3 objects through a POSIX interface and loads file contents lazily on first access, so you don't have to copy and unzip the whole dataset up front. Have you explored that option yet?
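As a minimal sketch of that S3 linking, here's the AWS CLI call for creating a scratch filesystem with an import path (the subnet ID and bucket name are placeholders, and 2400 GiB is just one valid capacity for a ~2 TB dataset; check current AWS docs for deployment types and sizing):

```shell
# Create a SCRATCH_2 FSx for Lustre filesystem linked to an S3 bucket.
# Objects in the bucket then appear as POSIX files on the mounted filesystem,
# with contents pulled from S3 lazily on first read.
# NOTE: subnet ID and bucket name below are hypothetical placeholders.
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 2400 \
  --subnet-ids subnet-0123456789abcdef0 \
  --lustre-configuration \
      DeploymentType=SCRATCH_2,ImportPath=s3://my-training-data
```

Scratch capacities come in fixed increments (1200 GiB, then multiples of 2400 GiB), so you size up to the next step above your dataset. Training nodes then mount the filesystem with the Lustre client and read the data like any local directory.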

Exactly! The costs can skyrocket if you're constantly bringing the filesystem up and down, and you might also hit unexpected cross-AZ data transfer fees if your compute runs in a different Availability Zone than the filesystem.
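On question 2 above, some rough arithmetic helps frame the EBS-vs-FSx tradeoff. A minimal sketch, assuming 4 nodes at the 125 MB/s per node from the question, a gp3 baseline of 125 MB/s per volume, and roughly 200 MB/s per TiB of baseline throughput for a SCRATCH_2 filesystem (verify these figures against current AWS docs):

```shell
# Aggregate throughput the job needs across all nodes (assumed: 4 nodes).
nodes=4
per_node_mbs=125
aggregate=$((nodes * per_node_mbs))
echo "aggregate need: ${aggregate} MB/s"        # 500 MB/s

# FSx SCRATCH_2 baseline throughput scales with capacity (~200 MB/s per TiB).
capacity_gib=2400
fsx_mbs=$((capacity_gib * 200 / 1024))
echo "FSx baseline: ${fsx_mbs} MB/s (shared by all nodes)"
```

The catch with EBS is that a volume attaches to one instance, so each node would need its own 2 TB copy of the dataset (plus the copy/unzip step per node), whereas a single FSx filesystem is shared by every node and serves the aggregate throughput from one copy.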