I'm looking for a way to utilize my local SSDs with an EC2 instance for training models. I have around 200TB of data, but I only need to access roughly 1GB at a time for batch processing. My goal is to keep the bulk of this data on my local drives to avoid the steep costs and privacy concerns associated with AWS storage solutions. I'm aware that there will be some latency when loading each batch from my local SSD to EC2, but that's acceptable for my needs. Can anyone advise if this setup is possible or suggest alternative methods to manage this without relying heavily on S3?
5 Answers
Are you planning to start your EC2 instance, process just 1GB of data, and then stop the instance? If that’s the case, you might need to upload your data to S3 in 1GB chunks every time, which sounds tedious.
For large datasets, it's way better to keep your data close to your compute resources. Since you only need 1GB at a time, using an EC2 instance with local NVMe storage is a good plan. This way, you can transfer each batch straight to the instance before running your computations.
I’d love to avoid S3, and it feels like using ephemeral storage might be the answer. I want to process small batches efficiently without maintaining all the data in the cloud.
If you have a NAS, you can connect it to your VPC over a VPN. That could keep your drives accessible without unnecessarily pushing everything to AWS.
You can't really use a local SSD directly with EC2 for storage. The best approach is to choose the specific data you want to upload to AWS for processing. Trying to access your drives over a VPN will just slow everything down.
That was my suspicion. Just hoping for a workaround that might exist.
Consider setting up EBS encryption for better privacy. You can upload small chunks, and while one is being processed, upload another. It's a more manageable approach!
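To make the "process one chunk while the next one uploads" idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: `prefetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever actually moves a chunk from your local drives to the instance (scp, rsync, an HTTP pull, and so on). A background thread fetches ahead while the main loop works on the current chunk.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(chunk_ids, fetch, depth=1):
    """Yield (chunk_id, data) pairs in order, fetching ahead in a background thread.

    The queue holds at most `depth` finished chunks; one more may be in flight,
    so local disk use stays bounded to a couple of chunks.
    """
    buf = queue.Queue(maxsize=depth)

    def worker():
        try:
            for cid in chunk_ids:
                buf.put((cid, fetch(cid), None))  # blocks while the queue is full
        except Exception as exc:
            buf.put((None, None, exc))  # hand the error to the consumer
        buf.put(_DONE)

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        cid, data, err = item
        if err is not None:
            raise err
        yield cid, data


def fake_fetch(chunk_id):
    """Hypothetical stand-in for the real transfer from home storage."""
    return chunk_id.encode()


# Usage: chunk N+1 is being fetched while chunk N is processed.
for cid, data in prefetch(["chunk-0000", "chunk-0001", "chunk-0002"], fake_fetch):
    pass  # train on `data` here, then delete it from local disk
```

With this shape, the transfer time is hidden behind the training step as long as fetching a chunk is no slower than processing one. If the link is the bottleneck, the loop simply waits on the queue.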
To clarify, I’m going to train a model on 1GB batches sequentially until I get through all 200TB. I need a way to handle the data without keeping much of it in S3.
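For a sense of scale, a rough back-of-envelope calculation helps. The sketch below uses the 200TB and 1GB figures from the question; the 1 Gbit/s uplink is purely an assumed number (plug in your own), and decimal units are assumed (1TB = 10^12 bytes).

```python
TOTAL_BYTES = 200 * 10**12      # 200TB, from the question
BATCH_BYTES = 1 * 10**9         # 1GB per batch, from the question
UPLINK_BITS_PER_S = 1 * 10**9   # ASSUMPTION: sustained 1 Gbit/s upload from home


def transfer_seconds(n_bytes, bits_per_s):
    """Ideal transfer time, ignoring protocol overhead and VPN encryption cost."""
    return n_bytes * 8 / bits_per_s


batches = TOTAL_BYTES // BATCH_BYTES
per_batch_s = transfer_seconds(BATCH_BYTES, UPLINK_BITS_PER_S)
total_days = transfer_seconds(TOTAL_BYTES, UPLINK_BITS_PER_S) / 86_400

print(f"{batches} batches, {per_batch_s:.0f} s per batch, {total_days:.1f} days in total")
```

Under that assumption, the link alone needs about 8 seconds per batch and roughly 18.5 days of continuous transfer for one full pass, so whether streaming from home is workable depends mostly on how the per-batch training time compares to the per-batch transfer time.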