Hey everyone! I'm working with a series of local hard drives holding about 200TB of data, but I only need to access around 1GB of it at a time for model training on an EC2 instance. Storing all that data on AWS would not only be expensive (around $2K a month) but would also raise privacy and confidentiality concerns, so I'm looking for a way to keep the data local and just use the EC2 instance to process small batches. I know there will be latency when loading each batch from local storage to the EC2 instance and then clearing it out, but I'm willing to accept that trade-off. Is there a way to make this work, or are there better alternatives that avoid hefty S3 storage fees for data I won't need constantly? Thanks in advance!
5 Answers
You might want to look at EC2 instances that come with NVMe instance store (ephemeral) volumes. That gives you the fast local I/O you need: stage each batch onto the instance store, run your compute against it, then delete it. Just remember it's ephemeral storage, so anything on it disappears when the instance stops or terminates; plan to re-stage rather than rely on it persisting.
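A rough sketch of that stage/train/clean-up loop, assuming an instance-store volume mounted at /mnt/nvme and rsync over SSH back to your local box (the mount point, host name, and batch names are all placeholders, not anything from your setup):

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical staging loop: every path and name here is a placeholder.
SCRATCH = Path("/mnt/nvme/scratch")              # assumed instance-store mount
REMOTE = "user@local-workstation:/data/batches"  # assumed rsync-reachable source

def stage_batch(batch_name: str) -> Path:
    """Pull one ~1GB batch onto the ephemeral NVMe volume."""
    dest = SCRATCH / batch_name
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(["rsync", "-a", f"{REMOTE}/{batch_name}/", f"{dest}/"], check=True)
    return dest

def cleanup(batch_dir: Path) -> None:
    """The instance store is wiped on stop anyway, but free space between batches."""
    shutil.rmtree(batch_dir, ignore_errors=True)

for name in ["batch_0001", "batch_0002"]:        # placeholder batch names
    path = stage_batch(name)
    # train_on(path)                             # your training step goes here
    cleanup(path)
```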
If privacy is the worry, consider staging onto an encrypted EBS volume. You could pull one chunk of data onto it, start processing, and pull the next chunk in parallel while the current one is being computed. Encryption at rest at least keeps the staged data protected while it sits on AWS.
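Here's a minimal sketch of that overlap (one chunk computing while the next one transfers). Again, the mount point, remote host, batch names, and the rsync-based pull are placeholder assumptions; swap in whatever transfer tool you actually use:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import shutil
import subprocess

# Assumed setup: an encrypted EBS volume mounted at /mnt/encrypted and batches
# reachable over SSH from the instance.
STAGE = Path("/mnt/encrypted/stage")
REMOTE = "user@local-workstation:/data/batches"

def upload(batch: str) -> Path:
    """Pull one chunk onto the encrypted volume."""
    dest = STAGE / batch
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(["rsync", "-a", f"{REMOTE}/{batch}/", f"{dest}/"], check=True)
    return dest

def process(path: Path) -> None:
    # train_on(path)                         # your training step goes here
    shutil.rmtree(path, ignore_errors=True)  # clear the chunk once it's consumed

batches = ["batch_0001", "batch_0002", "batch_0003"]  # placeholder names
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(upload, batches[0])
    for nxt in batches[1:]:
        ready = pending.result()            # wait until this chunk has landed
        pending = pool.submit(upload, nxt)  # start pulling the next chunk...
        process(ready)                      # ...while this one is being computed
    process(pending.result())               # last chunk
```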
Honestly, it sounds like you're stuck uploading data to AWS in some form. Storage Gateway is mostly meant for moving data into S3, not for giving EC2 direct access to your local storage. My thinking is you'd have to pick the data you need, upload it, and accept that routing it through a VPN may only add latency.
That was my hunch. Still hoping for a workaround, though!
Another thought: put your drives behind a NAS and connect it to your VPC over a VPN or through a Transit Gateway. It could save you a lot of trouble: the data stays local, but the EC2 instance can still reach it like any other network share.
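Assuming the NAS export ends up mounted on the instance (say over NFS at /mnt/nas; that path and the batch name are made up), the read path on the EC2 side could be as simple as:

```python
import shutil
from pathlib import Path

# Assumed layout: NAS export mounted at /mnt/nas over the VPN/Transit Gateway link,
# with a local scratch directory for the batch currently being trained on.
NAS_MOUNT = Path("/mnt/nas/datasets")
SCRATCH = Path("/tmp/current_batch")

def pull_batch(batch_name: str) -> Path:
    """Copy one ~1GB batch off the NAS so training reads local disk, not the WAN."""
    if SCRATCH.exists():
        shutil.rmtree(SCRATCH)                         # drop the previous batch
    shutil.copytree(NAS_MOUNT / batch_name, SCRATCH)
    return SCRATCH

batch_dir = pull_batch("batch_0001")                   # placeholder batch name
# train_on(batch_dir)                                  # your training step goes here
```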
Are you planning to feed the EC2 instance 1GB chunks and process one batch at a time? If so, yes, you'd be uploading those 1GB increments over and over, but data transferred into AWS isn't charged, so even if all 200TB eventually flows through the instance over time, the ingress itself is free. You only pay for what you actually keep stored on AWS, so you could push data straight to the instance, let EC2 do the heavy lifting, and never touch S3 at all.
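A quick back-of-envelope to make that concrete (the batch rate is a made-up assumption; the only pricing fact relied on here is that inbound transfer to AWS is free):

```python
# Rough monthly ingress if you keep cycling 1GB batches through the instance.
# The batches-per-hour figure is an assumption for illustration only.
batch_gb = 1
batches_per_hour = 4            # assumed training throughput
hours_per_month = 730

monthly_ingress_gb = batch_gb * batches_per_hour * hours_per_month
print(f"~{monthly_ingress_gb} GB uploaded into EC2 per month")

# Inbound data transfer to AWS is free, so that ingress costs $0.
# The real bill is the EC2 instance hours plus whatever scratch EBS volume you attach.
print("ingress cost: $0 (AWS doesn't charge for inbound transfer)")
```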
That makes sense. Yes, I'll be processing the batches sequentially on the same instance. Just trying to figure out the most cost-effective way to manage all this data!
That sounds like a solid option! I'd love to avoid S3 entirely if possible, though. I guess I just need to stage enough data at once to make it efficient!