How can I merge large zip files using AWS Lambda?

Asked By BioDataWizard42 On

Hey everyone! I'm working with a large biology dataset totaling roughly 500 GB, and I need to merge it into a single zip file for storage on S3. Requests for this data are infrequent and usually small, so I'm considering AWS Lambda for the export process, but I can foresee complications with the larger requests. My plan: split the data into chunks, have Lambda download the individual files from S3, zip each chunk, and upload the chunk zips back to S3; once all parts are processed, combine those zips into one large archive. I want to avoid the cost of spinning up an EC2 instance for these rare large exports, and everything has to be streamed to prevent memory issues. If anyone has experience with similar situations or good solutions, I'd love to hear your thoughts! Thanks a lot!

7 Answers

Answered By MultiPartMaster_44 On

My first thought would be to create the final large zip as a multipart upload. This way, you can break the task down over several Lambda invocations, possibly using SQS to manage the workflow.
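A minimal sketch of that pattern, with hypothetical bucket and key names: one invocation starts the multipart upload, per-chunk invocations each contribute a part (passing the UploadId along, e.g. in an SQS message), and a final invocation completes it.

```python
# Sketch: one S3 multipart upload driven across several steps/invocations.
# Bucket and key names are hypothetical; the boto3 calls are the real API.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-export-bucket"          # hypothetical
KEY = "exports/merged-dataset.zip"   # hypothetical

# Step 1 (first invocation): start the upload and pass the UploadId along,
# e.g. inside an SQS message.
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]

# Step 2 (per-chunk invocations): each uploads one part.
# Every part except the last must be at least 5 MiB.
def upload_chunk(part_number: int, data: bytes) -> dict:
    resp = s3.upload_part(
        Bucket=BUCKET, Key=KEY, UploadId=upload_id,
        PartNumber=part_number, Body=data,
    )
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

# Step 3 (final invocation): stitch the recorded parts into one object.
parts = [upload_chunk(1, b"x" * (5 * 1024 * 1024)), upload_chunk(2, b"tail")]
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```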

Answered By PythonPro_19 On

I’ve built something like this in Python! Just keep Lambda's memory limits in mind when working with large files. boto3 handles the S3 side (downloads and uploads) nicely; the zipping itself is done with Python's standard zipfile module.
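For illustration, a rough sketch of the streaming side with boto3 and zipfile; the bucket and keys are placeholders. Each object is streamed into the archive about a megabyte at a time, so memory use stays flat, though the zip itself lands in /tmp (configurable up to 10 GB on Lambda).

```python
# Sketch: zip S3 objects on Lambda without holding whole files in memory.
# Bucket and key names are hypothetical.
import boto3
import zipfile

s3 = boto3.client("s3")

def make_zip(bucket: str, keys: list[str], out_path: str = "/tmp/part.zip") -> str:
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for key in keys:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"]
            # Stream the object into the archive ~1 MiB at a time.
            with zf.open(key, "w") as dest:
                for chunk in body.iter_chunks(chunk_size=1024 * 1024):
                    dest.write(chunk)
    return out_path

# Then upload the finished chunk archive back to S3, e.g.:
# s3.upload_file(make_zip("my-data-bucket", ["a.fastq", "b.fastq"]),
#                "my-data-bucket", "zips/part-001.zip")
```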

Answered By QueryMaster_5 On

Quick question—how is that 500GB dataset structured? What are the sizes of the individual files you're working with?

Answered By TechyNerd_89 On

There's a StackOverflow thread describing how to combine objects in S3 into one big object without any temporary storage. It uses the multipart upload API, which is solid for your data size. I'd also recommend testing the code on EC2 first to measure memory usage and execution time: if it exceeds 15 minutes or needs more than 10 GB of memory, you'll have to switch to a Fargate container, which may suit this workload better than Lambda.
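If it helps, here's roughly what that server-side concatenation looks like with boto3's upload_part_copy (names are made up): S3 copies the bytes itself, so the Lambda never downloads them. Every source object except the last must be at least 5 MiB, and note that raw byte concatenation only yields a usable archive for formats that tolerate it (see the zip caveat in a later answer).

```python
# Sketch: concatenate existing S3 objects server-side via UploadPartCopy.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-export-bucket"                              # hypothetical
PART_KEYS = ["zips/part-001.zip", "zips/part-002.zip"]   # hypothetical
DEST_KEY = "exports/merged.bin"

upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=DEST_KEY)["UploadId"]
parts = []
for i, key in enumerate(PART_KEYS, start=1):
    # S3 copies each source object into the destination; nothing is downloaded.
    resp = s3.upload_part_copy(
        Bucket=BUCKET, Key=DEST_KEY, UploadId=upload_id,
        PartNumber=i, CopySource={"Bucket": BUCKET, "Key": key},
    )
    parts.append({"PartNumber": i, "ETag": resp["CopyPartResult"]["ETag"]})

s3.complete_multipart_upload(
    Bucket=BUCKET, Key=DEST_KEY, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```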

CloudGuru_31 -

You're spot on! Just to add: S3 directory buckets (Express One Zone) let you append to objects, without the 5 MB minimum part size that multipart uploads impose. Appending piece by piece is doable there, and you can copy the finished object over to a general-purpose bucket afterwards.
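For completeness, a heavily hedged sketch of that append flow: it assumes a boto3 version new enough to expose PutObject's WriteOffsetBytes parameter for directory buckets, so double-check the current docs before relying on it. Bucket and key names are hypothetical.

```python
# Sketch: piece-by-piece append in an S3 directory bucket (Express One Zone).
# Assumes boto3 exposes WriteOffsetBytes on put_object; verify against docs.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-exports--usw2-az1--x-s3"  # directory buckets use '<name>--<az-id>--x-s3'
KEY = "exports/merged.tar"

def append_piece(data: bytes) -> None:
    """Append data to KEY, creating the object on the first call."""
    try:
        offset = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    except ClientError:
        # Object doesn't exist yet: a plain PutObject creates it.
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=data)
        return
    # Subsequent writes must target the current end of the object.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=data, WriteOffsetBytes=offset)
```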

Answered By LambdaLover_67 On

You'll need to split the heavy processing across multiple Lambda invocations, since a single invocation is capped at 15 minutes. Chaining invocations is an efficient way to handle long tasks without hitting that limit.
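A toy sketch of that chaining pattern (the function name and batch size are made up): each invocation processes one slice of the key list, then asynchronously re-invokes the worker with whatever remains.

```python
# Sketch: split long work across invocations by self-chaining via async invoke.
import json
import boto3

lam = boto3.client("lambda")

def handler(event, context):
    keys = event["keys"]
    batch, remaining = keys[:100], keys[100:]   # arbitrary batch size
    process(batch)                              # your download-and-zip work
    if remaining:
        lam.invoke(
            FunctionName="merge-zip-worker",    # hypothetical function name
            InvocationType="Event",             # fire-and-forget, async
            Payload=json.dumps({"keys": remaining}),
        )

def process(batch):
    pass  # placeholder for the actual zipping step
```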

Answered By DataFanatic_77 On

Using Fargate sounds like a great idea! However, remember that merging zip files might not work as expected: if you simply concatenate zips byte-for-byte, most tools will only see the last archive's contents, because a zip's central directory sits at the end of the file and readers locate entries from there. Just a heads up!
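To make that concrete, here's a safe-merge sketch using Python's zipfile: it rewrites every member into a fresh archive, which rebuilds the central directory correctly, unlike raw byte concatenation. Paths are hypothetical.

```python
# Sketch: merge zips by re-writing members instead of concatenating bytes.
import zipfile

def merge_zips(zip_paths: list[str], out_path: str) -> None:
    with zipfile.ZipFile(out_path, "w") as out:
        for path in zip_paths:
            with zipfile.ZipFile(path) as src:
                for info in src.infolist():
                    # Stream each member across so large entries don't
                    # sit fully in memory.
                    with src.open(info) as fin, out.open(info, "w") as fout:
                        while chunk := fin.read(1024 * 1024):
                            fout.write(chunk)

# merge_zips(["/tmp/part-001.zip", "/tmp/part-002.zip"], "/tmp/merged.zip")
```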

Answered By BuildItRight_22 On

Just a reminder, zipping isn’t the only option. You could produce a .tar file instead; the tar format was designed for sequential archiving, and uncompressed tar archives can be concatenated. And if you’re open to other platforms, Azure has ‘append blobs’ that could simplify this process!
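A small illustration of why tar fits here: parts written as uncompressed tars can be concatenated byte-for-byte (even server-side with UploadPartCopy), and Python's tarfile can read the merged result if you tell it to skip the zero padding between parts. Paths are made up.

```python
# Sketch: uncompressed tar parts survive plain concatenation.
import tarfile

# Each invocation could build one uncompressed part:
with tarfile.open("/tmp/part-001.tar", "w") as tf:
    tf.add("/tmp/sample.fastq", arcname="sample.fastq")  # hypothetical file

# After concatenating the parts (locally or via UploadPartCopy),
# ignore_zeros=True skips the end-of-archive padding between them:
with tarfile.open("/tmp/merged.tar", ignore_zeros=True) as tf:
    for member in tf.getmembers():
        print(member.name, member.size)
```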
