Should I Store One Small File Per SQS Message in S3, or Batch Multiple Messages Together?

Asked By CuriousCoder42 On

I'm working on an application where events are sent to SQS, and then a consumer processes those messages and writes them to S3. The messages are quite small, and ultimately I need to load these files into a data warehouse. I'm trying to decide whether it's better to create one S3 file for each message, resulting in lots of tiny files, or to combine multiple messages into larger files before sending them to S3. If batching is the way to go, what are the common strategies for this—are people using size-based, time-based, or a combination of both? The data doesn't need to be real-time, but it does need to be available in the data warehouse within 5-10 minutes of receiving the event. I'm looking for best practices or lessons learned on this.

3 Answers

Answered By DataSaver99 On

When thinking about your approach, keep in mind that while S3 Standard has no minimum object size, several storage classes (Standard-IA and One Zone-IA, for example) bill every object as if it were at least 128 KB, so tiny objects can cost more to store than their actual size suggests. If you write one file per message, you'll also rack up a lot of PUT requests, which gets expensive at volume. Combining many messages into a single file is usually the way to go. Also, check out how CloudTrail handles it: it delivers gzip-compressed log files that each contain many events. This could be a good model for you.
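To put rough numbers on the request-cost point, here is a back-of-the-envelope sketch. The price of $0.005 per 1,000 PUT requests is an assumption (it matches S3 Standard in us-east-1 at the time of writing, but check the current pricing page), and the traffic of 100 events per second is purely hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, for a rough estimate


def monthly_put_cost(events_per_second: float,
                     events_per_object: float,
                     usd_per_1000_puts: float = 0.005) -> float:
    """Estimate the monthly S3 PUT request cost in USD.

    usd_per_1000_puts is an assumed price, not a quoted one.
    """
    puts = events_per_second * SECONDS_PER_MONTH / events_per_object
    return puts / 1000 * usd_per_1000_puts


# Hypothetical workload: 100 events per second.
one_per_message = monthly_put_cost(100, events_per_object=1)
one_per_minute = monthly_put_cost(100, events_per_object=100 * 60)

print(f"one object per message: ${one_per_message:,.2f}/month")
print(f"one object per minute:  ${one_per_minute:,.2f}/month")
```

Under those assumptions, one object per message comes to roughly $1,296 a month in PUT requests alone, while one object per minute comes to well under a dollar. Storage, GET, and LIST charges are not included.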

Answered By PrudentPiler On

Just a heads up, if you're creating one data point per S3 file, it could get really costly due to the S3 API request charges. A ton of tiny files also adds up on the read side, since every file is another request when you list and load them. With batching, you should also think about what happens if a batch is lost or a write fails: one failure now affects many events instead of one, which complicates recovery. Just remember, S3 is typically used as a raw data store before pushing that data into a database.

QueryQuokka -

If S3 is more of a staging area for your raw data, have you considered other options for handling this data more efficiently before it gets put into a database?

Answered By BatchingBoss On

I think time-based batching is the most effective. If you flush a batch once a minute, each consumer makes roughly one S3 write per minute instead of one per message, which significantly reduces your S3 request costs. That still leaves plenty of headroom within your 5-10 minute window for loading into the warehouse.
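For the "combination of both" option from the question, here is a minimal sketch of a buffer that flushes when either a size or an age threshold is reached. The `Batcher` class, its thresholds, and the `flush` callback are all hypothetical names for illustration. In a real consumer the callback would presumably upload the payload with boto3's `put_object` and only then delete the messages with `delete_message_batch`, but that wiring is left out so the sketch runs on its own.

```python
import gzip
import json
import time
from typing import Callable, List, Optional


class Batcher:
    """Buffers messages; flushes when the batch is big enough or old enough."""

    def __init__(self,
                 flush: Callable[[bytes, List[str]], None],
                 max_bytes: int = 5 * 1024 * 1024,   # assumed size threshold
                 max_age_seconds: float = 60.0,      # assumed age threshold
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock
        self._lines: List[bytes] = []
        self._receipts: List[str] = []
        self._size = 0
        self._first_at: Optional[float] = None

    def add(self, body: dict, receipt: str) -> None:
        """Append one message as a JSON line, then check the thresholds."""
        line = json.dumps(body, separators=(",", ":")).encode("utf-8") + b"\n"
        if self._first_at is None:
            self._first_at = self._clock()  # age is measured from the first message
        self._lines.append(line)
        self._receipts.append(receipt)
        self._size += len(line)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush if either threshold is met; call this on a timer as well."""
        if not self._lines or self._first_at is None:
            return False
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._first_at >= self._max_age
        if too_big or too_old:
            self.flush_now()
            return True
        return False

    def flush_now(self) -> None:
        """Hand the gzip-compressed batch and its receipts to the callback."""
        if not self._lines:
            return
        payload = gzip.compress(b"".join(self._lines))
        # If the callback raises, the buffer is left intact for a retry.
        self._flush(payload, list(self._receipts))
        self._lines.clear()
        self._receipts.clear()
        self._size = 0
        self._first_at = None
```

The consumer loop would call `add()` for each received message and `maybe_flush()` whenever a poll comes back empty, so a quiet queue still flushes on time. Deleting SQS messages only after the S3 write succeeds means a crashed consumer loses nothing, at the cost of occasional duplicates (standard queues are at-least-once anyway), so the warehouse load should tolerate or deduplicate repeated events.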
