How to Efficiently Merge XML Files from Multiple Devices?

Asked By TechieGamer42

I have a system with over 31,000 devices sending data every 5 minutes through a REST API. Each request triggers a Lambda function that saves the payload as a separate XML file per device, organized in an S3 bucket by date and serial number. With 597 files created per device daily, that adds up to around 18.5 million small files!

I'm looking for the best way to merge these into fewer, larger files, ideally one file per hour containing that hour's worth of data. The merged files also need to be at least 128 KB because of S3 Glacier's minimum billable object size. Can anyone suggest the best tools or methods for the merging, such as Lambda, Airflow, Kinesis, or Glue? Thanks for any tips!

4 Answers

Answered By CloudWizard57

You might want to consider using SQS to offload the requests instead of directly hitting S3, especially if the payloads are small. You can push the incoming data to an SQS queue and have a Lambda function pull from there and merge files into larger ones. This way, you can efficiently manage the size of your uploads and reduce S3 costs. Also, you could set your Lambda to run hourly to bundle messages and generate a file for that hour.
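Here's a minimal sketch of that hourly bundling Lambda, assuming each SQS message carries the device serial as a message attribute and that the goal is one merged object per device per hour. The queue URL, bucket name, and <readings> wrapper element are all hypothetical:

```python
import boto3
from collections import defaultdict
from datetime import datetime, timezone

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/device-data"  # hypothetical
BUCKET = "device-archive"  # hypothetical

def handler(event, context):
    by_device = defaultdict(list)
    receipts = []

    # Drain the queue in batches of 10 (the SQS maximum per receive call).
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            MessageAttributeNames=["All"],
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            # Assumes the producer sets a SerialNumber message attribute.
            serial = msg["MessageAttributes"]["SerialNumber"]["StringValue"]
            by_device[serial].append(msg["Body"])
            receipts.append({"Id": msg["MessageId"], "ReceiptHandle": msg["ReceiptHandle"]})

    # Write one object per device per hour, preserving the date/serial layout.
    hour = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H")
    for serial, fragments in by_device.items():
        body = '<?xml version="1.0"?>\n<readings>\n' + "\n".join(fragments) + "\n</readings>"
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{hour}/{serial}/merged.xml",
            Body=body.encode("utf-8"),
        )

    # Delete processed messages, again in batches of 10.
    for i in range(0, len(receipts), 10):
        sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=receipts[i:i + 10])
```

In practice you'd also want to guard against Lambda's 15-minute timeout (31,000 devices can mean a lot of messages per hour) and, as the comment below notes, SQS's 256 KB message cap.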

DataNerd99 -

Keep in mind that SQS caps messages at 256 KB, so if payloads sometimes exceed that (like when a device backfills two days' worth of data), SQS alone won't cut it. You'd have to split those payloads, or stage the oversized ones in S3 and queue a pointer instead.

Answered By StreamlineGuru

Have you looked into Kinesis Firehose? It can help you manage the data stream and combine files as they come in. If you set it up correctly, it can organize your files into the right structure while reducing the required number of S3 objects.
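If you go that route, here's a sketch of the delivery stream setup, with hypothetical role, bucket, and stream names. The buffering hints are what turn many small records into a few large objects. One caveat: Firehose's built-in prefix expressions only cover timestamps, so preserving a per-serial folder layout would need dynamic partitioning with a Lambda transform, since these records are XML rather than JSON:

```python
import boto3

firehose = boto3.client("firehose")

# One-time setup: buffer incoming records and flush them to S3 as large objects.
firehose.create_delivery_stream(
    DeliveryStreamName="device-xml-merge",  # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3",  # hypothetical
        "BucketARN": "arn:aws:s3:::device-archive",               # hypothetical
        # Expanded at delivery time, so objects still land under a
        # date/hour prefix much like the current layout.
        "Prefix": "data/!{timestamp:yyyy/MM/dd/HH}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        # Flush at 64 MB or 15 minutes, whichever comes first.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 900},
    },
)

# The ingest Lambda then forwards each payload instead of writing to S3 itself.
# Firehose concatenates records as-is, so append a newline as a separator.
def forward(xml_payload: str) -> None:
    firehose.put_record(
        DeliveryStreamName="device-xml-merge",
        Record={"Data": xml_payload.encode("utf-8") + b"\n"},
    )
```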

LearningAWS2023 -

Firehose is a good shout, but I need to maintain my current S3 structure. I'm worried I might lose that organization.

Answered By DataProcessorPro

You could run a Glue job that merges the files in a single batch pass, possibly as a Python shell job for ease of use. Another option is AWS Batch driven by Step Functions to automate the merging on a schedule, which could save time and resources too.
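A minimal sketch of what that merge pass could look like, runnable as a Glue Python shell job or inside an AWS Batch container; the bucket name, key layout, and <readings> wrapper are assumptions based on the question's description:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "device-archive"  # hypothetical

def merge_hour(date: str, serial: str, hour: str) -> None:
    """Concatenate every small XML file for one device-hour into one object."""
    prefix = f"{date}/{serial}/{hour}/"
    paginator = s3.get_paginator("list_objects_v2")
    parts = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".xml"):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                parts.append(body.decode("utf-8"))
    if not parts:
        return
    merged = '<?xml version="1.0"?>\n<readings>\n' + "\n".join(parts) + "\n</readings>"
    s3.put_object(
        Bucket=BUCKET,
        Key=f"merged/{date}/{serial}/{hour}.xml",
        Body=merged.encode("utf-8"),
    )
```

A Step Functions Map state could fan this out across serial numbers to run the whole day in parallel.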

CloudTrailBlazer -

Could you share more on how that implementation can work? I'm really interested in setting up a daily process for file merging!

Answered By DiskSpaceSaver

Consider the total cost of whichever processing method you pick. A Spark job can handle the merging effectively and read your XML files directly as text. If you schedule EMR to run the job daily, the cluster costs stay manageable. Just watch out for the cost of re-fetching millions of small objects from S3, since every merge pass has to read them all back.
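Here's a rough PySpark sketch of that daily EMR job, assuming file keys shaped like YYYY/MM/DD/<serial>/<HHmmss>.xml (the bucket and paths are hypothetical). Reading whole files keeps each XML document intact, and repartitioning by device and hour yields roughly one large output file per device-hour:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xml-merge").getOrCreate()

# (path, content) pairs: each small file arrives as one intact string.
rdd = spark.sparkContext.wholeTextFiles("s3://device-archive/2024/01/15/*/*.xml")

def to_row(pair):
    path, content = pair
    # Assumed key layout: .../YYYY/MM/DD/<serial>/<HHmmss>.xml
    parts = path.rstrip("/").split("/")
    serial, filename = parts[-2], parts[-1]
    return (serial, filename[:2], content)  # first two digits = hour

df = rdd.map(to_row).toDF(["serial", "hour", "xml"])

# Repartitioning by (serial, hour) groups each device-hour into one task,
# so partitionBy then writes roughly one large file per device-hour.
(df.repartition("serial", "hour")
   .write
   .partitionBy("serial", "hour")
   .mode("overwrite")
   .text("s3://device-archive/merged/2024/01/15/"))
```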

NimbusArchitect -

True, and that back-and-forth with S3 could stack up in costs. It's a fine line to walk!
