I'm designing an API that triggers a workflow to process large folders containing codebases, typically around 1GB each. The workflow isn't heavily compute-driven, but I need fast regex searches across files. I want to keep costs low and the architecture simple since this will be used infrequently on-demand. Here's my current setup:
- I plan to store each project folder as a zipped file in S3.
- When a request comes in, I'll use a Lambda function to:
- Download and unzip the folder
- Perform regex searches and run some tasks with an LLM (using the OpenAI API). A rough sketch of this handler is at the end of the post.
More details:
1. Total size: 1GB per project.
2. Expected use: 10-20 requests/day for one specific project, with plans to expand.
3. Response time isn't critical; the entire workflow averages 15-20 seconds.
4. The regex requirement is client-specific: patterns are generated from various inputs per request.
5. Semantic or symbol-aware search isn't necessary.
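For concreteness, here's a rough sketch of the handler I have in mind. The bucket name and event fields are placeholders, the archive is read into memory for simplicity, and the OpenAI step is stubbed out:

```python
import io
import json
import os
import re
import zipfile

import boto3

s3 = boto3.client("s3")

# Placeholder; the real bucket/key layout is still undecided.
BUCKET = os.environ.get("PROJECT_BUCKET", "my-project-bucket")


def lambda_handler(event, context):
    project_key = event["project_key"]      # e.g. "projects/acme.zip"
    pattern = re.compile(event["pattern"])  # regex generated per request

    # Download the zipped project into memory (a 1 GB archive also fits on
    # /tmp if ephemeral storage is raised; memory settings to be tuned).
    body = s3.get_object(Bucket=BUCKET, Key=project_key)["Body"].read()

    matches = []
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as fh:
                for lineno, line in enumerate(fh, start=1):
                    text = line.decode("utf-8", errors="replace")
                    if pattern.search(text):
                        matches.append({"file": name, "line": lineno, "text": text.rstrip()})

    # The OpenAI call would go here (e.g. summarising `matches`);
    # omitted to keep the sketch focused on the S3 + regex part.
    return {"statusCode": 200, "body": json.dumps({"matches": matches[:1000]})}
```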
6 Answers
Since you're getting about 10-20 requests daily with 1GB files, don't overcomplicate it! What you have in mind works fine. Consider whether you can stream the whole process for efficiency: stream from S3, decompress, search, and collect results as you go. If streaming isn't feasible, run AWS Lambda Power Tuning to find the memory setting with the best price-performance balance.
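To show what I mean by streaming, here's a rough sketch. One caveat: a plain .zip isn't really streamable (its index sits at the end of the file), so this assumes you'd store the project as a .tar.gz instead; bucket, key and function names are placeholders:

```python
import re
import tarfile

import boto3

s3 = boto3.client("s3")


def stream_search(bucket: str, key: str, pattern: str):
    """Stream a .tar.gz straight from S3 and regex-search it without
    writing the archive to disk. Bucket/key are placeholders."""
    regex = re.compile(pattern)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]  # file-like stream

    hits = []
    # mode="r|gz" is tarfile's pure streaming mode: members are read
    # sequentially, which is all a single regex pass needs.
    with tarfile.open(fileobj=body, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            if fh is None:
                continue
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line.decode("utf-8", errors="replace")):
                    hits.append((member.name, lineno))
    return hits
```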
One consideration is how long it takes to process that 1GB file: Lambda has a hard timeout of 15 minutes. What's your workflow once you get the results? You mentioned producing a text report using regex. If the workflow isn't resource-heavy and the regex part is the most intensive step, then exposing this via an API sounds reasonable.
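If it's useful: the Lambda context object exposes the remaining execution time, so you can stop and return partial results before hitting that hard limit. A minimal sketch; `iter_project_files` and `search_chunk` are hypothetical stand-ins for your own unzip/search steps:

```python
def lambda_handler(event, context):
    results = []
    for chunk in iter_project_files(event):   # hypothetical iterator over files
        results.extend(search_chunk(chunk))   # hypothetical regex step
        # Leave a 30-second safety margin before Lambda's hard timeout
        # (15 minutes maximum) so we can still return what we have.
        if context.get_remaining_time_in_millis() < 30_000:
            return {"status": "partial", "results": results}
    return {"status": "complete", "results": results}
```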
I do something similar with AWS Step Functions. My process goes like this: when the API request comes in, I create a presigned URL for the user to upload their file. An S3 event then triggers an SQS queue that starts the Step Function. Using the Map state, you can process many items in parallel, which is helpful if you're handling lots of files. This approach is serverless and integrates easily with other AWS services. You can also return results through an SNS topic or a presigned URL the user downloads from.
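The presigned-URL part is just a couple of boto3 calls; roughly like this (bucket name and expiry are placeholders, and the S3 → SQS → Step Functions wiring is all configuration rather than code):

```python
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "upload-bucket"  # placeholder


def lambda_handler(event, context):
    # Give the caller a one-off key so uploads never collide.
    key = f"uploads/{uuid.uuid4()}.zip"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=900,  # URL valid for 15 minutes
    )
    # The client PUTs the file to `url`; the resulting S3 event then
    # feeds SQS, which kicks off the Step Function.
    return {"upload_url": url, "key": key}
```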
Cool. Thanks for the input! We’re aiming to keep things straightforward and serverless since we're a small startup and only have one client using this workflow for now.
I think you're on the right track, but I need a few more details to really help. What's the total size of your data? How often do the files change, and do you have any specific access-control requirements? It might also be worth storing the data locally on a server and just running grep over it, if that's feasible for you.
I've updated the post with more details; please take a look!
Here are a couple of suggestions: zipping might not provide enough savings to justify the overhead, so you could store a tar instead and stream the content for regex processing. Additionally, consider a shared EFS volume to cache unpacked projects; evicting projects on an LRU basis would keep the space under control (rough sketch below).
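For the EFS idea, a rough sketch of LRU eviction. It assumes the filesystem is mounted at /mnt/projects and that each cached project gets a `.last_used` marker touched on every access; the 50 GB budget is arbitrary:

```python
import os
import shutil

CACHE_ROOT = "/mnt/projects"        # assumed EFS mount path
MAX_CACHE_BYTES = 50 * 1024 ** 3    # arbitrary size budget


def dir_size(path: str) -> int:
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )


def touch(project_path: str) -> None:
    """Record 'last used' by bumping a marker file's mtime on each access."""
    marker = os.path.join(project_path, ".last_used")
    with open(marker, "a"):
        os.utime(marker, None)


def evict_lru_until_under_budget() -> None:
    projects = [
        os.path.join(CACHE_ROOT, d)
        for d in os.listdir(CACHE_ROOT)
        if os.path.isdir(os.path.join(CACHE_ROOT, d))
    ]
    # Oldest marker first = least recently used first.
    projects.sort(key=lambda p: os.path.getmtime(os.path.join(p, ".last_used")))
    sizes = {p: dir_size(p) for p in projects}
    total = sum(sizes.values())
    for victim in projects:
        if total <= MAX_CACHE_BYTES:
            break
        shutil.rmtree(victim)
        total -= sizes[victim]
```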
Thanks for the suggestion!
AWS Athena might be worth a look. It can query your files in S3 directly and handles common compression formats such as gzip transparently (it doesn't read .zip archives, though, so you'd store the data gzip-compressed). This could save you some processing time.
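To give a feel for it, running a query from code would look roughly like this. The database, table and output location are placeholders, and it assumes your source lines have already been exposed as an Athena table over gzip-compressed text:

```python
import time

import boto3

athena = boto3.client("athena")


def run_regex_query(pattern: str) -> list:
    # Placeholder table: "codebase" maps each source line to a row.
    # Don't interpolate untrusted patterns like this outside a sketch.
    query = f"SELECT file_path, line FROM codebase WHERE regexp_like(line, '{pattern}')"
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "projects_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes (fine for a low-volume, on-demand API).
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return rows[1:]  # first row is the header
```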
The whole workflow takes about 20-30 seconds, and I want to trigger this via an API and return the generated output for use by different services.