I'm working on a project where I need to scrape an API that only allows 1000 calls per hour, and I need around 41,000 calls (one for every zip code in the US). The results will go into a DynamoDB (DDB) caching table and an items table, and I also have a DDB tracker table to monitor progress and handle errors like rate limiting and failures. I previously ran a script that took around 100 hours, which is way too long. Right now, I use a monthly EventBridge rule to kick things off but I'm not sure how to repeatedly invoke the Lambda without overshooting the rate limit. Should I blast 1000 calls in one go, or spread them out? I want to avoid excessive costs related to running functions and am curious about technologies like Step Functions or anything else that could help streamline this process. Any advice?
4 Answers
Step Functions is really helpful for situations like this! It can manage a long-running process without you having to sleep inside a Lambda (for example with setTimeout), which bills you for idle compute time. You could set up a state machine that processes a batch of zip codes, logs progress in your tracker table, and then hands off control to the next step. Wait states between batches keep you under the API's rate limit, so you won't be paying for idle Lambda time or risking infinite recursion from a function that keeps re-invoking itself. Plus, you get built-in retry logic and visibility into the workflow!
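Here's a rough sketch of what that state machine could look like, written as a Python dict that serializes to Amazon States Language. It assumes your batch Lambda handles up to 1000 zip codes per invocation and returns JSON with a `remaining` count read from the tracker table; that field name, the state names, and the ARN are placeholders I made up, not anything from your setup.

```python
import json


def build_state_machine(lambda_arn: str, wait_seconds: int = 3600) -> dict:
    """Build an ASL definition: run a batch, wait an hour, repeat until done.

    Assumption: the Lambda returns {"remaining": <int>}, the number of zip
    codes still unprocessed according to the tracker table.
    """
    return {
        "Comment": "Process one batch per rate-limit window until nothing remains",
        "StartAt": "ProcessBatch",
        "States": {
            "ProcessBatch": {
                "Type": "Task",
                "Resource": lambda_arn,
                # Retry transient failures with exponential backoff.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "MoreWork",
            },
            "MoreWork": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.remaining",
                        "NumericGreaterThan": 0,
                        "Next": "WaitForNextWindow",
                    }
                ],
                "Default": "Done",
            },
            # No Lambda is running while the workflow sits in this state.
            "WaitForNextWindow": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "ProcessBatch",
            },
            "Done": {"Type": "Succeed"},
        },
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:scrape-batch"
    print(json.dumps(build_state_machine(arn), indent=2))
```

With 41,000 zip codes and batches of 1000, that loop would run about 41 times. Your monthly EventBridge rule would only need to start one execution.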
You could also try getting multiple API keys to make parallel calls! That way, you can maximize your scraping efficiency without running into limits.
Another idea is to have your monthly invocation create a new hourly EventBridge rule. Once you've processed all the items for that month, you can delete the rule. That keeps the scraping on a steady hourly rhythm without you having to re-trigger anything manually.
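A minimal sketch of that create-then-delete pattern with boto3 might look like the following. The rule name and target ID are hypothetical, and the functions take the EventBridge client as an argument (you'd pass `boto3.client("events")`), so nothing here is tied to a specific account.

```python
RULE_NAME = "zip-scrape-hourly"    # hypothetical rule name
TARGET_ID = "scrape-batch-lambda"  # hypothetical target ID


def start_hourly_schedule(events, lambda_arn: str, rule_name: str = RULE_NAME) -> None:
    """Create (or update) an hourly rule and point it at the batch Lambda.

    `events` is expected to be a boto3 EventBridge client,
    e.g. boto3.client("events").
    """
    events.put_rule(
        Name=rule_name,
        ScheduleExpression="rate(1 hour)",
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": TARGET_ID, "Arn": lambda_arn}],
    )


def stop_hourly_schedule(events, rule_name: str = RULE_NAME) -> None:
    """Tear the schedule down once the tracker table shows nothing left.

    Targets have to be removed before the rule itself can be deleted.
    """
    events.remove_targets(Rule=rule_name, Ids=[TARGET_ID])
    events.delete_rule(Name=rule_name)
```

The monthly Lambda would call `start_hourly_schedule`, and the batch Lambda would call `stop_hourly_schedule` when it finds no unprocessed zip codes. The batch Lambda also needs a resource-based permission that lets `events.amazonaws.com` invoke it, which this sketch leaves out.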
How often do you actually need the data? If it's just a one-off job or something you need daily, maybe consider using a long-running executor like an ECS task. ECS can handle concurrent requests better than Lambda in some cases. And I’d still throw in SQS to queue up the zip codes and control how fast you make those API calls.
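For the long-running route, the pacing math is simple: 1000 calls per hour works out to one call every 3.6 seconds, and 41,000 calls need at least 41 hours, so a 100-hour run leaves a lot of headroom. Below is a rough sketch of a paced worker loop. The `call` argument stands in for whatever function hits the API and writes to DynamoDB (not shown), and the injectable `sleep` and `clock` parameters are only there to make it testable.

```python
import time
from typing import Callable, Iterable


def pacing_plan(total_calls: int, limit_per_hour: int) -> tuple[float, float]:
    """Return (seconds between calls, minimum total hours) for an hourly limit."""
    interval_seconds = 3600 / limit_per_hour
    min_hours = total_calls / limit_per_hour
    return interval_seconds, min_hours


def run_paced(
    items: Iterable[str],
    call: Callable[[str], object],
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Invoke `call` once per item, starting calls at least `interval_seconds` apart.

    Slots are scheduled from the start time rather than from the end of each
    call, so slow API responses do not stretch the overall run.
    """
    done = 0
    next_slot = clock()
    for item in items:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        call(item)
        done += 1
        next_slot += interval_seconds
    return done


if __name__ == "__main__":
    interval, hours = pacing_plan(total_calls=41_000, limit_per_hour=1_000)
    print(f"one call every {interval} s, at least {hours} h in total")
```

Spreading the calls out evenly like this avoids bursts, at the cost of keeping one small task alive for the whole run. If the items come from SQS instead of a list, the same loop applies with `items` replaced by messages pulled from the queue.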

Exactly! A distributed map in Step Functions is perfect for your case.