I'm setting up a workflow where files are uploaded to S3, which sends messages to an SQS queue that triggers a Lambda function. The Lambda calls an API on a SaaS platform. If the SaaS platform is down, the Lambda retries the call twice, and then the message moves to a dead-letter queue (DLQ). I'm trying to figure out the best way to redrive and reprocess those messages. Should I use EventBridge to schedule a Lambda that redrives the messages back to the SQS queue, or would it be better to use Step Functions? I'm also thinking it might be more efficient for the Lambda to check the DLQ first, redrive any failed messages, and then process new ones from SQS. What do you all think?
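For reference, this is roughly what I had in mind for the EventBridge-scheduled redrive Lambda (queue URLs are placeholders, not my real resources):

```python
# Rough sketch of the redrive Lambda I'm considering (queue URLs are placeholders).
# EventBridge would invoke this on a schedule; it moves DLQ messages back to the main queue.
import boto3

sqs = boto3.client("sqs")

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads-dlq"            # placeholder
MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads-queue"   # placeholder


def handler(event, context):
    moved = 0
    while True:
        # Pull a batch of failed messages from the DLQ.
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            # Re-queue the original body on the main queue, then remove it from the DLQ.
            sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return {"redriven": moved}
```

I believe SQS also has a built-in DLQ redrive (StartMessageMoveTask) that could replace the manual loop, but I'm not sure whether that changes the answer.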
1 Answer
It might be better to let SQS handle the normal retry behavior instead of pushing everything to the DLQ after only two attempts, especially since every message is calling the same SaaS API. If you raise the maxReceiveCount in the queue's redrive policy, a failed message simply becomes visible again after the visibility timeout and gets retried, so retries are naturally spread out during downtime and you won't flood the DLQ. Letting SQS manage the retries can also help reduce the load on your Lambda during an outage.
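If it helps, here's a minimal sketch of that setup with boto3, assuming a hypothetical queue URL and DLQ ARN: a higher maxReceiveCount so messages survive many retries before landing in the DLQ, and a longer visibility timeout so each retry waits a while.

```python
# Minimal sketch: let the queue's redrive policy handle retries (hypothetical URL/ARN values).
import json
import boto3

sqs = boto3.client("sqs")

MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads-queue"   # hypothetical
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:uploads-dlq"                          # hypothetical

sqs.set_queue_attributes(
    QueueUrl=MAIN_QUEUE_URL,
    Attributes={
        # Each failed receive hides the message for 15 minutes before the next retry.
        "VisibilityTimeout": "900",
        # Allow many retries before the message is finally moved to the DLQ.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": DLQ_ARN,
            "maxReceiveCount": "100",
        }),
    },
)
```

The exact numbers depend on your Lambda timeout and how long you want retries to span; roughly, visibility timeout × maxReceiveCount bounds the retry window before a message reaches the DLQ.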
That makes sense! Can you explain a bit about how the gradual backoff works with SQS? Since the SaaS has a 24-hour recovery time objective, I'm trying to plan for the max number of messages that may be hitting the queue.