I've got a Lambda function that receives records through a trigger and, for each incoming record, writes multiple records to a Kinesis stream. For example, one input record generates ten output records. My concern is what happens if there's a service disruption partway through those Kinesis writes. If I successfully write 9 of the 10 output records and then hit a failure, the trigger will retry the same input record; if the retry succeeds, all 10 output records are written again, leaving 9 duplicates in the stream. I've considered manual deduplication: storing a hash of each output record in a DynamoDB table and checking it before writing. Is this the best approach, or are there other effective strategies?
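To make that concrete, here's a rough sketch of the dedup check I have in mind (the table name `record-dedup` and its `record_hash` key are just placeholders I made up):

```python
import hashlib
import json

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
DEDUP_TABLE = "record-dedup"  # hypothetical table, partition key "record_hash" (S)


def already_written(record: dict) -> bool:
    """Return True if this output record was written by a previous attempt.

    A conditional put makes the check-and-mark a single atomic call: the put
    fails with ConditionalCheckFailedException if the hash already exists.
    """
    record_hash = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    try:
        dynamodb.put_item(
            TableName=DEDUP_TABLE,
            Item={"record_hash": {"S": record_hash}},
            ConditionExpression="attribute_not_exists(record_hash)",
        )
        return False  # first time we've seen this record
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # a previous attempt already marked it
        raise
```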
2 Answers
What you're describing is essentially Lambda idempotency: performing the same operation multiple times yields the same result as performing it once. You'll need some form of persistent storage to track progress, and Lambda Powertools has an idempotency utility built for exactly this. Searching for 'Lambda idempotency' will turn up more background and ideas.
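A minimal sketch with Powertools for AWS Lambda (Python); the table name, stream name, and fan-out logic here are placeholder assumptions. Note that Powertools marks the input record complete only after the wrapped function returns, so a mid-batch failure still retries the whole set of ten puts:

```python
import json

import boto3
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent_function,
)

kinesis = boto3.client("kinesis")
STREAM_NAME = "output-stream"  # placeholder

# Powertools stores one idempotency item per processed input record here.
persistence_layer = DynamoDBPersistenceLayer(table_name="IdempotencyTable")
config = IdempotencyConfig(expires_after_seconds=3600)


@idempotent_function(
    data_keyword_argument="record",
    config=config,
    persistence_store=persistence_layer,
)
def fan_out(record: dict) -> int:
    """Publish the derived records; a repeat invocation for the same input
    record is skipped once this function has completed successfully."""
    derived = [{"source": record, "index": i} for i in range(10)]
    for item in derived:
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(item).encode(),
            PartitionKey=str(item["index"]),
        )
    return len(derived)


def lambda_handler(event, context):
    config.register_lambda_context(context)  # lets Powertools respect the timeout
    for record in event["Records"]:
        fan_out(record=record)
```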
Great point! Assigning a unique identifier to each record would help with tracking too, and knowing the term 'Lambda idempotency' gives me something concrete to search for.
You might want to let duplicates happen and handle them downstream instead, for example with a 'last write wins' approach in your data store. Another option is to build all 10 output records first and only then publish the whole set to the stream inside a try/catch; if any of the 10 fail, throw an error so the trigger retries the entire input record.
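A sketch of that second option, assuming boto3 and a placeholder stream name. Kinesis PutRecords is not atomic and can partially fail, so the sketch inspects FailedRecordCount and raises if anything failed, which makes the trigger retry the whole input record (duplicates are then handled downstream):

```python
import json

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "output-stream"  # placeholder


def publish_all_or_raise(derived_records: list[dict]) -> None:
    """Build the full set first, then publish it as one batch.

    PutRecords can succeed for some entries and fail for others, so check
    FailedRecordCount and raise on any failure rather than continuing.
    """
    entries = [
        {"Data": json.dumps(r).encode(), "PartitionKey": str(i)}
        for i, r in enumerate(derived_records)
    ]
    response = kinesis.put_records(StreamName=STREAM_NAME, Records=entries)
    if response["FailedRecordCount"] > 0:
        # Failed entries carry an ErrorCode in the matching response slot.
        failed = [
            e for e, r in zip(entries, response["Records"]) if "ErrorCode" in r
        ]
        raise RuntimeError(f"{len(failed)} of {len(entries)} records failed")
```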
But what happens if the write that records your progress fails? You may have published the data successfully yet be unable to mark it as done, so the retry produces duplicates anyway.