Hey everyone! My team relies heavily on AWS Lambda functions to process messages from SQS queues. Some of these functions have long execution timeouts, around 10 to 15 minutes, and messages can be retried up to 10 times. Because the queue's visibility timeout has to be a multiple of the Lambda timeout (AWS recommends at least six times the function timeout), a message can keep failing for a very long time before it finally lands in the dead-letter queue (DLQ). We're looking for a way to get alerted when a significant number of messages are failing to process before they reach the DLQ.
Currently, we use Datadog for monitoring. We already have alerts on the number of messages in the DLQ and on Lambda failures, but "Lambda failures" only counts invocations that fail outright. What I'm particularly interested in is monitoring batch item failures, i.e., invocations where the Lambda returns successfully but reports that it could not process some, most, or all of the messages in the batch.
Is there a built-in solution in AWS or Datadog for tracking the number of messages reported as batch item failures? Any insights or custom solutions you're using would be greatly appreciated!
1 Answer
Hey! Great question! You should check out the `aws.lambda.enhanced.batch_item_failures` metric for your Lambda function. Datadog creates this metric automatically for functions where it can access the function's response payload. If you don't see it, submit a support ticket to Datadog and they’ll help you out!
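
For reference, here's a minimal sketch of what a handler that reports batch item failures looks like, since that response payload is where the count comes from. It assumes your SQS event source mapping has `ReportBatchItemFailures` enabled. `process_message` and the `batch_item_failure_count` log field are made-up names for illustration, not anything from your setup or from Datadog. The log line is just one possible fallback: if the built-in metric isn't available to you, you could probably build a log-based metric from a structured line like this.

```python
import json


def process_message(body: dict) -> None:
    """Hypothetical business logic; raises if the message cannot be processed."""
    if body.get("fail"):
        raise ValueError("simulated processing failure")


def handler(event, context):
    """SQS-triggered handler that returns a partial batch response."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            # Only the failed message IDs are reported; SQS makes just these
            # messages visible again instead of retrying the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})

    # Assumed field name: a structured log line that a log-based metric
    # could count if the enhanced metric is not available.
    print(json.dumps({"batch_item_failure_count": len(failures)}))

    return {"batchItemFailures": failures}
```

One thing to keep in mind: an invocation that returns this response counts as successful from Lambda's point of view, which is why a plain "Lambda errors" alert won't catch it.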

Sweet! This is exactly what I was looking for, but I don't have that metric available. I'll reach out to support, thanks!