How do you effectively monitor dead letter queues?

Asked By TechSavvyDude42 On

I'm running SQS in production and, honestly, our dead letter queue (DLQ) management is a mess. I have a CloudWatch alarm set up, but much of my team doesn't trust it, and we've had messages pile up unnoticed.

From talking to a few people recently, it seems no two teams handle this the same way. Some use Lambda functions to monitor the queue and send alerts, others just check it manually (definitely not ideal), and a few have integrated it with Datadog but complain about the cost. What are you using? Is there a practical approach I'm missing, or is everyone getting by with their own makeshift fixes?

4 Answers

Answered By SQSExpert On

Have you thought about setting a proper message retention period on the DLQ (the MessageRetentionPeriod queue attribute)? That way the DLQ self-manages to some extent: messages nobody redrives expire on their own, which helps avoid a buildup of unprocessed messages.
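A minimal sketch of the retention idea, assuming Python with boto3. The helper names and the queue URL are hypothetical. SQS accepts a retention period between 60 seconds and 14 days, passed as a string of seconds. For standard queues, expiry is based on the original enqueue timestamp, so a DLQ whose retention is not longer than the source queue's can drop messages almost as soon as they arrive; the second helper flags that case.

```python
MIN_RETENTION_SECONDS = 60                 # SQS minimum: 1 minute
MAX_RETENTION_SECONDS = 14 * 24 * 3600     # SQS maximum: 14 days


def retention_attributes(days):
    """Build the Attributes dict for set_queue_attributes (values are strings)."""
    seconds = int(days * 24 * 3600)
    if not MIN_RETENTION_SECONDS <= seconds <= MAX_RETENTION_SECONDS:
        raise ValueError(f"retention of {seconds}s is outside the SQS limits")
    return {"MessageRetentionPeriod": str(seconds)}


def dlq_retention_too_short(source_seconds, dlq_seconds):
    """True when the DLQ keeps messages no longer than the source queue does.

    Assumes standard queues, where a message's age carries over to the DLQ.
    """
    return dlq_seconds <= source_seconds


def apply_retention(queue_url, days):
    """Apply the retention period to a queue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the helpers above work without AWS access

    sqs = boto3.client("sqs")
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=retention_attributes(days),
    )
```

For example, apply_retention(dlq_url, 14) would set the DLQ to the 14-day maximum, where dlq_url is whatever your DLQ's URL is. Expiry only caps the backlog; it does not tell you why messages failed, so it complements alerting rather than replacing it.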

Answered By HelpfulTechie On

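The distinction between ApproximateNumberOfMessagesVisible and NumberOfMessagesSent is crucial! We messed that up as well: NumberOfMessagesSent does not count messages that SQS moves to the DLQ after failed processing attempts, so an alarm on it can stay silent while the queue fills up. I hadn't considered the retention period mismatch either. I really wish these things were configured sensibly out of the box!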
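A rough sketch of an alarm on the queue-depth metric, assuming Python with boto3. The function names, alarm name, queue name, and SNS topic ARN are placeholders, and the period and threshold are illustrative choices rather than recommendations. The parameters are built separately from the API call so they can be inspected without AWS access.

```python
def dlq_alarm_params(queue_name, sns_topic_arn, threshold=0):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the DLQ holds more than `threshold` visible messages.
    """
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        # Queue depth, not NumberOfMessagesSent, which misses automatic redrives.
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,              # seconds per evaluation window (assumed)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle queue may report no data; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_dlq_alarm(queue_name, sns_topic_arn):
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**dlq_alarm_params(queue_name, sns_topic_arn))
```

Calling create_dlq_alarm("orders-dlq", topic_arn) with your own queue name and topic ARN would register the alarm. Because the alarm tracks depth rather than arrivals, it stays in ALARM until someone drains or redrives the queue, which may be part of why teams find it easier to trust.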

Answered By BudgetManager On

Definitely feels like there should be a better solution than just shelling out a ton of cash for Datadog, especially for smaller teams like ours.

Answered By DataWizard123 On

We use Datadog too, but since we also need it for security information and event management (SIEM), we collect logs only once and split the cost with our security team. Datadog can be pricey, but the costs are manageable. Here are a few tips:

1) Send only what you really need to keep ingestion down, and drop unnecessary logs at the index level to save on indexing costs.
2) Keep logs for a shorter duration if they're only used for alerts. Retaining them for just 3 days can save money.
3) Consider a one-year contract for a lower rate if your usage is steady.

The default retention is 30 days, which can get expensive, so set up a separate index with shorter retention for alert logs.
