I'm part of a small to medium-sized team managing multiple AWS accounts under one organization. We have quite a setup: over 100 SQS queues and SNS topics, various Lambdas, ECS services, a few legacy bare-metal services, and API Gateway in front of some of it. When something goes wrong, such as messages landing in a Dead Letter Queue (DLQ) or a burst of 5xx errors, tracing the issue back to its source is a daunting task. Our usual process is to log into the relevant AWS account, find the DLQ, determine its primary queue, and work out which producer sent the message, whether that producer lives in another account or is an internal service. Switching between accounts and digging through logs can take forever. I'm looking for ways to streamline this process, so any advice would be really appreciated!
2 Answers
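Centralized logging and distributed tracing are crucial for an architecture like yours. Consider OpenTelemetry. It isn't built specifically for multi-account AWS setups (it's a vendor-neutral standard for traces, metrics, and logs), but because it propagates trace context from one service to the next, it lets you follow a single request or message across queues, Lambdas, ECS services, and accounts once everything exports to one shared backend.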
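To make "propagating trace context" a bit more concrete: by default OpenTelemetry passes a W3C Trace Context `traceparent` value between services. Below is a hand-rolled sketch of that value in plain Python, for illustration only. The function names are made up for this example, and in a real setup the OpenTelemetry SDK and its instrumentation would generate and forward this for you.

```python
import re
import secrets

# W3C Trace Context "traceparent": version-traceid-parentid-flags
# e.g. 00-<32 hex chars>-<16 hex chars>-01
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace (hypothetical helper; the SDK normally does this)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> dict:
    """Split a traceparent value into its parts, rejecting malformed input."""
    match = _TRACEPARENT_RE.match(value)
    if match is None:
        raise ValueError(f"not a valid traceparent: {value!r}")
    trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming: str) -> str:
    """Value a service would send downstream: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The point is that the trace ID stays the same at every hop while the span ID changes, so a tracing backend can stitch the hops into one timeline regardless of which account each service runs in.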
You might find it helpful to implement a `correlationId` or `sessionId` across all your services. Generate the ID once, at the start of each request or workflow, then include it in every log line and pass it along with every message you publish. That way, when a message lands in a DLQ or a request fails, you can search for that ID in your logging tool and get a clear picture of the entire workflow leading up to the problem.
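Here's a rough sketch of what that could look like in Python, assuming the ID travels as an SQS message attribute and logs are written as JSON. The attribute name `correlationId` and all helper names are just placeholders for this example, not anything AWS prescribes.

```python
import json
import logging
import uuid

CORRELATION_ATTR = "correlationId"  # assumed attribute name, pick your own


def new_correlation_id() -> str:
    """Create an ID at the entry point of a workflow."""
    return str(uuid.uuid4())


def build_message_attributes(correlation_id: str) -> dict:
    """Attributes in the shape boto3 expects for send_message / publish."""
    return {
        CORRELATION_ATTR: {"DataType": "String", "StringValue": correlation_id}
    }


def extract_correlation_id(record: dict) -> str:
    """Read the ID from one record of a Lambda SQS event.

    Falls back to a fresh ID so downstream logs are still searchable.
    """
    attrs = record.get("messageAttributes") or {}
    value = (attrs.get(CORRELATION_ATTR) or {}).get("stringValue")
    return value or new_correlation_id()


def log_event(logger: logging.Logger, correlation_id: str, message: str, **fields) -> str:
    """Emit one JSON log line that always carries the correlation ID."""
    line = json.dumps({"correlationId": correlation_id, "message": message, **fields})
    logger.info(line)
    return line


# Producer side (needs boto3 and credentials, so shown as a comment only):
#   sqs.send_message(QueueUrl=queue_url, MessageBody=body,
#                    MessageAttributes=build_message_attributes(cid))
```

A consumer would call `extract_correlation_id(record)` for each record it receives and reuse that value in its own logs and in anything it publishes further downstream. One thing to watch: if a queue is fed by an SNS subscription without raw message delivery, the attributes arrive inside the JSON body rather than in `messageAttributes`, so the extraction would need adjusting for that case.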

That sounds really useful! Do you happen to use CloudWatch for your logs? I'm considering that.