I'm experiencing intermittent 502 errors with my API setup: an API Gateway (HTTP API v2) in front of an Application Load Balancer (ALB), which routes to an ECS Fargate service. The errors occur on roughly 0.5% of requests, mostly during peak traffic. The backend is a Node.js API that talks to an RDS Aurora database.

To address the errors, I've already optimized slow queries, upgraded my RDS instance, removed the RDS Proxy to connect directly to the Aurora cluster, and increased my ECS task sizes, yet the errors persist. Notably, there are no corresponding logs in the ECS service for these 502 errors, and they don't correlate with CPU, memory, or database usage spikes. Here's a sample APIG log entry and its corresponding ALB log entry for your reference.
4 Answers
I dealt with a similar issue using Node.js clusters. Occasionally an uncaught error would kill one of the worker processes, and the cluster would spawn a replacement. Requests arriving before the new worker was fully ready produced 502 errors without any logs. Do your logs show any worker failures?
Have you checked for any scaling or draining events in your service? Sometimes those can affect connectivity without showing obvious signs.
No, I ruled that out first. There's no auto scaling, and we deploy at a set time weekly, so there's no correlation with those 5xx errors.
If you look at your target group monitoring, do those 5xx errors show up there? The ALB emits two separate CloudWatch metrics: HTTPCode_Target_5XX_Count (errors returned by your targets) and HTTPCode_ELB_5XX_Count (errors generated by the ALB itself). If the 502s appear only in the ELB metric, the failing requests never reach your container. Remember, the flow is ALB -> target group -> containers.
I faced something similar with a Flask app behind Gunicorn. It turned out that if your application's keep-alive timeout is shorter than the ALB's idle timeout (60 seconds by default), your app can close an idle connection without the ALB realizing it. When the ALB then reuses that dead connection, it returns a 502 to the client while your app logs nothing. Setting the application keep-alive timeout to 65 seconds, slightly above the ALB's, fixed it.

Are there specific log entries I should keep an eye out for, like a shutdown message or initialization logs for Node.js?