I'm currently consulting for a large AWS customer in Europe that runs a significant distributed system designed as a modular microlith, mainly in Node.js. The setup consists of several microservices, each comprising multiple business units loaded as modules. Since the workload is very latency-sensitive, we group modules that frequently communicate into the same microservice. They are currently running about 5,000 to 6,000 Fargate tasks, with interservice HTTP latency within the same availability zone averaging around 8-15 ms.
Is this latency range typical? What latencies do others see between containers? And what strategies could help reduce it? I realize it's hard to push for changes in a large organization; a proposal to use placement groups, for example, has been stalled for two years. I'd appreciate any advice on how to tackle this latency effectively.
5 Answers
Latency within the same AZ is usually less than 1 ms! You should double-check whether your business logic, rather than the network, accounts for most of that overall service time. AWS even provides network metrics for this, which could tell you whether the latency is inherent to your setup or a sign of a deeper issue.
Your latency might not be out of the ordinary, depending on how you're measuring it. Are the metrics averaged across instances, zones, or regions? If a big chunk of traffic goes inter-region, that could explain the higher numbers. With Node.js it's also crucial to consider how latency is being assessed: third-party modules in the request path can add latency if they're not optimized, and CPU contention can inflate response times as well. So knowing the context of your metrics is essential to pin down whether there's a real issue or this is normal.
I’m only looking at latency within the same region, so that’s definitely something to keep in mind.
Have you validated that your latency measurements are accurate? Also, if your services are deployed across multiple AZs, how are you managing service calls to avoid crossing AZs? Proper configuration can greatly reduce latency.
For the 8-15 ms latency, we need more details. Are you using TLS? A TLS handshake on every request adds measurable latency, so plain HTTP, or reused connections, is usually faster. Also consider client-side load balancing instead of a load balancer for better performance. Calls that cross availability zones add latency too. I've seen setups where containers in the same AZ respond in less than 1 ms, so it's worth investigating where those calls are going.
We’re using HTTPS for service calls. It seems like that could be a factor in the latency.
For a more accurate understanding, try setting up a simplified version of your architecture. Start with an empty setup to see the baseline latency, then progressively reintroduce your features to pinpoint where the delays begin to show up. It might help clarify whether you're dealing with a networking issue or some higher-level configuration.

Good point. I didn't know about those metrics until now, so I'll look into them.