Trouble with RDS Proxy Connection Management

Asked By CuriousCoder99

I'm trying to troubleshoot an issue with an infrastructure setup that uses RDS Proxy. The architecture: on the synchronous side, API Gateway (HTTP API v2) goes through an ALB to Fargate (ECS), which connects to the RDS database via RDS Proxy. There's also an asynchronous workflow where the synchronous side publishes requests to EventBridge/SQS, and Lambdas process them, often making external API calls and accessing the database through the same RDS Proxy. The problem: we're getting intermittent 5xx errors on the synchronous side. Sometimes Fargate takes too long to respond and the ALB times out; other times it's slow database queries, despite our attempts to optimize them.

What puzzles me is this: pinned proxy connections (DatabaseConnectionsCurrentlySessionPinned) track borrowed connections (DatabaseConnectionsCurrentlyBorrowed) almost one-to-one, suggesting the proxy isn't multiplexing and is behaving like a pass-through. Also, client connections into the proxy (from Lambda/Fargate) are much lower than the proxy's connections to the database, which again points to a lack of multiplexing or connection reuse. On top of that, CloudWatch reports the proxy's maximum allowed connections (MaxDatabaseConnectionsAllowed) hovering around 500, yet DatabaseConnections never goes beyond 120. If we were hitting the 500 ceiling, that would be an easy fix, but there's clearly room to scale; why isn't that happening?

For added context: connection_borrow_timeout is 120 seconds, max_connections_percent is 100, max_idle_connections_percent is 50, and session_pinning_filters includes EXCLUDE_VARIABLE_SETS. I've heard that moving away from prepared statements can reduce the session pinning rate, but that still doesn't explain why we aren't using the available connections, which leads to occasional 5xx errors when Lambdas can't acquire a connection within the borrow timeout.
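
In case it's useful, here's roughly how I'm pulling those numbers. A boto3 sketch, where "my-rds-proxy" is a placeholder and I'm assuming the proxy-level ProxyName dimension applies to these AWS/RDS metrics:

```python
# Sketch: pull the RDS Proxy metrics discussed above for the last hour.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

METRICS = [
    "ClientConnections",                         # app -> proxy
    "DatabaseConnections",                       # proxy -> RDS
    "DatabaseConnectionsCurrentlyBorrowed",      # checked out by client sessions
    "DatabaseConnectionsCurrentlySessionPinned", # borrowed AND pinned (no multiplexing)
    "MaxDatabaseConnectionsAllowed",             # the ~500 ceiling I mentioned
]

now = datetime.now(timezone.utc)
for name in METRICS:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=name,
        Dimensions=[{"Name": "ProxyName", "Value": "my-rds-proxy"}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=["Average", "Maximum"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    if points:
        latest = points[-1]
        print(f"{name}: avg={latest['Average']:.1f} max={latest['Maximum']:.1f}")
```

The near-identical pinned and borrowed series from this output are what make me think the proxy is acting as a pass-through.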

2 Answers

Answered By TechWhiz42

Are you using MySQL? Either way, keep in mind that RDS Proxy only really multiplexes under connection pressure: it starts reusing database connections when the pool is genuinely constrained, and until then it can behave like a 1:1 pass-through. That would explain why your connection counts aren't scaling even though headroom exists. I'd monitor idle connections more closely to see how often they approach the max_idle_connections_percent threshold; see the rough arithmetic below.
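
To put rough numbers on that pressure idea, using the figures from your post (plain arithmetic, based on my reading of how max_idle_connections_percent behaves):

```python
# Back-of-the-envelope check using the numbers from the question.
max_allowed = 500              # MaxDatabaseConnectionsAllowed reported by CloudWatch
max_idle_pct = 50              # max_idle_connections_percent setting
observed_db_connections = 120  # peak DatabaseConnections

# The proxy starts closing idle connections only above this share of the pool.
idle_ceiling = max_allowed * max_idle_pct / 100
print(f"Idle-connection ceiling: {idle_ceiling:.0f}")                         # -> 250
print(f"Headroom before pressure: {max_allowed - observed_db_connections}")   # -> 380

# With only ~120 connections open against a 500 ceiling and a 250 idle cap,
# the pool never feels pressure, so the proxy has little reason to reclaim
# or share connections, which is consistent with pass-through behavior.
```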

DatabaseGuru21 -

That makes sense, especially if your metrics are showing low connection usage. Have you checked for any spikes in connections that last less than a minute? Sometimes, these aren't visible in the minute-based metrics and could be causing the issues you're seeing.
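
One way to check is to compare the ClientConnections gauge against ClientConnectionsReceived, which counts connection attempts per period; a burst of connections that open and close inside a minute inflates the count without moving the gauge. A boto3 sketch, with the proxy name as a placeholder:

```python
# Sketch: compare the point-in-time gauge with the per-minute connection count.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "gauge",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "ClientConnections",
                    "Dimensions": [{"Name": "ProxyName", "Value": "my-rds-proxy"}],
                },
                "Period": 60,
                "Stat": "Average",
            },
        },
        {
            "Id": "received",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "ClientConnectionsReceived",
                    "Dimensions": [{"Name": "ProxyName", "Value": "my-rds-proxy"}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
        },
    ],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
)

# A minute where "received" spikes while "gauge" stays flat means lots of
# short-lived connections that the one-minute snapshots never show.
for result in resp["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"]))[:5])
```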

Answered By CloudNinja88

It sounds like connection pressure is the key here. Even with max_connections_percent at 100, the proxy won't aggressively reuse or expand connections until the pool is actually under pressure, and with max_idle_connections_percent at 50 it reclaims idle connections above that share. If your traffic is bursty, that combination can lead to borrow timeouts when Lambdas request connections. It's worth checking whether you hit those thresholds just before the errors occur.
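
If you do end up tuning the pool, it's a single API call. A boto3 sketch, where the proxy name is a placeholder and the values simply mirror your current settings rather than a recommendation:

```python
# Sketch: adjust the proxy's connection pool settings via boto3.
# "my-rds-proxy" is a placeholder; "default" is currently the only
# target group name RDS Proxy supports.
import boto3

rds = boto3.client("rds")

rds.modify_db_proxy_target_group(
    DBProxyName="my-rds-proxy",
    TargetGroupName="default",
    ConnectionPoolConfig={
        "MaxConnectionsPercent": 100,     # ceiling as % of the DB's max_connections
        "MaxIdleConnectionsPercent": 50,  # idle connections above this share get closed
        "ConnectionBorrowTimeout": 120,   # seconds a request waits for a connection
        "SessionPinningFilters": ["EXCLUDE_VARIABLE_SETS"],
    },
)
```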

CuriousCoder99 -

That's what I'm worried about! The errors seem random for now, but with more traffic, we might run into scaling issues. Waiting for things to get really busy before the proxy decides to act could be problematic!
