I've noticed that some of our users run pods that do heavy computational tasks. Our web app queries these pods and waits for a response. However, some queries are taking longer than the default 30-second timeout for pod ingress, so I increased the timeout to 60 seconds. Users are still experiencing timeouts, and they've now asked for an hour! This feels like it could lead to issues, but I'm not sure what the implications of having 1-hour ingress timeouts could be, especially with 3 to 10 pods running simultaneously. What are the potential downsides of this approach?
6 Answers
The design of your application's endpoint might need to change. Instead of making it synchronous, where the frontend waits for a response, consider an asynchronous approach. You could return immediately with a token for the request. Then, when the processing is finished, the backend can push an event back to the client via websocket or server-sent events. This way, you avoid keeping connections open unnecessarily, which is much more efficient and less resource-intensive.
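A minimal sketch of that token-plus-push pattern, assuming a FastAPI backend with an in-memory job store (the endpoint paths, `run_heavy_task`, and the store are illustrative, not part of the original setup):

```python
import asyncio
import uuid

from fastapi import FastAPI, WebSocket

app = FastAPI()
results: dict[str, str] = {}           # job_id -> result (in-memory store, illustrative only)
events: dict[str, asyncio.Event] = {}  # job_id -> "result is ready" signal

async def run_heavy_task(job_id: str) -> None:
    """Stand-in for the long-running computation a pod would perform."""
    await asyncio.sleep(120)           # placeholder for the real work
    results[job_id] = "done"
    events[job_id].set()

@app.post("/jobs")
async def submit_job():
    # Return immediately with a token; the HTTP connection closes right away,
    # so the ingress timeout never comes into play.
    job_id = uuid.uuid4().hex
    events[job_id] = asyncio.Event()
    asyncio.create_task(run_heavy_task(job_id))
    return {"job_id": job_id}

@app.websocket("/jobs/{job_id}/events")
async def job_events(websocket: WebSocket, job_id: str):
    # The client connects here and waits for the completion event to be pushed.
    await websocket.accept()
    await events[job_id].wait()
    await websocket.send_json({"job_id": job_id, "result": results[job_id]})
    await websocket.close()
```

A real deployment would keep the job store in something shared like Redis so any replica can answer, but the shape of the flow is the same: submit, get a token, get notified.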
One major concern is the number of TCP connections that your load balancer or proxy has to maintain. If a lot of pods have long timeouts, it means the proxy has to keep all those connections open, which can eat up resources like memory and file descriptors. This won't be a huge issue if your traffic is low, but with high request volumes, it could really impact performance. Additionally, if the proxy restarts, those long-running queries may still hold server resources without a client on the other end.
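As a rough back-of-envelope illustration of that cost (the per-connection figures below are assumptions for the sake of the example, not measurements from any particular proxy):

```python
# Illustrative arithmetic only: real per-connection costs depend on the proxy.
CONCURRENT_SLOW_REQUESTS = 500        # hypothetical peak of in-flight long requests
FDS_PER_REQUEST = 2                   # one client-facing and one upstream socket
BUFFER_BYTES_PER_CONN = 64 * 1024     # assumed proxy buffer allocation per connection

open_fds = CONCURRENT_SLOW_REQUESTS * FDS_PER_REQUEST
buffer_mib = CONCURRENT_SLOW_REQUESTS * BUFFER_BYTES_PER_CONN / (1024 * 1024)
print(f"~{open_fds} file descriptors and ~{buffer_mib:.0f} MiB of buffers held for up to an hour")
```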
We have a 'slow' ingress setup for exactly this reason. Sometimes a service just takes a while to respond, and you don't want to write a bunch of complex polling logic around it. For those edge cases, a long timeout is an acceptable trade-off.
This setup sounds problematic from a design standpoint. Keeping a TCP connection open that long goes against best practice, and many intermediaries (NAT gateways, firewalls, load balancers) will silently drop connections that sit idle for more than a few minutes anyway. Switching to a websocket model backed by a task queue, with retries handled explicitly, would be a much better option.
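A bare-bones sketch of the task-queue-with-retries part, using only the standard library (names like `heavy_query` and `MAX_RETRIES` are made up for illustration; a real deployment would use a proper queue such as Celery or RQ):

```python
import queue
import threading
import time

MAX_RETRIES = 3
tasks: queue.Queue = queue.Queue()

def heavy_query(payload: str) -> str:
    """Placeholder for the long-running computation a pod performs."""
    time.sleep(5)
    return f"result for {payload}"

def worker() -> None:
    while True:
        payload, attempt = tasks.get()
        try:
            result = heavy_query(payload)
            print(f"{payload}: {result}")          # in practice, push this to the client over a websocket
        except Exception:
            if attempt < MAX_RETRIES:
                time.sleep(2 ** attempt)           # simple exponential backoff
                tasks.put((payload, attempt + 1))  # re-queue for another attempt
            else:
                print(f"{payload}: giving up after {MAX_RETRIES} retries")
        finally:
            tasks.task_done()

threading.Thread(target=worker, daemon=True).start()
tasks.put(("query-1", 0))
tasks.join()
```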
Using sidecars can help manage this situation: they give you finer control over TLS rules and ingress management on a per-service basis. Just be sure to have a solid change approval process in place.
Honestly, asking for a 1-hour timeout isn't the right move. If these tasks take that long, the frontend really needs to rethink how it handles responses. Polling for updates every few seconds would work and gives you a better overall architecture than holding long-lived responses open and running into connection issues.
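A rough sketch of what that polling could look like on the client side (the `/jobs/<id>` status endpoint and `BASE_URL` are hypothetical, just to show the shape of it):

```python
import time
import requests

BASE_URL = "http://heavy-compute.internal"  # illustrative service address

def wait_for_result(job_id: str, interval: float = 5.0, max_wait: float = 3600.0) -> dict:
    """Poll a hypothetical /jobs/<id> status endpoint until the job finishes."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        resp = requests.get(f"{BASE_URL}/jobs/{job_id}", timeout=10)  # each poll is a short request
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "done":
            return body
        time.sleep(interval)  # no connection is held open between polls
    raise TimeoutError(f"job {job_id} did not finish within {max_wait} seconds")
```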
Very good points here! Having long sessions can be common; just look at websocket connections. They can be active for a very long time.