I run a Python service on AWS ECS that hosts AI agent conversations built with LangChain. Some conversations run for 30 minutes or more while the agent works through a problem. When I deploy a new version, ECS terminates the old container mid-conversation, which frustrates users who have already waited a long time for a response.
Here's my setup:
- A single ECS task utilizing Service Discovery (AWS Cloud Map).
- Rolling deployments, with Blue/Green deployments being blocked because of Service Discovery.
- stopTimeout is already set to its 120-second maximum, which isn't nearly enough time.
I'm looking for suggestions on how other developers manage similar services without complicating the deployment process too much. Any advice?
3 Answers
We faced a similar situation at BlueTalon with lengthy batch processing. What worked for us was a drain mode: the service stops accepting new requests while it continues to process the ones already in flight. We added a dedicated health check endpoint that tells the load balancer the service is still alive but should not receive new work. Our deployment script then waits until all active jobs have finished before it shuts down the container. It takes some extra setup, but it lets you deploy without cutting off user interactions.
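The drain mode described above could look roughly like the following. This is a minimal sketch, not the BlueTalon implementation: the `DrainController` class, its method names, and the shape of the health payload are all hypothetical, and wiring it to an actual HTTP health endpoint is left out.

```python
import threading


class DrainController:
    """Hypothetical helper: tracks in-flight jobs and a draining flag."""

    def __init__(self):
        self._lock = threading.Lock()
        # The condition shares the lock, so waiters see consistent counts.
        self._idle = threading.Condition(self._lock)
        self._active = 0
        self._draining = False

    def start_drain(self):
        """Stop admitting new jobs; existing jobs keep running."""
        with self._lock:
            self._draining = True

    def try_begin_job(self):
        """Return True if a new job may start, False once draining."""
        with self._lock:
            if self._draining:
                return False
            self._active += 1
            return True

    def end_job(self):
        """Mark one job finished and wake any waiter when none remain."""
        with self._idle:
            self._active -= 1
            if self._active == 0:
                self._idle.notify_all()

    def health(self):
        """Payload for an assumed health endpoint: alive, but maybe not accepting."""
        with self._lock:
            return {
                "alive": True,
                "accepting": not self._draining,
                "active_jobs": self._active,
            }

    def wait_until_idle(self, timeout=None):
        """Block until no jobs are active; returns False if the timeout expires."""
        with self._idle:
            return self._idle.wait_for(lambda: self._active == 0, timeout=timeout)
```

A deployment script would call something like `start_drain()` through an admin endpoint, poll the health payload until `active_jobs` reaches zero, and only then stop the old task.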
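When a container receives a SIGTERM signal, that's your cue to shut it down gracefully. ECS only gives you a limited window for this: once stopTimeout expires, the container is killed with SIGKILL. On Fargate, stopTimeout is capped at 120 seconds, so you can't extend it there. On the EC2 launch type you may be able to get a longer window through the container agent's ECS_CONTAINER_STOP_TIMEOUT setting, but check the current ECS documentation for the limits. Also consider deploying off-peak to reduce disruption, or moving to an event-driven architecture where long-running tasks are handled independently of the service being deployed.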
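As a rough illustration of reacting to SIGTERM in Python, the sketch below sets a flag in the signal handler and checks it between units of work. The names `install_sigterm_handler` and `run_agent_steps` are hypothetical, and it assumes the agent's work can be split into discrete steps, which may not hold for a single long LLM call.

```python
import signal
import threading

# Set once SIGTERM arrives; worker code checks it between steps.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    shutdown_requested.set()


def install_sigterm_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_agent_steps(steps):
    """Run callables in order, stopping early once shutdown is requested.

    Returns the results of the steps that completed, so the caller can
    persist partial progress before the process exits.
    """
    completed = []
    for step in steps:
        if shutdown_requested.is_set():
            break
        completed.append(step())
    return completed
```

Call `install_sigterm_handler()` once at startup. Note that the process must actually receive the signal: if the container's entrypoint is a shell script that does not forward signals, the Python process may never see SIGTERM.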
Think about where the conversation data lives. If you're not saving conversation state anywhere, that's a major design issue. You could store the conversation in S3 or a database. Even with checkpoints, though, part of the problem remains: if SIGTERM arrives while the agent is generating a response, that in-progress response is still lost. So the critical point is to persist state often enough that a conversation can resume on the new container after a deployment.
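A minimal sketch of checkpointing conversation state, using SQLite from the standard library as a stand-in for "a database". The `ConversationStore` class, the table layout, and the message format are assumptions for illustration; a service running on ECS would need external storage (for example a managed database or S3) so that the replacement container can read what the old one wrote.

```python
import json
import sqlite3


class ConversationStore:
    """Hypothetical checkpoint store: one JSON blob of messages per conversation."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS conversations ("
            "id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
        )

    def save(self, conversation_id, messages):
        """Overwrite the checkpoint for this conversation."""
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO conversations (id, messages) VALUES (?, ?)",
                (conversation_id, json.dumps(messages)),
            )

    def load(self, conversation_id):
        """Return the saved messages, or an empty list if none exist."""
        row = self._db.execute(
            "SELECT messages FROM conversations WHERE id = ?",
            (conversation_id,),
        ).fetchone()
        return json.loads(row[0]) if row else []
```

Saving after every completed agent step keeps the loss on an interrupted deployment to at most the step that was in progress.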
