I'm working on a cost-effective chatbot API that serves an 80 GB open-source model. My strategy is to launch a GPU instance only when traffic arrives, then shut it down after a brief period of inactivity to save on costs. However, I keep hitting "insufficient capacity" errors when starting my stopped p5.48xlarge instance in the Tokyo region, which makes this approach unreliable. I'm curious how others run AI inference APIs on AWS without constantly burning cash.
- Are you successfully using on-demand GPU instances with auto start/stop?
- Or do you just leave them running 24/7?
- Have you found any solutions for the EC2 capacity issues?
For reference, I've never faced this problem with other GPU cloud providers, where instances reliably start on demand. I'd love to hear any tips or experiences you have!
1 Answer
To work around the capacity issues, configure a mixed instances policy on your Auto Scaling group, referencing your launch template and listing several acceptable instance types. That way, if one type has no capacity, an alternative can still be launched. A message queue also helps: scale the instance count on queue depth, and instead of shutting instances down immediately, let idle workers keep polling for new messages for a while. Finally, enable scale-in protection while an instance is processing a request, so it isn't terminated mid-job.
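A rough sketch of the mixed-instances setup with boto3 (the group name, template name, subnet ID, and fallback instance types below are all placeholders, not a recommendation):

```python
def build_mixed_instances_policy(launch_template_name, instance_types):
    """Build the MixedInstancesPolicy payload for an Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Each override is a candidate type, in priority order; the ASG
            # can fall back to the next one when a type has no capacity.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandAllocationStrategy": "prioritized",
        },
    }


def create_inference_asg():
    """Create the Auto Scaling group (all names here are hypothetical)."""
    import boto3  # imported lazily so the policy builder has no dependencies

    autoscaling = boto3.client("autoscaling")
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="inference-workers",
        MixedInstancesPolicy=build_mixed_instances_policy(
            "inference-template",
            # Illustrative GPU types -- pick ones your model actually fits on.
            ["p4d.24xlarge", "p5.48xlarge", "g5.48xlarge"],
        ),
        MinSize=0,
        MaxSize=2,
        VPCZoneIdentifier="subnet-0123abcd",  # placeholder subnet
    )
```

Setting `MinSize=0` lets the group scale to zero when the queue is empty, which is what keeps the idle cost down.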
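The polling-worker side could look something like this. The function names and idle threshold are illustrative: in practice `poll_message` would wrap SQS `receive_message` with long polling, and `set_protected` would call the Auto Scaling `SetInstanceProtection` API for the current instance.

```python
def run_worker(poll_message, handle_message, set_protected, max_idle_polls=30):
    """Keep polling the queue; stop only after a long stretch of idle polls.

    poll_message:   returns the next message, or None if the queue was empty.
    handle_message: runs inference for one message.
    set_protected:  toggles scale-in protection for this instance.
    """
    idle_polls = 0
    while idle_polls < max_idle_polls:
        msg = poll_message()
        if msg is None:
            idle_polls += 1   # nothing to do; count down toward shutdown
            continue
        idle_polls = 0        # any work resets the idle window
        set_protected(True)   # don't let the ASG terminate us mid-request
        try:
            handle_message(msg)
        finally:
            set_protected(False)
    # Falling out of the loop means the instance is safe to scale in.
```

The key detail is toggling protection around each message rather than once at startup, so the instance is always eligible for scale-in exactly when it's idle.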

That makes sense! I was using a memory-hungry model too, and AWS doesn't offer a single-A100 instance, so I may have to look at models with smaller memory requirements to get more instance-type options. Thanks for the tip!