I'm currently using RunPod to serve AI models, but their serverless option has been too unreliable for my production needs. I know AWS doesn't provide serverless GPU computing out of the box, so I'm wondering if I can set up a solution where a Lambda function triggers an EC2 or Spot instance to run a FastAPI server for inference, then automatically shuts down the instance after I get the response. I need this to work for multiple users at the same time. My plan is to use Boto3 for this setup. Is this a workable solution, or is there a better approach I should consider?
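For reference, here's roughly the flow I had in mind with Boto3 (the AMI, instance type, and /predict endpoint are just placeholders for a pre-baked image that starts the FastAPI server on boot; I haven't tested this end to end):

```python
import json
import time
import urllib.request

import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # Launch a GPU Spot instance from a pre-baked AMI that starts the
    # FastAPI server on boot (AMI ID and instance type are placeholders).
    run = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="g4dn.xlarge",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={"MarketType": "spot"},
    )
    instance_id = run["Instances"][0]["InstanceId"]

    try:
        # Wait for the instance to be running, then grab its public IP.
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        desc = ec2.describe_instances(InstanceIds=[instance_id])
        ip = desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]

        # Crude wait for the FastAPI server to come up; presumably I'd
        # poll a /health endpoint instead of sleeping a fixed amount.
        time.sleep(90)

        req = urllib.request.Request(
            f"http://{ip}:8000/predict",
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            return json.loads(resp.read())
    finally:
        # Always terminate so the GPU isn't left running after the response.
        ec2.terminate_instances(InstanceIds=[instance_id])
```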
4 Answers
I've had customers ask for a RunPod-like experience on AWS, but without always-on servers it's tough. GPU availability is a huge concern: you can't guarantee on-demand capacity, and Spot instances make it even trickier. To improve responsiveness, consider a pub-sub architecture where the front-end posts a message to a queue and a worker picks it up, runs inference, and sends the result back. I've been using EKS with HPA and Karpenter for similar workloads: HPA scales the workers based on queue metrics, and Karpenter provisions whatever GPU capacity is actually available. This might help you avoid capacity issues.
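The worker side of that pub-sub pattern could look something like this sketch (plain SQS for illustration; queue URLs and the model call are placeholders, and the EKS/HPA/Karpenter pieces are configured separately):

```python
import json

import boto3

sqs = boto3.client("sqs")

# Placeholder queue URLs; the front-end writes requests to one and
# reads results from the other, correlated by a request_id it supplies.
REQUEST_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-requests"
RESULT_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-results"

def run_inference(payload):
    # Placeholder for the actual model call served by the worker.
    return {"prediction": "..."}

def main():
    # Each worker pod runs this loop; HPA scales the number of pods
    # based on queue depth, and Karpenter provisions the nodes.
    while True:
        resp = sqs.receive_message(
            QueueUrl=REQUEST_QUEUE,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling to avoid busy-waiting
        )
        for msg in resp.get("Messages", []):
            payload = json.loads(msg["Body"])
            result = run_inference(payload)
            sqs.send_message(
                QueueUrl=RESULT_QUEUE,
                MessageBody=json.dumps(
                    {"request_id": payload.get("request_id"), "result": result}
                ),
            )
            # Delete only after the result is posted, so failures get retried.
            sqs.delete_message(
                QueueUrl=REQUEST_QUEUE, ReceiptHandle=msg["ReceiptHandle"]
            )

if __name__ == "__main__":
    main()
```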
Starting an EC2 instance can take a few minutes, and that's before the model is even loaded, so users expecting quick responses are going to feel that wait. If interactive latency matters, you might want to reconsider the timing aspect of this design.
If you’re using Spot instances, are you really going serverless? Serverless typically means not managing virtual machines. Even though you might only be handling the EC2 instances temporarily, it's still more management than a true serverless setup. But hey, it’s just semantics, right?
You could have your API server send a message to SQS, then use EventBridge to trigger ECS tasks. ECS can use GPUs as well (with the EC2 launch type), and that way you only spin up infrastructure when it's needed. Good luck with it!
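A run_task call for that kind of GPU task might look roughly like this; the cluster, task definition, and container names are placeholders, and the GPU requirement itself would be declared in the task definition's container resourceRequirements:

```python
import boto3

ecs = boto3.client("ecs")

def launch_inference_task(payload_s3_uri):
    # Run one GPU inference task on an EC2-backed ECS cluster.
    # The task definition's container is assumed to declare
    # {"type": "GPU", "value": "1"} under resourceRequirements so ECS
    # places it on a GPU-capable container instance.
    return ecs.run_task(
        cluster="inference-cluster",
        taskDefinition="gpu-inference:1",
        launchType="EC2",
        overrides={
            "containerOverrides": [
                {
                    "name": "inference",
                    # Tell the container where to find the request payload.
                    "environment": [
                        {"name": "PAYLOAD_URI", "value": payload_s3_uri},
                    ],
                }
            ]
        },
    )
```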
Starting, executing one call, and shutting down an EC2 might not be considered full management, haha.