I'm working with a small team on large language model inference using AWS, but we're struggling with frequent spot instance interruptions. We've tried various solutions like capacity-optimized ASGs and fallback strategies, as well as checkpointing, but these still fail when low latency is crucial. Reserved Instances don't provide the flexibility we need, and on-demand pricing is tough on our budget. Is there a reliable way to use AWS that allows us to keep our workloads stable while also being mindful of costs?
6 Answers
Could you explain how your current setup handles interruption notices? It sounds like you need a system that drains or migrates work as soon as a notice arrives, before the instance is actually reclaimed. You might explore AWS FIS to simulate interruptions and check out Ray on Kubernetes (KubeRay). That framework has features like fault-tolerant task APIs, checkpointing for long-running tasks, and the ability to save progress when your process gets a termination signal. You can also tweak things like `terminationGracePeriodSeconds` to give your pods some time to handle shutdown.
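To make the "save progress on a termination signal" idea concrete, here is a minimal sketch in plain Python. It is not Ray's or KubeRay's API: `GracefulShutdown`, `serve`, and the three callbacks are hypothetical names used only for illustration. The idea is that SIGTERM (what Kubernetes sends at the start of the grace period) only sets a flag, and the serving loop finishes the request in flight, stops taking new ones, and then saves state.

```python
import signal
import threading


class GracefulShutdown:
    """Turns SIGTERM into a flag that the serving loop can check."""

    def __init__(self):
        self.stop_requested = threading.Event()

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum, frame):
        # Keep the handler tiny: just record that shutdown was requested.
        self.stop_requested.set()


def serve(shutdown, next_request, handle_request, save_progress):
    """Handle requests until the queue is empty or a stop is requested.

    next_request, handle_request and save_progress are placeholders for
    your own queue, model call and checkpoint logic.
    """
    handled = 0
    while not shutdown.stop_requested.is_set():
        request = next_request()
        if request is None:
            break
        handle_request(request)  # the in-flight request always completes
        handled += 1
    save_progress(handled)  # runs inside the grace period
    return handled
```

In a pod, you would call `install()` once at startup and size `terminationGracePeriodSeconds` so that the longest request plus `save_progress` fits inside it.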
Here's my take: If you can't tolerate interruptions at all, then relying on spot instances may not be the best choice. They're not a one-size-fits-all solution.
Are you requesting just one instance type or several? If you request a variety of instance types that can all do the job, it should improve your chances of getting and keeping capacity. Just keep in mind that with the AI boom, finding affordable GPU options is a real challenge!
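As a rough sketch of what diversifying could look like, the helper below builds a mixed-instances policy for an Auto Scaling group: several interchangeable instance types, plus a small on-demand base so some capacity survives a wave of interruptions. The function name, the template name, and the instance types in the usage note are illustrative assumptions, not a recommendation for your model.

```python
def mixed_instances_policy(template_name, instance_types,
                           on_demand_base=1, on_demand_pct_above_base=0):
    """Build a MixedInstancesPolicy dict for an EC2 Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            # One override per instance type that can run the workload.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Always-on on-demand floor; everything above it is Spot
            # when the percentage below is 0.
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }
```

For example, `mixed_instances_policy("llm-inference", ["g5.xlarge", "g6.xlarge", "g4dn.xlarge"])` gives a dict that can be passed as the `MixedInstancesPolicy` argument when creating the group with boto3's Auto Scaling client. The on-demand base is the part that trades a bit of cost for stability.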
If you route the Spot interruption warnings from EventBridge into an SQS queue, a consumer can react to them (drain the node, stop routing traffic to it) before you get fully impacted. That's one strategy to check out!
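A minimal sketch of the consumer side, assuming an EventBridge rule already forwards "EC2 Spot Instance Interruption Warning" events to the queue unchanged. `drain_queue_once` and `on_interruption` are hypothetical names; `sqs` is expected to be a boto3 SQS client (for example `boto3.client("sqs")`), passed in so the parsing logic can be exercised without AWS.

```python
import json

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def interrupted_instance_id(body):
    """Return the instance id from a Spot warning event, else None."""
    event = json.loads(body)
    if event.get("detail-type") != SPOT_WARNING:
        return None
    return event.get("detail", {}).get("instance-id")


def drain_queue_once(sqs, queue_url, on_interruption):
    """Long-poll the queue once and react to any Spot warnings found."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling keeps the empty-queue cost low
    )
    seen = []
    for message in response.get("Messages", []):
        instance_id = interrupted_instance_id(message["Body"])
        if instance_id is not None:
            on_interruption(instance_id)  # e.g. deregister from the LB
            seen.append(instance_id)
        # Delete every message we looked at so it is not redelivered.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
    return seen
```

Calling `drain_queue_once` in a loop gives a small "interruption handler" service. What `on_interruption` does (cordon a node, pull the instance out of a target group) depends on your stack.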
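Why not consider adding Kubernetes to your approach? The two-minute warning you get before a Spot interruption is often enough to cordon and drain the node and reschedule the pods elsewhere, provided replacement capacity is already running or comes up quickly.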
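If you want to see the warning from inside the instance itself, the instance metadata service exposes it at `spot/instance-action`. Below is a rough sketch using only the standard library: `fetch_instance_action` talks to IMDSv2 and only works on an EC2 instance, while `seconds_until_interruption` is a pure helper showing how much of the warning window is left. Both function names are illustrative.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw interruption notice, or None if none is pending.

    Raises URLError or a timeout when not running on EC2.
    """
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def seconds_until_interruption(body, now=None):
    """Seconds between `now` (UTC) and the time in the notice."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (when - now).total_seconds()
```

A small loop that polls `fetch_instance_action()` every few seconds and starts draining once it returns a notice is the usual pattern; on Kubernetes, a node termination handler does much the same thing for you.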
Honestly, I think there might not be an ideal solution. We've transitioned most of our GPU tasks to Lambda Labs since it feels like cloud prices are just skyrocketing for GPUs lately.