How Can I Run Stable GPU Inference on AWS Without Spot Interruptions?

Asked By CoolGalaxy123

I'm working with a small team on large language model inference using AWS, but we're struggling with frequent spot instance interruptions. We've tried various solutions like capacity-optimized ASGs and fallback strategies, as well as checkpointing, but these still fail when low latency is crucial. Reserved Instances don't provide the flexibility we need, and on-demand pricing is tough on our budget. Is there a reliable way to use AWS that allows us to keep our workloads stable while also being mindful of costs?

6 Answers

Answered By DevGuru_22

Could you explain how your current setup handles interruption notices? It sounds like you need a system that can migrate tasks before those notices expire. You might use AWS FIS (Fault Injection Simulator) to rehearse spot interruptions, and look at running Ray on Kubernetes via the KubeRay operator. Ray offers fault-tolerant task APIs, checkpointing for long-running work, and the ability to save progress when your setup receives a termination signal. You can also tune `terminationGracePeriodSeconds` to give your pods enough time to handle shutdown.

Answered By DataDrivenSteve

Here's my take: If you can't tolerate interruptions at all, then relying on spot instances may not be the best choice. They're not a one-size-fits-all solution.

Answered By CloudNinja99

Are you requesting just one instance type or several? Asking for a variety of instance types (and Availability Zones) that can all do the job gives the allocator more spot pools to draw from, which improves your odds of getting and keeping capacity. Just keep in mind that with the AI boom, affordable GPU capacity is a real challenge to find!

Answered By CodeMasterX

One strategy worth checking out: route the EC2 spot interruption warning through an EventBridge rule to an SQS queue, so a consumer can start draining the affected instance before it's actually reclaimed.

Answered By SysAdminJoe

Why not consider adding Kubernetes to your approach? The two-minute interruption warning EC2 gives you is usually sufficient to cordon the node and reschedule workloads onto a new one.

Answered By TechWhiz87

Honestly, I think there might not be an ideal solution. We've transitioned most of our GPU tasks to Lambda Labs since it feels like cloud prices are just skyrocketing for GPUs lately.
