Hey everyone! I recently discovered a new feature from CAST AI that allows live migration of running containers between nodes in EKS using CRIU. This piqued my interest, and I'm thinking about developing a Kubernetes operator that does the same thing. Has anyone here worked on something similar, or does anyone have insight into how this works? I'm looking for tips, ideas, or suggestions, and I want to gauge how feasible it is to build such an operator. I'm also curious why this capability isn't already a native feature in Kubernetes, since it seems like it could be quite beneficial for real-world applications.
6 Answers
There’s actually a working group focused on Checkpoint Restore in Kubernetes. It's an interesting area of development that could lead to more solutions down the road!
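For anyone who wants to poke at the checkpoint side today: the kubelet exposes a container checkpoint endpoint (behind the `ContainerCheckpoint` feature gate, and only with a container runtime that supports it). Here is a minimal sketch of how an operator might build the request URL and read the reply. The helper names, node address, and pod names are hypothetical, and the response shape is assumed from the Kubernetes docs rather than anything CAST AI has described. The actual call is a POST authenticated with kubelet client credentials, which is left out here.

```python
import json
from urllib.parse import quote


def kubelet_checkpoint_url(node_host, namespace, pod, container, port=10250):
    """Build the URL of the kubelet checkpoint endpoint for one container.

    Hypothetical helper: 10250 is the default kubelet port, and the
    path layout is /checkpoint/{namespace}/{pod}/{container}.
    """
    path = "/".join(quote(part, safe="") for part in (namespace, pod, container))
    return f"https://{node_host}:{port}/checkpoint/{path}"


def parse_checkpoint_response(body):
    """Return the checkpoint archive paths from the kubelet's JSON reply.

    Assumes a reply of the form {"items": ["/path/to/checkpoint.tar"]}.
    """
    return list(json.loads(body).get("items", []))


# Hypothetical node address and workload names, for illustration only.
url = kubelet_checkpoint_url("10.0.0.5", "default", "spark-driver", "main")
sample_reply = '{"items": ["/var/lib/kubelet/checkpoints/example.tar"]}'
archives = parse_checkpoint_response(sample_reply)
```

Note that this only creates a checkpoint archive on the source node. The kubelet has no matching restore call, so moving the archive and restoring it on another node is the part an operator would have to build itself.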
You might want to take a look at the zeropod project on GitHub. It also uses CRIU to manage container state. This could be a useful reference for your own operator.
You should definitely check out this podcast where they discuss the challenges of live migration in Kubernetes. They mentioned it took nearly a year to develop their solution, and CRIU is just one part of the puzzle. There’s a lot more to consider, like networking and storage. It looks like CAST AI is ahead of the game, and live migration can lead to some interesting use cases, like moving workloads between spot instances or managing resource-intensive tasks without downtime.
Thanks! I'll give it a listen right away.
I always assumed that live migration would be a standard feature. It's surprising that we don’t have it already!
This could be a game changer for long-running jobs such as Apache Spark workloads. Draining nodes for maintenance can disrupt processing, so live migration could be invaluable here.
Agreed! We run long-running Spark jobs, and avoiding state disruptions would be a huge advantage.
Is there really a need for this? It seems risky to keep containers alive that should be replaced regularly. But I guess it could help with specific use cases like game servers or stateful workloads where you want to retain state without manually remounting volumes. Certain workloads, especially those with local storage requirements, might benefit from seamless migration.
Exactly! Stateful applications often need that kind of flexibility. Plus, with Kubernetes evolving, these use cases may become more common.
Good point! In theory, it could help manage workloads better, especially with approaches like KubeVirt for VMs.

Thanks for the tip!