I'm running data pipelines on a shared Kubernetes cluster, processing a large volume of data every hour. Sometimes my jobs get stuck, and I suspect that other teams' jobs on the same cluster, even ones that don't overlap exactly with mine, could be affecting my performance.

I'd like to understand how memory is allocated among pods on a single node, and whether pods from previous jobs could still be holding onto resources when mine try to start. Also, since the cluster is managed by a cloud service and I don't have direct kubectl access, is there a way to check the real-time memory usage of my pods?

I come from a managed compute background, so Kubernetes internals are quite new to me. I'm happy to read documentation but could use some guidance on where to begin.
4 Answers
If you're struggling to understand how pods share memory, it may help to backtrack a bit: read up on Docker, Linux namespaces, and control groups (cgroups). Kubernetes builds its pod memory management on those fundamentals, and knowing them makes the big picture much easier to follow.
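One way to see cgroups at work: from inside a running container (or on any Linux host), the kernel exposes the effective memory limit as a plain file. This is a minimal sketch; the file paths are the standard cgroup v2 and v1 locations, but whether a limit is set at all depends on your pod spec.

```python
from pathlib import Path

def cgroup_memory_limit():
    """Return the cgroup memory limit in bytes, or None if unlimited/unknown.

    Checks cgroup v2 first (memory.max), then falls back to cgroup v1
    (memory/memory.limit_in_bytes). Inside a Kubernetes pod, this is the
    value the kubelet configured from the container's memory limit.
    """
    candidates = [
        Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
        Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
    ]
    for path in candidates:
        if path.exists():
            raw = path.read_text().strip()
            if raw == "max":  # cgroup v2 spelling for "no limit"
                return None
            return int(raw)
    return None  # no readable cgroup memory file found

print(cgroup_memory_limit())
```

Run this inside one of your job's containers and you'll see exactly the ceiling the cluster has imposed on it, regardless of what any dashboard shows.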
Memory on a node is divided among pods using cgroups, which enforce the memory requests and limits you set per container. The request is a scheduling guarantee: the scheduler only places a pod on a node that has at least that much memory unreserved. The limit is a hard ceiling: a container that exceeds its memory limit gets OOM-killed. Make sure to set both values in your pod definitions; without them, your pods run with BestEffort QoS and are among the first to be evicted under memory pressure. Setting the memory limit equal to the request is a common best practice, since it keeps node behavior predictable.
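For reference, here is a minimal sketch of what those fields look like in a pod spec. The pod name, container name, and image are placeholders; the sizes are examples you'd tune to your workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-job          # placeholder name
spec:
  containers:
  - name: worker
    image: my-registry/pipeline:latest   # placeholder image
    resources:
      requests:
        memory: "2Gi"   # scheduler reserves at least this much on the node
        cpu: "500m"
      limits:
        memory: "2Gi"   # hard ceiling; matching the request keeps memory predictable
        cpu: "1"
```

If every container in the pod has limits equal to its requests, the pod gets the Guaranteed QoS class, which makes it the last to be evicted under node memory pressure.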
AI tools can be helpful for learning basic concepts, but build a foundation first. Without enough background to tell correct information from mistakes, relying on AI for complex topics like this is risky.
To get a good grasp of this, start with how containers work. Read up on Linux cgroups: they are the mechanism Kubernetes uses to enforce memory allocation and limits for your pods. For your job's performance issues, check whether your team (and the others sharing the cluster) are setting resource requests and limits properly; unset or badly sized values on anyone's pods can affect yours. As for monitoring: does your cloud service provide monitoring tools? If not, it's worth asking the cluster admin to set something up, since tracking memory usage is essential for diagnosing stuck jobs.
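If the managed cluster runs metrics-server, pod memory usage is available through the `metrics.k8s.io/v1beta1` API even without kubectl, provided you can reach the API server with a token. The API returns quantities as strings like `"129Mi"`; below is a hedged sketch of converting them to bytes, using a made-up response shape and pod name for illustration:

```python
# Kubernetes resource quantities use binary suffixes (Ki, Mi, Gi) and
# decimal ones (k, M, G). This helper converts a quantity string to
# plain bytes so usage can be compared against limits.
# Note: binary suffixes are listed first so "Mi" matches before "M".
SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}

def quantity_to_bytes(q: str) -> int:
    for suffix, factor in SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # bare number means bytes

# Example shaped like a metrics.k8s.io/v1beta1 PodMetrics item
# (pod and container names here are made up):
sample = {
    "metadata": {"name": "pipeline-job-abc123"},
    "containers": [
        {"name": "worker", "usage": {"memory": "129Mi", "cpu": "250m"}},
    ],
}
for c in sample["containers"]:
    print(c["name"], quantity_to_bytes(c["usage"]["memory"]), "bytes")
```

In practice you would fetch the PodMetrics JSON from your provider's API endpoint (or its monitoring dashboard, which usually wraps the same data) and feed each container's `usage.memory` through a converter like this.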

Thanks! I haven't set those values in my YAML yet. I was going to try a different schedule first, but if that doesn't help, I'll look into adjusting the configs for my jobs.