Hey everyone, I'm running a couple of .NET 8 workloads on AWS EKS and they keep getting OOM-killed when they exceed their memory limits. The workloads have sporadic load: they mostly sit around 3Gi of RAM but can spike past 9Gi when traffic comes in. I've isolated them on Fargate so they don't impact other workloads. I suspect the .NET garbage collector is trying to use all the free RAM available, but I'm not sure how best to handle it. Any recommendations?
5 Answers
You might want to check out Kubernetes' in-place pod resize feature, which lets you change a pod's memory (and CPU) requests and limits without restarting it. I think it's in beta as of version 1.33, which could alleviate some of your problems.
Without more details, I think you might have a memory leak in your code. It could also help to have more pods ready or use KEDA for scaling, especially if this is a common issue. Implementing a message queue could manage the load better by processing jobs sequentially.
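To illustrate the queue idea in this answer: a minimal sketch, assuming an in-process queue stands in for a real message broker, of a single worker draining jobs one at a time so that only one job's working set is in memory at any moment. The names `run_sequentially` and `handler` are hypothetical and not taken from any particular library.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to exit


def run_sequentially(jobs, handler):
    """Feed jobs through a queue and process them one at a time, in order."""
    q = queue.Queue()
    results = []

    def worker():
        while True:
            job = q.get()
            try:
                if job is _STOP:
                    return
                # Only one job is handled at a time, which bounds peak memory
                # to roughly one job's working set.
                results.append(handler(job))
            finally:
                q.task_done()

    t = threading.Thread(target=worker)
    t.start()
    for job in jobs:
        q.put(job)
    q.put(_STOP)
    q.join()  # wait until every queued item has been processed
    t.join()
    return results


if __name__ == "__main__":
    print(run_sequentially([1, 2, 3], lambda n: n * n))
```

With a real broker, the same shape applies: producers enqueue work during a traffic spike and a fixed number of consumers pull from the queue, so memory use tracks the consumer count rather than the incoming request rate.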
In theory, .NET 8 should work fine with Kubernetes and should throw an OutOfMemoryException as it approaches the memory limit, rather than being killed outright. If your Kubernetes setup is outdated, or if you have unusual .NET settings, that might be the issue. Remember, Kubernetes isn’t great with unpredictable loads; finding the job that causes the spikes and isolating it might be your best bet.
Have you looked into using OpenTelemetry? There might be auto-instrumentation you can leverage to see where the memory goes. K8s generally prefers to scale horizontally, creating more pod replicas instead of scaling memory vertically. You could look into the Vertical Pod Autoscaler (VPA) for recommendations on requests and limits. It's also worth discussing with the developers whether an AWS SQS queue could handle the job load better.
.NET 8 doesn't recognize cgroup memory limits well, which could be a major issue on Fargate. You might want to upgrade to .NET 9 or set the DOTNET_GCHeapHardLimit environment variable to cap the size of the GC heap. Note that this caps the managed heap only, not the total memory of the process, so leave some headroom below the container limit.
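One detail that is easy to miss when setting this variable: the .NET GC configuration docs say that values passed through the environment variable are read as hexadecimal, so a decimal byte count is interpreted as a different number. Here is a minimal sketch for computing the value, assuming a 9 GiB container limit and a 75% heap budget. Both numbers are illustrative only, and `heap_hard_limit_hex` is a helper name made up for this example.

```python
GIB = 1024 ** 3


def heap_hard_limit_hex(container_limit_bytes: int, heap_fraction: float = 0.75) -> str:
    """Return a hex string (no 0x prefix) for use as DOTNET_GCHeapHardLimit.

    heap_fraction is an assumed budget for the managed heap; the remainder is
    headroom for native allocations, thread stacks, and the runtime itself.
    """
    if not 0 < heap_fraction <= 1:
        raise ValueError("heap_fraction must be in (0, 1]")
    heap_bytes = int(container_limit_bytes * heap_fraction)
    return format(heap_bytes, "X")


if __name__ == "__main__":
    # Assumed 9 GiB container limit, 75% of it given to the GC heap.
    print(f"DOTNET_GCHeapHardLimit={heap_hard_limit_hex(9 * GIB)}")
    # prints "DOTNET_GCHeapHardLimit=1B0000000"
```

If the value was previously set as a plain decimal byte count, the runtime would have read it as a much larger hex number, which may be why the limit appeared to have no effect.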
I tried setting that environment variable, but it didn't help at all. I guess upgrading to .NET 9 might be the next step.
What improvements does .NET 9 offer that might help with the OOM issues?
That’s what I’m concerned about; the devs get anxious about manually tuning the garbage collector. It might be a memory leak, or it could be related to GraphQL caching. I’m considering enabling a swap file too.