I'm working at a small EDA company where we share a login server mainly for tasks like VNC, file editing, and sometimes running CPU-intensive processes locally instead of on our compute farm. Our servers are equipped with 64-core Epyc 9354 CPUs and come with either 500GB or 1.5TB of memory, plus 250GB of swap. Currently swappiness is set to 10, and we generally have 10-20 users on each server, all running CentOS 7.
However, I've noticed that user processes occasionally go haywire and consume excessive memory. To mitigate this, I have `earlyoom` installed, but for some reason, it sometimes fails to kill the offending processes. When that happens, the system becomes unresponsive for hours until it crashes or recovers on its own.
Here are my main questions:
1. Should we have swap configured at all, or is it better to have no swap?
2. If we do use swap, what should the swappiness value be?
I suspect that the machine might not be aggressive enough in managing swap, which causes memory to max out before earlyoom can react. Should we opt for no swap, or should we tweak the swappiness to make it more aggressive?
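One thing I plan to try is raising earlyoom's thresholds so it acts sooner. A sketch of a systemd drop-in (the path, flags, and process names are illustrative, not our actual config):

```ini
# /etc/systemd/system/earlyoom.service.d/override.conf  (hypothetical drop-in)
[Service]
ExecStart=
# -m 5  : start killing when available RAM drops below 5%
# -s 10 : ...and free swap drops below 10%
# --avoid / --prefer take regexes to protect or target processes
ExecStart=/usr/bin/earlyoom -m 5 -s 10 --avoid '^(sshd|Xvnc)$'
```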
5 Answers
Just a heads up: you'll likely get a ton of responses from folks who don't really grasp what swap is for. Swap lets the kernel evict rarely-accessed anonymous pages from RAM so the memory can be put to better use (page cache, hot working sets). But it really depends on your workload. For instance, my laptop doesn't use swap at all, while I've dealt with servers where swap in use exceeded physical RAM, mainly due to Java processes. There's no universal answer; you'll need to test against your specific workloads.
Exactly! Swapping pages out is rarely the problem; it's constantly swapping them back in that slows things down. If your system is taking major page faults all the time, that's when you really start to feel the pain.
A bit of a controversial opinion, but I'd advise getting rid of disk swap and using compressed memory instead. Note that zswap is a compressed cache in front of an existing swap device, so if the goal is no disk swap at all, the closer fit is zram, which creates a compressed swap device entirely in RAM. Setting per-user memory quotas can also keep heavy memory users in check before you ever hit the OOM killer. Your servers are powerful enough that disk swap may simply be unnecessary, especially with 1.5TB of RAM available.
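A minimal zram setup sketch (zram provides a compressed swap device living entirely in RAM; the 64G size is illustrative, and it needs root plus the zram kernel module):

```shell
# Sketch: enable a compressed RAM-backed swap device via zram.
# Guarded so it only applies when run as root with the module available.
status="skipped: needs root and the zram module"
if [ "$(id -u)" -eq 0 ] && modprobe zram num_devices=1 2>/dev/null; then
    echo 64G > /sys/block/zram0/disksize    # uncompressed capacity cap
    mkswap /dev/zram0                       # format the device as swap
    swapon -p 100 /dev/zram0                # high priority: used before any disk swap
    status="zram swap enabled"
fi
echo "$status"
```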
In my opinion, it's probably best to avoid swap altogether: under memory pressure it hammers your I/O and wears down SSDs. Instead, use cgroups to cap per-user resource allocations. The systemd documentation covers the resource-control settings that help keep things manageable under high load.
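On CentOS 7 (systemd 219, cgroups v1) the relevant knob is `MemoryLimit=` rather than the v2-era `MemoryMax=`. A hypothetical drop-in capping one user's slice might look like this; the UID, path, and limit are examples, and whether per-UID slice drop-ins are honored depends on your systemd version, so test on your hosts first:

```ini
# /etc/systemd/system/user-1000.slice.d/memory.conf  (hypothetical path/UID)
[Slice]
MemoryAccounting=yes
# cgroups v1 hard cap; on cgroups v2 systems this would be MemoryMax=
MemoryLimit=64G
```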
That makes sense! One issue we're running into is being on CentOS 7 and cgroups v1; it's a bit limited compared to v2, but I'll look into how I can optimize within those constraints.
If runaway user processes are the problem, it's better to enforce resource limits up front than to deal with OOM situations that can crash the whole system. Generally, having some swap configured lets the machine degrade gracefully under memory pressure instead of just locking up. On the other hand, if predictable performance matters more and you'd rather the OOM killer fire quickly than have the box thrash for hours, running without swap is defensible. As for swappiness, it really depends on your storage; benchmark a few values against your typical workloads.
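If you do experiment with swappiness, a quick way to check the current value and swap usage before trialing a change (paths are the standard Linux ones):

```shell
# Inspect current swap tuning and usage before experimenting
cur=$(cat /proc/sys/vm/swappiness 2>/dev/null || echo unknown)
echo "vm.swappiness = $cur"
grep -E 'SwapTotal|SwapFree' /proc/meminfo 2>/dev/null || true
# To trial a value at runtime (root):  sysctl -w vm.swappiness=10
# To persist across reboots, add to /etc/sysctl.conf:
#   vm.swappiness = 10
```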
Totally! I’ve also found a watchdog to be more reliable than waiting on the OOM killer, especially when things get stuck.
Honestly, I would say no swap. You’ve got a beefy server; given what the machine costs, there's really no business case for maintaining swap. Just size the RAM for your expected workload and skip the extra complexity of managing swap effectively.
Thanks for your insight! Yeah, we've definitely run into Java consuming lots of memory too. The advice on swap is all over the place; in the past we used to size it equal to system RAM, but that rule of thumb doesn't hold for our newer machines.