Why You Should Pay Attention to PSI Instead of Just CPU Usage

Asked By BreezyPenguin42 On

I've noticed an interesting difference between two Linux servers. Server A is running at nearly 100% CPU usage while encoding video, yet it has low latency and is processing requests quickly. Server B, on the other hand, is only at 40% CPU but is experiencing API timeouts and lagging SSH connections. This made me realize that CPU graphs alone can be misleading: Server A may look worse based on CPU percentage, but it's simply busy doing useful work, whereas Server B is under pressure, with tasks waiting for CPU time.

It's common to see alerts that trigger when CPU usage exceeds 80% for over five minutes. But CPU percentage alone doesn't tell you whether tasks are stuck; it merely reflects that cores are busy.

Starting with Linux 4.20, there's a feature called Pressure Stall Information (PSI) that provides better insight into how long tasks are stalled waiting on CPU, memory, or I/O. For instance, PSI can show that, over the last 60 seconds, tasks were stalled 5.23% of the time waiting for CPU.

I've switched my observability project to PSI instead of load average, and it significantly reduced false alarms. I'm curious whether anyone here is using PSI in their production alerts.
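For anyone who wants to try this: on kernels with PSI enabled, the numbers live in `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io`. Here's a minimal Python sketch for parsing one of those lines; the sample line below is made up to match the 5.23% figure, and the field layout assumes the standard `some avg10=... avg60=... avg300=... total=...` format:

```python
def parse_psi_line(line):
    """Parse one line of /proc/pressure/* output, e.g.
    'some avg10=0.00 avg60=5.23 avg300=1.98 total=123456',
    into (kind, {metric: value}). 'some' means at least one task
    was stalled; 'full' means all non-idle tasks were stalled.
    """
    kind, *fields = line.split()
    stats = {}
    for field in fields:
        key, value = field.split("=")
        stats[key] = float(value)
    return kind, stats

# On a real system you would read the file instead:
#   with open("/proc/pressure/cpu") as f:
#       for line in f:
#           kind, stats = parse_psi_line(line)
kind, stats = parse_psi_line("some avg10=0.00 avg60=5.23 avg300=1.98 total=123456")
print(kind, stats["avg60"])  # some 5.23
```

`avg60=5.23` is exactly the scenario from the question: over the last minute, tasks were stalled waiting for CPU 5.23% of the time, regardless of what the raw utilization percentage says.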

5 Answers

Answered By ScalingSeeker987 On

Why not just scale based on API latency instead?

LatencyLover88 -

Latency won’t always improve with scaling, though. It’s sometimes better to scale before you notice latency issues popping up.

ThroughputThom -

Good point, some tasks might not even depend on API responses, like video encoding.

DatabaseDweller99 -

Exactly, if the bottleneck is in the database, just adding more servers can worsen latency due to more connections.

Answered By PracticalPam On

I'm glad you shared this! I honestly wasn’t aware of PSI before. It seems really useful for alerting, but it definitely requires a deeper look into the overall system behavior.

Answered By AnalyticalAndy On

This example is a bit odd, because Server A is the one that would actually benefit from extra CPU resources. Just scaling up Server B won't help; that looks more like a software problem. So, in relation to your CPU usage point, the anecdote seems a bit off.

DataDrivenDylan -

That's a fair point! Server A maxing out isn’t great, and just looking at CPU stats doesn’t necessarily mean you should scale without considering the workload and the underlying issues.

Answered By BusyBeeBill On

Yeah, we've started tracking CPU, memory, and disk stall metrics in our monitoring. Our time is stretched thin, though, so fixing the foundational problems isn't happening right now.

Answered By TechieTom123 On

PSI is great to include as an additional signal, but sustained 100% CPU utilization doesn't usually end well. If your system is constantly context switching or has I/O threads stalling, a single extra request can push it over the edge into cascading failures. And if a process is timing out while the box sits at only 40% CPU, that points to issues elsewhere: a single-threaded process pegged at 100% of one core, or slow I/O, could be behind those delays. The critical thing is to assess the overall health of the system from multiple signals rather than relying on CPU alone.
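To make the "multiple signals" idea concrete, here's a hedged sketch of an alert condition that combines utilization with the PSI CPU `avg60` value. The thresholds (95%, 60%, 5.0) are hypothetical examples, not recommendations; tune them for your workload:

```python
def should_alert(cpu_util_pct, psi_cpu_avg60):
    """Alert when tasks are actually stalling, not merely when
    cores are busy. High utilization with negligible stall time
    (the Server A pattern) is fine; stall time above ~5% of the
    window, whether the box is saturated or mostly idle (the
    Server B pattern), is not.
    """
    saturated_and_stalling = cpu_util_pct > 95 and psi_cpu_avg60 > 5.0
    stalling_while_idle = cpu_util_pct < 60 and psi_cpu_avg60 > 5.0
    return saturated_and_stalling or stalling_while_idle

print(should_alert(98, 0.3))   # busy but healthy -> False
print(should_alert(40, 12.0))  # Server B pattern -> True
```

The point of the second condition is exactly the original poster's Server B: low utilization plus high stall time is a stronger trouble signal than either number on its own.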

CuriousCoder56 -

I totally agree! I don’t think anyone is saying to ignore CPU usage altogether, but just focusing on that metric can be misleading. PSI gives a clearer picture of task stalling which is definitely more helpful in diagnosing issues.

SysAdminSally77 -

Exactly! If you're running at capacity and seeing heavy context switching, that's already a sign that your system is in trouble. It just proves that all signals need to be observed together.
