Hey everyone! I'm diving into the world of Azure VMs, and my background is in troubleshooting on-premises setups. Troubleshooting Azure VMs seems to involve more factors, such as IOPS caps and VM size limits. Since I'm not working with Azure VMs at my job yet, I want to get familiar with the typical checks. When someone says a VM is slow, what are the key metrics to monitor beyond CPU and memory? Also, what should I do if I can't RDP into the VM?
1 Answer
When troubleshooting Azure VMs, I usually start with the standard metrics in Azure Monitor: CPU usage, memory pressure, disk latency, and network throughput. If those look normal, I check extension health and any recent platform events, because a lot of issues stem from failed updates or extensions. I also recommend verifying network security groups and routing in case a recent change is affecting connectivity, which is the first thing to rule out when RDP stops working. Finally, I created a quick health check script that compiles the key metrics into a report, which makes troubleshooting under pressure a lot easier!
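For anyone curious what a report like that could look like, here is a minimal sketch in Python. It is not the script mentioned above, just one possible shape for it. It assumes the metric values have already been collected somehow (for example, exported from Azure Monitor), so it makes no Azure calls. The `Check` class, the `build_report` function, the VM name, and all values and limits are hypothetical placeholders; pick limits that match your own VM size and disk tier.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One metric reading compared against a limit (all values are illustrative)."""
    name: str
    value: float
    limit: float
    unit: str

    @property
    def ok(self) -> bool:
        # A reading passes when it is at or below its limit.
        return self.value <= self.limit


def build_report(vm_name: str, checks: list[Check]) -> str:
    """Compile the checks into a short plain-text report."""
    lines = [f"Health report for {vm_name}"]
    for c in checks:
        status = "OK" if c.ok else "WARN"
        lines.append(
            f"[{status}] {c.name}: {c.value:g} {c.unit} (limit {c.limit:g} {c.unit})"
        )
    flagged = [c.name for c in checks if not c.ok]
    lines.append("Flagged: " + (", ".join(flagged) if flagged else "none"))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample readings; replace with values you collected yourself.
    sample = [
        Check("CPU usage", 95.0, 85.0, "%"),
        Check("Disk latency", 4.0, 20.0, "ms"),
        Check("Disk IOPS consumed", 98.0, 90.0, "%"),
    ]
    print(build_report("example-vm", sample))
```

Keeping collection separate from reporting means the same report code works whether the numbers come from the portal, the CLI, or an SDK.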

Thanks for the tips! I'm definitely going to look into that report strategy. Where exactly do you check for those "recent platform events"? And I totally get what you mean about NSG rule changes being sneaky; do those show up in the activity log?