I've been having a frustrating issue where several of my EC2 instances keep becoming corrupted, rendering them inaccessible via SSH. This seems to have started when I was processing large datasets and ran out of disk space. I increased my drive size and split my data processing into smaller chunks to fix this. However, I've still encountered more corruptions while working on my Python code, and the last instance failed while I was troubleshooting a package component.
When I check the connection status in the console, it reports that the SSM Agent is offline, which suggests the agent can't reach Systems Manager to register itself. I've tried rebooting, and I've tried fully stopping and starting the instances, but neither has helped. Thankfully, my volumes aren't corrupted, so I can attach them to new instances, but it takes a lot of time to set everything up again.
I'm currently using a t3.large instance with a Deep Learning Base OSS NVIDIA Driver GPU AMI on Ubuntu 24.04. I'm wondering if anyone else has faced this issue and if you have any recommendations. I'm spending too much time rebuilding instances instead of actually working on my projects.
Edit: I managed to regain access by connecting through PowerShell and reinstalling VS Code Remote-SSH. I will keep an eye on resource usage to see if that leads to instance corruption.
4 Answers
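If you still have the EBS volumes, try attaching and mounting them on a different instance. That way you can check whether the data is intact. Your instance might not actually be corrupted, just unresponsive. Note also that SSH and SSM use different network paths: SSH needs an inbound connection to the instance (port 22 by default), while the SSM Agent makes outbound HTTPS connections to Systems Manager. If both fail at the same time, an overloaded instance is a likely explanation, since neither sshd nor the agent can respond when the machine is out of memory or CPU.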
I’ve seen issues like this when instances run out of memory. Have you checked memory utilization? It could be that your t3 instance isn’t handling the load well.
Have you looked at your resource usage? From what you’ve described, it sounds like your instance might be running out of resources, which can lead to the issues you're seeing. Keep an eye on that as it could be the underlying cause.
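If it helps, here is a minimal sketch of a check you could run on the instance while a job is going. It assumes a Linux instance (it reads `/proc/meminfo`, which Ubuntu provides) and uses only the Python standard library. The function names and the 90% threshold are my own choices for illustration, not anything AWS prescribes.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_fraction(meminfo):
    """Fraction of RAM in use, based on MemAvailable rather than MemFree."""
    return 1 - meminfo["MemAvailable"] / meminfo["MemTotal"]


def disk_used_fraction(path="/"):
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_resources(meminfo_path="/proc/meminfo", disk_path="/", threshold=0.9):
    """Return {resource: (used_fraction, is_near_limit)}.

    `threshold` is an arbitrary illustrative cutoff, not a recommended value.
    """
    with open(meminfo_path) as f:
        meminfo = parse_meminfo(f.read())
    report = {
        "memory": memory_used_fraction(meminfo),
        "disk": disk_used_fraction(disk_path),
    }
    return {name: (frac, frac >= threshold) for name, frac in report.items()}
```

Calling `check_resources()` periodically (for example from a loop in a second terminal, or from cron) and logging the result to a file on the volume should show whether memory or disk was close to full shortly before the instance became unreachable.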
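You mentioned using a t3 instance for deep learning tasks, which might not be the best fit. T3s are burstable general-purpose instances with no GPU, and a t3.large has only 2 vCPUs and 8 GiB of memory, so they are better suited to lighter workloads. The GPU AMI you're running won't have a GPU to use on that instance type. Maybe consider a more powerful instance type for your work.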
