I've been dealing with a frustrating issue lately where multiple EC2 instances, around 5 or 6, have become corrupted in just a week. It started when I encountered space issues while processing large data sets, which I thought I fixed by increasing my drive size and chunking the processing tasks. However, the last few incidents happened while I was simply writing code in Python. I noticed a message saying 'SSM Agent is not online', indicating it couldn't connect to a Systems Manager endpoint. I've tried rebooting and shutting down completely, but nothing seems to work. The silver lining is that my volumes remain intact, so I can attach them to new instances. I'm using a t3.large instance with a specific Deep Learning AMI on Ubuntu 24.04. Has anyone experienced something similar? I'm really looking for advice since I'm wasting too much time setting up new instances instead of focusing on my work.
4 Answers
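I see this kind of behavior often with instances that have run out of memory. When available memory is exhausted, the kernel can start killing processes or the whole system can stall, and from the outside that looks like a crashed or unreachable instance.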
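One way to test the out-of-memory theory is to look for OOM-killer messages in the kernel log of the affected volume or a previous boot. The following is a minimal sketch, not a definitive tool: the function name `find_oom_kills` and the sample lines are made up for illustration, and the pattern assumes the usual Linux kernel wording ("Out of memory: Killed process ..."), which can vary between kernel versions.

```python
import re
from typing import Iterable, List, Tuple

# Assumed kernel wording: "Kill process" on older kernels, "Killed process" on newer ones.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM-killer line found in the log lines."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Hypothetical log excerpt, only to show the shape of the input.
sample_log = [
    "kernel: Out of memory: Killed process 4321 (python3) total-vm:9000000kB",
    "kernel: usb 1-1: new device found",
]

for pid, name in find_oom_kills(sample_log):
    print(f"OOM killer terminated pid {pid} ({name})")
```

On a real instance you would feed it actual log lines, for example the output of `journalctl -k -b -1` (kernel messages from the previous boot, if the journal is persistent) or a kernel log file read from the detached volume.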
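Considering you're on a t3 instance with a heavy workload, this could be your problem! T3 instances are burstable: they're great for saving money on workloads that idle most of the time and spike occasionally, but they're not well suited to sustained heavy processing. A t3.large also has only 2 vCPUs and 8 GiB of memory, so it might be struggling to keep up with the demands you're placing on it.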
It sounds like you might be running into resource limitations with your instances. Have you checked the resource utilization? From what you described, it seems like you could be overloading your instance and causing it to lock up during tasks.
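If you want a quick look at memory utilization from inside the instance, here is a rough sketch that reads `/proc/meminfo` (standard on Linux). The helper names and the 90% warning threshold are my own assumptions, not anything AWS prescribes.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into a {field: value} dict (values mostly in kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_used_fraction(info: Dict[str, int]) -> float:
    """Fraction of RAM in use, based on MemTotal and MemAvailable."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def check_local_memory(path: str = "/proc/meminfo", warn_at: float = 0.9) -> str:
    """Read the local meminfo file and return a one-line summary."""
    with open(path) as handle:
        used = memory_used_fraction(parse_meminfo(handle.read()))
    status = "WARNING" if used >= warn_at else "ok"
    return f"{status}: {used:.0%} of memory in use"
```

Calling `print(check_local_memory())` periodically while you work (or from a cron job that appends to a file on the EBS volume) would leave a trail you can inspect after the next lockup.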
Since you've still got the EBS volumes, you could try attaching them to a different instance to see what's going on with the data and the logs. The instance itself might be broken but not necessarily corrupted. Just remember that SSH and SSM have different networking requirements, which might also be part of the issue you're facing: SSH needs inbound access to the instance (port 22 by default), while the SSM Agent needs outbound HTTPS access to the Systems Manager endpoints.

You're probably right! I noticed my last couple of instances crashed, but I wasn't doing anything particularly heavy at the time. I've managed to get back in by forcing VS Code's remote-SSH to reinstall, so I'll definitely keep an eye on my resource usage.