Hey folks, I'm looking for some insights from DBAs or anyone experienced with AWS RDS Aurora MySQL. We recently faced a major incident that I'm trying to understand to prevent it from happening again. I ran two queries from IntelliJ: a `CREATE INDEX` on a table with around **10 million** rows that took about 44 minutes and reported as successful, followed by an `UPDATE` on **3 million** rows that ran for 2 hours, also showing as successful. The fallout was intense: we started getting alerts the next morning with all database connections failing, high CPU usage, and thousands of active sessions. We later discovered that the `CREATE INDEX` query never actually completed, which caused the `UPDATE` to also not run, leading to my questions: Why show as successful if they didn't complete? And what should a DBA do in situations like this to fix the database and debug what's going on? Any advice would be greatly appreciated!
5 Answers
I’m not a DBA, but have you checked whether the index even got created? Sometimes, things can appear successful, but there could be a silent failure in the background.
To troubleshoot, try taking a snapshot of your Aurora MySQL cluster and reproduce the issue there. Use `SHOW PROCESSLIST` to see what the `CREATE INDEX` was doing while it was running. If neither of these methods works, set up alarms for long-running queries so you don’t experience another outage like that again. Also, the advice about using lambda functions for alerts is pretty useful!
It sounds like the index creation might have been waiting for a lock that never got released, and those 44 minutes could have just been a timeout. IntelliJ reporting success could be misleading since the operation was likely still in progress. You might want to verify that the index was created afterwards, as there should be a way to validate it before running the `UPDATE`. Also, consider monitoring your database locks more closely in the future.
That’s a solid point! Definitely validate the index after running such heavy commands.
You mentioned high CPU usage—what instance class are you using? That seems super slow to create an index on 10 million rows unless the instance is significantly undersized.
Yeah, a larger instance can definitely speed up those operations!
It sounds like you might have been hit by a metadata lock issue, or your queries just never committed correctly. In your setup, failing over was a good move, but normally you’d want to check active locks first. Next time, consider setting up alarms for long-running queries, so you can catch these issues before they escalate overnight!

Good call! Creating a test environment will help figure out what went wrong while avoiding issues on the live system.