I'm about to upgrade my EKS cluster from version 1.31 to 1.32 and also transition my node groups from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023). This is for a large production environment with 12 m5.xlarge nodes, so I need to tread carefully. I've got a few questions for anyone who's already gone through this process:
- Did you face any issues or unexpected errors during your upgrade?
- Were there any specific quirks with AL2023 or problems related to CNI and networking, or issues with daemonsets?
- Are there any notable differences in the kernel, systemd, or containerd that I should watch for?
- Is there anything you wish you had known before starting the upgrade?
I'm trying to avoid any surprises during the rollout. Appreciate your insights!
5 Answers
A good approach is to create a new node group and gradually migrate your applications over. Scale the old group down to zero, monitor everything for a few days, then remove it completely. Be cautious about pods that depend on persistent EBS volumes: an EBS volume is tied to a single availability zone, so a pod rescheduled onto a node in a different AZ won't be able to attach it. Also, make sure the new instances have the right security groups attached.
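To make the availability zone concern concrete, here is a minimal sketch that compares the zones of existing volumes against the zones the new node group covers. The volume names, zone names, and the `find_stranded_volumes` helper are all hypothetical; in practice you would fill the inventory from your own cluster.

```python
from typing import Dict, List, Set


def find_stranded_volumes(pv_zones: Dict[str, str], new_group_zones: Set[str]) -> List[str]:
    """Return names of volumes whose zone has no node in the new node group."""
    return sorted(name for name, zone in pv_zones.items() if zone not in new_group_zones)


# Hypothetical inventory: volume name -> availability zone of the EBS volume.
pv_zones = {
    "pg-data-0": "us-east-1a",
    "pg-data-1": "us-east-1b",
    "kafka-0": "us-east-1c",
}
# Hypothetical zones covered by the subnets of the new AL2023 node group.
new_group_zones = {"us-east-1a", "us-east-1b"}

stranded = find_stranded_volumes(pv_zones, new_group_zones)
print(stranded)
```

Any volume this reports would leave its pod stuck in Pending once the old group is scaled to zero, so the new group needs a subnet in that zone first.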
As someone who's not a pro, I'd recommend checking AWS's resources for the upgrade. The main challenge seems to stem from obsolete Kubernetes API versions rather than the AL2 to AL2023 switch, so run a tool that flags deprecated APIs before you upgrade. And honestly, I don't get why more companies don't set up a blue/green deployment strategy for transitions like this. It could really save a lot of headaches!
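As a rough illustration of what such a check does, the sketch below scans manifest text for API versions that Kubernetes 1.32 no longer serves. The removed set reflects my understanding of the upstream deprecation guide (the `flowcontrol.apiserver.k8s.io/v1beta3` resources), and `find_removed_apis` is a hypothetical helper. It only looks at top-level `apiVersion` and `kind` lines in plain YAML, so treat it as a toy rather than a substitute for a real scanner or the EKS upgrade insights.

```python
import re
from typing import List, Tuple

# (apiVersion, kind) pairs assumed to be removed in Kubernetes 1.32.
REMOVED_IN_1_32 = {
    ("flowcontrol.apiserver.k8s.io/v1beta3", "FlowSchema"),
    ("flowcontrol.apiserver.k8s.io/v1beta3", "PriorityLevelConfiguration"),
}


def find_removed_apis(manifest_text: str) -> List[Tuple[str, str]]:
    """Return (apiVersion, kind) pairs in the manifest that 1.32 no longer serves."""
    hits = []
    # Split a multi-document YAML stream on its '---' separator lines.
    for doc in re.split(r"^---[ \t]*$", manifest_text, flags=re.MULTILINE):
        api = re.search(r"^apiVersion:\s*(\S+)", doc, flags=re.MULTILINE)
        kind = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if api and kind and (api.group(1), kind.group(1)) in REMOVED_IN_1_32:
            hits.append((api.group(1), kind.group(1)))
    return hits


# Hypothetical manifest: one current resource and one using a removed API version.
sample = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: example
"""

print(find_removed_apis(sample))
```

Scanning files in a repository misses objects created by Helm charts or operators at runtime, which is why checking what the live API server is actually being asked for is the safer signal.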
I think you'll want to keep an eye on resource utilization, since I've heard reports of increased CPU usage on AL2023. Capture a baseline on AL2 before you migrate so you have something to compare against. It's kind of surprising to see people still using m5 instances, so make sure your setup is optimized first!
I recently did both upgrades on a pretty big cluster—around 150 to 300 nodes—and didn't run into any major issues, so you might be in good shape!
For the switch from AL2 to AL2023, we had no issues at all. On the kernel side, the difference to watch is cgroup v2: older Java versions that aren't cgroup v2 aware can misread the container's memory limit, so you might see higher memory usage in your pods compared to AL2, but this is expected. Also, definitely check for deprecated APIs in Kubernetes 1.32 to avoid running into issues down the road!
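If you want to confirm which cgroup version a node or container actually sees, a minimal sketch like the following works on Linux. It relies on the standard layout where a cgroup v2 (unified) hierarchy exposes a `cgroup.controllers` file at its root; the helper names are hypothetical, and the `root` parameter exists only to make the functions easy to point at another path.

```python
from pathlib import Path


def detect_cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if root looks like a unified cgroup v2 hierarchy, otherwise 1."""
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


def memory_limit_file(root: str = "/sys/fs/cgroup") -> Path:
    """Return the file that holds the memory limit for the detected cgroup version."""
    base = Path(root)
    if detect_cgroup_version(root) == 2:
        return base / "memory.max"  # cgroup v2 name
    return base / "memory" / "memory.limit_in_bytes"  # cgroup v1 name


print("cgroup version:", detect_cgroup_version())
print("memory limit file:", memory_limit_file())
```

Run inside a pod on each node type, this shows whether the runtime reads its limit from `memory.max` or `memory.limit_in_bytes`. A JVM that only knows the v1 file falls back to sizing its heap from the host's memory.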

And don't forget that AL2023 disables IMDSv1 by default and sets the metadata hop limit to 1. With a hop limit of 1, pods on the pod network can't reach the instance metadata service, so anything that gets its permissions from the instance role will lose them. You might want to look into IRSA or adjust the launch template accordingly.
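Here is a small sketch of the hop limit reasoning, using the field names from the EC2 launch template `MetadataOptions` structure (`HttpEndpoint`, `HttpTokens`, `HttpPutResponseHopLimit`). The `pods_can_reach_imds` helper and the two example dictionaries are hypothetical, and the check assumes pods on the regular pod network, which sit one network hop further from IMDS than the node itself.

```python
from typing import Any, Dict


def pods_can_reach_imds(metadata_options: Dict[str, Any]) -> bool:
    """Return True if a pod on the pod network could get a response from IMDS.

    Assumption: the pod's traffic crosses one extra hop compared with the host,
    so the PUT response hop limit must be at least 2. Host-network pods are
    not covered by this check.
    """
    if metadata_options.get("HttpEndpoint", "enabled") != "enabled":
        return False
    return int(metadata_options.get("HttpPutResponseHopLimit", 1)) >= 2


# Settings as described in the answer above: IMDSv2 required, hop limit 1.
al2023_node_options = {"HttpTokens": "required", "HttpPutResponseHopLimit": 1}

# Hypothetical launch template override for workloads that still need the instance role.
adjusted_options = {**al2023_node_options, "HttpPutResponseHopLimit": 2}

print(pods_can_reach_imds(al2023_node_options))
print(pods_can_reach_imds(adjusted_options))
```

Raising the hop limit restores the old behaviour but also lets every pod on the node use the node's role, which is why moving workloads to IRSA is usually the cleaner fix.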