Hey folks, I'm running HashiCorp Vault on an AWS Auto Scaling Group (ASG) and I keep losing quorum when rotating nodes during version upgrades or other operational changes. The main problem is that when the ASG terminates a node, the Raft peer list isn't updated automatically. The stale peer entries still count toward the quorum size, so the cluster loses quorum even though enough healthy nodes remain.
I've looked at two approaches so far:
1) Autopilot dead-server cleanup. This is a potential solution, but the default dead_server_last_contact_threshold is 24 hours, which is too long for my needs. I need something more immediate for node rotations.
2) ASG Lifecycle Hooks. The idea is to automate peer removal whenever a node enters the termination lifecycle, so the peer entry is cleaned up right away instead of waiting for the Autopilot threshold.
Has anyone implemented ASG lifecycle hooks for managing Vault peer entries? I'd love to hear the implementation details, specifically how you coordinate the ASG termination hook with the peer removal (API calls, scripts, Lambda, etc.). Also, are there other strategies I should consider for maintaining quorum during planned node rotations?
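To make the quorum problem concrete, here is a rough sketch of the arithmetic. It assumes Raft's usual majority rule (quorum = floor(n/2) + 1 over every voter in the peer list, stale or not) and that each replacement joins as a brand-new peer rather than reusing the old node ID. The function names are made up for illustration and are not part of Vault.

```python
def quorum_size(voters):
    """Majority needed among `voters` entries in the Raft peer list."""
    return voters // 2 + 1


def has_quorum(live, stale):
    """True if the live nodes alone still form a majority of all listed peers."""
    return live >= quorum_size(live + stale)


def simulate_rotation(cluster_size, remove_stale):
    """Replace every node one at a time.

    Returns a list of (live, stale, quorum_ok) tuples, one per termination,
    and stops at the first step where quorum is lost.
    """
    live, stale = cluster_size, 0
    history = []
    for _ in range(cluster_size):
        live -= 1                  # the ASG terminates one node
        if not remove_stale:
            stale += 1             # its entry stays in the peer list
        ok = has_quorum(live, stale)
        history.append((live, stale, ok))
        if not ok:
            break                  # without quorum the replacement cannot join
        live += 1                  # the replacement joins as a new peer
    return history


if __name__ == "__main__":
    print("no cleanup:  ", simulate_rotation(3, remove_stale=False))
    print("with cleanup:", simulate_rotation(3, remove_stale=True))
```

Under these assumptions a 3-node cluster survives the first termination (2 live out of 3 listed) but not the second (2 live out of 4 listed, quorum of 3), while removing each peer as it leaves keeps every step at 2 live out of 2 listed.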
5 Answers
Upgrading Vault is pretty straightforward. It's usually just a matter of updating the package and restarting. Just make sure you upgrade the nodes in the correct order (standbys first, the active node last). Honestly, I wouldn't bother using an ASG for this.
Are you running on raw VMs? That could complicate things, since fresh nodes need a lot of work to sync with your cluster. Alternatively, have you thought about using Kubernetes? It gives you a much cleaner environment with stable hostnames and volumes, which might simplify your setup.
For your lifecycle hook, I recommend a timeout of around 300 seconds. That should leave enough time to remove the terminating node from the peer list before the instance actually goes away.
Why are you using an ASG for Vault in the first place? Vault typically doesn't see massive fluctuations in load. It might be better to do proper capacity planning, run a fixed number of nodes, and rely on simple rolling upgrades. If you are seeing high load, address that first: Vault should focus on secret management, not act as a general key-value store.
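A minimal sketch of how such a hook could be wired to a Lambda function, in case it helps. It assumes the hook's event arrives via EventBridge, that each node's Raft node_id is set to its EC2 instance ID, and that VAULT_ADDR and VAULT_TOKEN are available as environment variables (a shortcut for illustration; a real setup would more likely authenticate through an auth method). The helper names are hypothetical. The Vault call is the `sys/storage/raft/remove-peer` endpoint and the AWS call is `complete_lifecycle_action`.

```python
import json
import urllib.request


def build_remove_peer_request(vault_addr, token, server_id):
    """Build the HTTP request for Vault's Raft remove-peer endpoint."""
    body = json.dumps({"server_id": server_id}).encode()
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/storage/raft/remove-peer",
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


def handle_termination(event, remove_peer, complete_action):
    """Remove the terminating node from Raft, then let the ASG proceed.

    `remove_peer(server_id)` and `complete_action(**kwargs)` are injected so
    the flow can be exercised without AWS or Vault. If `remove_peer` raises,
    the lifecycle action is not completed here.
    """
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]
    remove_peer(instance_id)  # assumption: Raft node_id == EC2 instance ID
    complete_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )


def lambda_handler(event, context):
    """Lambda entry point; wires real Vault and AWS calls into the flow."""
    import os
    import boto3  # imported here so the helpers above work without it

    addr, token = os.environ["VAULT_ADDR"], os.environ["VAULT_TOKEN"]

    def remove_peer(server_id):
        req = build_remove_peer_request(addr, token, server_id)
        with urllib.request.urlopen(req, timeout=10) as resp:
            resp.read()

    autoscaling = boto3.client("autoscaling")
    handle_termination(event, remove_peer, autoscaling.complete_lifecycle_action)
```

If the peer removal fails, the exception propagates and the lifecycle action is left open, so the instance stays in the terminating wait state until the hook's timeout (the 300 seconds above) expires. That leaves room for a retry, but it cannot block termination indefinitely.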
I totally understand your frustration. I used to work with a 3-service ECS cluster where each service had a stable hostname. So whenever anything went down, a new task with the same name would come up and rejoin the cluster without issues.
