I'm in the process of deploying CNPG in a multi-shard, multi-TB environment, and I'd love to hear some tested advice. We're setting up backups with S3 and will have two data centers, each with its own CNPG deployment connected through replica clusters as recommended in the documentation.
The architecture will feature 1-3 synchronous read replicas for each primary in both data centers. My main concern is how to automate the promotion of the secondary data center when the primary goes down. Currently, CNPG requires a manual step for this process, but we aim for an automated switchover. I realize that achieving RPO=0 might be challenging due to sync replication being relatively slow, but I'd appreciate any insights or alternative strategies to achieve something similar to the continuous replication recovery that RDS and GCP offer. Any experiences or advice would be greatly appreciated!
3 Answers
In our work environment, we decided on manual failover because losing an entire data center or availability zone is a scenario that typically warrants manual intervention anyway, so we've incorporated it into our disaster recovery plan. I personally wouldn’t want an automated switch to trigger erroneously, risking data inconsistencies in PostgreSQL.
Also, if you're open to alternatives, using MySQL with Vitess might allow you to integrate with Orchestrator for automatic failovers while keeping location awareness in mind, though I don't have any personal experience with that. To clarify, I believe you meant RTO=0 instead of RPO=0, since we actually maintain a very low RPO by streaming WALs directly to S3 storage.
You might want to consider writing a custom script that monitors your primary CNPG instance to automatically promote the replica when it detects an issue. However, keep in mind that a replica cluster might not be ideal for complete automation, as it's often designed for manual failover processes. If you have enough bandwidth and low latency, one option could be to run a single CNPG cluster that spans multiple data centers or to use a service mesh to connect separate clusters. This way, it can automatically promote a read replica in the event of a leader failure, although you'll need to manage latency carefully.
While I'm not a CNPG expert, I found a section on automated failover in their documentation that seems to cater to automatic processes. Maybe it could suit your needs, although I'm unsure how it fits with your cluster replica mode. It could be worth checking out!

Related Questions
Can't Load PhpMyadmin On After Server Update
Redirect www to non-www in Apache Conf
How To Check If Your SSL Cert Is SHA 1
Windows TrackPad Gestures