Multi-DC for high availability (Innovation Release)

Deploying Hybrid Manager (HM) across multiple data centers (multi-DC) provides high availability and disaster recovery for your Postgres database clusters. This setup allows you to manage databases across different geographical locations from a single interface.

Important

This setup covers disaster recovery for your Postgres database clusters only. It excludes recovery of the HM installation itself, which you must back up and restore separately.

Why run HM across multiple data centers?

Deploying HM across multiple data centers gives you high availability and disaster recovery for Postgres database clusters. It allows you to:

  • Survive a database loss (DR): Keep one or two warm secondary Postgres databases ready. If the primary data center is unavailable, promote replicas in the secondary location and restore services.

  • Minimize downtime (RTO): Leverage multiple, geographically distributed write endpoints (with DHA Postgres clusters) or fail over quickly (with single-node, HA, and AHA Postgres clusters) to rapidly restore services during maintenance or unplanned site outages.

  • Protect data (RPO): Continuous replication to a second data center reduces potential data loss compared to single-site backups alone.

  • Reduce blast radius: Faults, misconfigurations, or noisy neighbors in one data center don't take down the other.

  • Meet compliance/sovereignty requirements: Keep copies in a specific region or facility while still centralizing control.

  • Operate at scale: Split read traffic, stage upgrades, or run blue/green cutovers across data centers.

  • Centralize management: A single primary HM manages databases across all remote Kubernetes clusters, simplifying administration while providing geo-redundancy.

Architecture and roles: Hub and Spoke

HM uses a Hub and Spoke model to manage multiple data centers. It is important to distinguish between the control plane (the management software) and the data plane (your actual Postgres databases).

The primary (The Hub)

The primary location is the "source of truth" and the only site where the full HM control plane is active. The following processes are owned by the primary or hub:

  • HM console (UI): Your "single pane of glass." You log into the primary's URL to manage databases across all regions.

  • Central scheduler: The primary decides where database nodes should live and sends provisioning instructions to the secondaries.

  • Identity provider: Acts as the root of trust for SPIRE, issuing secure identities to all secondary locations.

  • Observability hub: Uses Thanos and Loki to aggregate metrics and logs from every location into a single view.

The secondaries (The Spokes)

Secondaries are "lean" installations designed for execution rather than management. They don't run their own independent consoles, but they do run components that enable communication with, and management from, the primary (hub):

  • HM-internal Beacon agent: Instead of a UI, secondaries run the HM-internal Beacon agent. This agent heartbeats back to the primary, making local Kubernetes resources (nodes, storage) available to the central scheduler.

  • Local execution: When you deploy a cluster to a secondary location, the primary instructs the secondary’s local operators to provision the Postgres pods.

Data plane & storage

While the control plane is Hub-and-Spoke, your data stays synchronized across the environment:

  • Postgres clusters: You can deploy a primary Postgres cluster in DC-A and asynchronous replicas in DC-B and DC-C. For Distributed High Availability (PGD), you can spread data groups and witness nodes across all three locations.

  • Unified storage: All locations must share a consistent object store configuration. This ensures that backups and WAL files generated in one data center are immediately available for recovery in another.
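As an illustrative sketch of this requirement, the shared object store is typically wired in through a secret applied identically in every location. The secret name (edb-object-storage) appears in the checklist below; the key names, endpoint, and bucket here are assumptions for the sketch, not HM defaults:

```yaml
# Illustrative only: the secret name comes from this guide (edb-object-storage);
# the field names and values below are placeholders for this sketch.
apiVersion: v1
kind: Secret
metadata:
  name: edb-object-storage
  namespace: default          # use the namespace your HM installation expects
type: Opaque
stringData:
  # Every location must reference the SAME bucket and credentials so that
  # backups and WAL archives written in one DC are readable in the others.
  endpoint: https://s3.example.internal
  bucket: hm-postgres-backups
  accessKey: <access-key>
  secretKey: <secret-key>
```

Apply the same manifest (same bucket, same credentials) in the primary and in every secondary before enabling backups.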

Topologies

You can deploy the multi-DC setup with the following topologies:

  • Primary and data-only HM: This is the basic setup with one primary HM and at least one data-only HM. It allows for Postgres cluster deployments across two locations.

  • Consolidated management with multiple data-only HMs: Manage databases across multiple remote Kubernetes clusters from a single primary HM. This architecture provides a unified control plane while maintaining a lightweight, data-only footprint in each target location.

  • Three HMs for PGD: For optimal PGD deployments (Distributed and Advanced High Availability Postgres clusters), especially with two data groups and a witness group, we recommend a three-location topology. This configuration ensures quorum for proper distributed consensus.

Resilience and failover

Because you can mix and match different database topologies across your data centers, the impact of a site failure varies. Failover is performed per database cluster, not for the entire Hybrid Manager environment at once. Here are some examples:

| Database topology | Setup | Recovery action | RTO (Time) | RPO (Data) |
| --- | --- | --- | --- | --- |
| Single-node | No replicas, located in DC-A. | Full restore: must recreate the cluster from object storage/backup. | High: depends on database size and restore speed. | Medium: up to the last backup/WAL archive. |
| Single-node + replica | Single node in DC-A, replica in DC-B. | Manual promotion: promote the replica in DC-B to a standalone primary. | Low: driven by runbook and DNS updates. | Very low: seconds (based on async replication lag). |
| HA cluster | Primary/replicas in DC-A, replica in DC-B. | Manual promotion: promote the cross-DC replica in DC-B to primary. | Low: driven by runbook and DNS updates. | Very low: seconds (based on async replication lag). |
| DHA cluster | Data groups in DC-A and DC-B, witness in DC-C. | Automatic: PGD maintains quorum; surviving nodes stay writable. | Near zero: instant, limited only by DNS/LB TTL. | Zero with synchronous commit; otherwise near zero. |

The recovery differences depend on both the multi-DC setup and the topology you chose for each specific database cluster.

Local HA vs. Multi-DC Failover: From a disaster recovery perspective, both single-instance and HA clusters require promotion to a secondary location if the primary site fails. However, an HA cluster (with local replicas) provides internal resilience: it can automatically recover from a primary instance failure within the same data center. A single-instance deployment, lacking a local standby, must fully restart the database or undergo a cross-location failover to recover from any local interruption.

RTO (Recovery Time Objective): This is your downtime. For any manual promotion (Single-node + Replica or HA cluster), your RTO is limited by how fast you can run your failover playbook.

RPO (Recovery Point Objective): This is your data loss. For async replication, which is standard for multi-DC, this typically results in an RPO of a few seconds. In the case of sync replication, you can achieve zero RPO, but this adds latency to every write as it waits for the remote data center to acknowledge the data.

Network latency in Distributed High Availability (DHA) clusters

There's an inherent trade-off between how quickly Distributed High Availability clusters (powered by PGD) can react to consensus issues and the amount of network latency present. Higher latency requires more relaxed consensus settings, potentially increasing the time to detect and respond to failures. Some Raft settings may not be directly configurable through the HM console.

Latency is less of a concern in on-premises, tightly coupled environments compared to cloud environments that span regions. You can configure more relaxed Raft settings to accommodate high-latency scenarios.

You can use the pg CLI to interact with the cluster and adjust configuration settings that aren't exposed in the console, particularly Raft settings. Use the bdr.run_on_nodes function to execute commands on other nodes.
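For instance, a relaxed election timeout can be fanned out from any node. This is a sketch: bdr.run_on_all_nodes and bdr.run_on_nodes are PGD functions, but the exact Raft GUC names and safe values (bdr.raft_global_election_timeout is used here as an example) depend on your PGD version and latency profile:

```sql
-- Connect to any PGD node with psql, then fan the change out to every node.
-- bdr.run_on_all_nodes() runs the query on all nodes and returns the results;
-- bdr.run_on_nodes() targets a specific list of nodes instead.
SELECT bdr.run_on_all_nodes($$
  ALTER SYSTEM SET bdr.raft_global_election_timeout = '30s'
$$);

-- Reload the configuration so the new setting takes effect without a restart.
SELECT bdr.run_on_all_nodes($$SELECT pg_reload_conf()$$);
```

Verify the effective value on each node afterwards (for example with SHOW) before relying on the new timeout during a failure test.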

HM backups and DR for multi-DC environments

HM uses two methods to protect your data. In a multi-DC setup, each method has a distinct role:

  • Barman (Cross-DC): Backups are stored in a global object store. This is your primary tool for disaster recovery across locations.

  • Volume snapshots (Local): Disk-based backups local to a specific data center. Use these for fast point-in-time recovery within a healthy site.
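The two methods above map onto distinct backup types in CloudNativePG-style operators, which HM's local execution layer resembles. The fragment below is an illustrative sketch only; confirm the apiVersion and fields your HM version's operators actually accept:

```yaml
# Illustrative sketch: a one-off local snapshot backup, CloudNativePG-style.
# Verify the apiVersion/kind against your HM operator before using.
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pre-upgrade-snapshot
spec:
  method: volumeSnapshot      # local and fast; use barmanObjectStore
                              # for cross-DC object-store backups instead
  cluster:
    name: my-postgres-cluster # hypothetical cluster name
```

Use the snapshot method for quick in-site restores and keep object-store (Barman) backups as the cross-DC safety net.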

To ensure you can recover both your Hybrid Manager control plane and your Postgres databases during a site outage, verify the following:

  • Ensure your primary HM metadata (configurations, users, and cluster definitions) is backed up and stored outside the primary data center.

  • All HM instances (primary and secondaries) must have an identical edb-object-storage secret so they can access the shared backups.

  • You must configure automated backup schedules for each location individually; settings don't sync automatically between sites.

  • Ensure your organization knows how to restore a database using the CLI if the HM console is down due to a primary site outage.
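Because schedules don't sync between sites, a per-location schedule means creating an equivalent resource in each data center. As an illustrative sketch in the CloudNativePG-style API (an assumption; check the resource kinds your HM operators expose), a nightly schedule in one location might look like:

```yaml
# Illustrative sketch: create an equivalent ScheduledBackup in EACH location.
# Verify the apiVersion/kind against your HM operator before using.
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: nightly-dc-a
spec:
  schedule: "0 0 2 * * *"     # six-field cron: 02:00 every night
  cluster:
    name: my-postgres-cluster # hypothetical cluster name
```

Repeat with a location-specific name (for example, nightly-dc-b) in every other data center, since each site schedules its own backups.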

Current limitations

These are the current limitations for multi-DC deployments:

  • Up to three locations (a primary and one or two secondaries).

  • Manual failover: Promote replicas in one of the secondary locations if the primary is down.

  • Same cloud/on-prem family: Cross-CSP multi-DC is not supported.