HA/DR planning and best practices v1.3.2
Overview
You can configure a multi-data center topology with Hybrid Manager (HM) and EDB Postgres Distributed (PGD) deployments to meet your high availability and disaster recovery (HA/DR) requirements. This topology includes a primary HM and one or more data-only HMs across multiple Kubernetes clusters.
In this HA environment, the primary HM is a standard deployment. The data-only HM is essentially the same HM deployment, with backend adjustments to connect it to the primary HM and to prevent accidental local provisioning. The Agent and the Beacon server facilitate this connection, handling communication and certificate passing between HMs.
In the event that one cluster goes down, your applications can be pointed to another cluster with synchronized data, ensuring continuous operation. A multi-data center setup provides redundancy in case of a data center outage. A central HM can manage databases across multiple remote Kubernetes clusters, simplifying database administration.
Topologies
You can deploy the multi-HM setup with the following topologies:
- Primary and data-only HM: This is the basic setup with one primary HM and at least one data-only HM. It allows for PGD deployments across two or more Kubernetes clusters.
- Multiple data-only HMs: The system can support multiple data-only HMs, allowing a single primary HM to manage databases across many Kubernetes clusters. This is useful for organizations with numerous application teams, each with its own cluster.
- Three HMs for PGD: For optimal PGD deployments, especially with two data groups and a witness group, we recommend a three-HM topology. This configuration provides an odd number of groups, which distributed consensus needs to maintain a quorum.
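After deploying one of these topologies, you can confirm that every node has joined its group and that Raft consensus is healthy by querying the PGD catalog from any data node. The following is a minimal sketch; the connection details are placeholders, and the view and function names (bdr.node_summary, bdr.monitor_group_raft) are taken from PGD 5, so confirm them against the documentation for your PGD version.

```shell
# Sketch: check node membership and Raft health from any PGD data node.
# Connection details are placeholders; the catalog names assume PGD 5.
psql "host=<pgd-node> dbname=<database> user=<user>" <<'SQL'
-- List every node, the group it belongs to, and its current state.
SELECT node_name, node_group_name, peer_state_name
FROM bdr.node_summary;

-- Report whether the Raft consensus layer currently has a working quorum.
SELECT * FROM bdr.monitor_group_raft();
SQL
```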
Multi-data center deployments include a primary data center containing the main HM instance and a secondary data center, potentially with a data-only HM or other Kubernetes clusters, along with a global object store accessible from multiple data centers.
Note
HM version 1.3 supports deploying multiple HMs in different locations, with a limit of two locations. Other advanced topologies are planned for future versions.
Note
Cross data center HA/DR: High availability and disaster recovery for the control plane across data centers or regions isn't currently supported. Disaster recovery via backup and restore is the only option.
Requirements
The following requirements apply when configuring an HA topology:
- Two or more existing Kubernetes clusters, for example, using Amazon EKS or Red Hat OpenShift.
- A fully deployed primary HM on one of the Kubernetes clusters.
- A data-only HM installed on the other Kubernetes clusters, using a provided script for configuration.
- When spanning databases across multiple clusters, you must use PGD.
- Witness nodes in PGD require a minimum of 10 GB of disk space.
- A high-performance networking layer for cloud and multi-region deployments. EKS simplifies this through its integration with the AWS ecosystem, while OpenShift may require more complex networking configurations, which can lead to connectivity problems between clusters. Ensure the network setup allows communication between PGD nodes. High network latency can make PGD clusters unstable if Raft settings aren't properly tuned; symptoms include slow failure detection and recovery. Adjust Raft-related settings to accommodate higher network latency.
Network latency
There's an inherent trade-off between how quickly PGD can react to consensus issues and the amount of network latency present. Higher latency requires more relaxed consensus settings, potentially increasing the time to detect and respond to failures. Some Raft settings may not be directly configurable through the HM console.
Latency is less of a concern in on-premises, tightly coupled environments compared to cloud environments that span regions. You can configure more relaxed Raft settings to accommodate high-latency scenarios.
You can use the PGD CLI to interact with the cluster and adjust configuration settings that aren't exposed in the console, particularly Raft settings. Use the bdr.run_on_nodes() or bdr.run_on_all_nodes() functions to execute commands on the other nodes.
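As an illustration, the following sketch inspects the Raft-related settings on every node at once using bdr.run_on_all_nodes(). The connection details are placeholders, and it assumes the Raft parameters are exposed as bdr.raft* settings, as in PGD 5; check your version's documentation for the exact names.

```shell
# Sketch: show the current Raft-related settings on all PGD nodes.
# bdr.run_on_all_nodes() runs the quoted query on every node and returns JSON results.
psql "host=<pgd-node> dbname=<database> user=<user>" <<'SQL'
SELECT bdr.run_on_all_nodes($$
  SELECT name, setting FROM pg_settings WHERE name LIKE 'bdr.raft%'
$$);
SQL
```

If the values are too aggressive for your latency profile, raise the election and keepalive timeouts on each node (for example, with ALTER SYSTEM), and check whether the specific parameter needs a reload or a restart to take effect.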
Kubernetes and failovers
Kubernetes handles failovers and redundancy within HM. The following configurations are supported:
- Single cluster, single zone: A basic Kubernetes setup with one cluster and one zone, containing multiple nodes and pods.
- Single cluster, multiple zones: A Kubernetes cluster divided into multiple zones, where a zone could represent a separate rack or data center. This is referred to as a stretch cluster.
- Multi-cluster, multi-location: Multiple Kubernetes clusters located in different data centers, potentially geographically distant.
These configurations achieve an HA posture by ensuring continuous service availability within a cluster or across zones through redundancy and failover mechanisms. DR is achieved by recovering services and data in the event of a larger failure, such as a data center outage, using backups and data replication across different locations.
Recommendations
The following recommendations apply when using Kubernetes:
- While HM aims to simplify HA/DR, understanding Kubernetes concepts such as nodes, pods, zones, and clusters is essential for HA/DR planning.
- Adequate resources within the Kubernetes cluster are necessary for pods to be rescheduled in case of node failures.
- Separate HM and PostgreSQL workloads onto different worker nodes (see the example after this list).
- Careful planning for storage is crucial, including the type of storage (local, shared, object), its location, redundancy, and backup/restore procedures. We highly recommend object storage for disaster recovery scenarios, especially for data center failures.
- For stretch clusters, network latency must stay within acceptable limits (OpenShift, for example, recommends less than 5 milliseconds), and you must account for witness node placement.
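As an example of the workload-separation recommendation, you can reserve worker nodes for PostgreSQL by labeling and tainting them. The node names and the label/taint key below are arbitrary placeholders, and the scheduling only takes effect if the HM and operator configuration reference a matching nodeSelector and tolerations.

```shell
# Example only: dedicate two worker nodes to PostgreSQL workloads.
# Node names and the workload=postgres key/value are placeholders.
kubectl label nodes worker-3 worker-4 workload=postgres
kubectl taint nodes worker-3 worker-4 workload=postgres:NoSchedule

# Verify the labels and taints.
kubectl get nodes -L workload
kubectl describe node worker-3 | grep -A 2 Taints
```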
Considerations
Keep in mind the following considerations when using Kubernetes:
- If worker nodes use local storage, data isn't automatically failed over when a node fails. Recovery requires restoring from backup.
- Utilizing appropriate storage solutions, such as shared storage (for example, Ceph or a SAN) or object storage, is critical for data persistence and recovery.
- HM services might not automatically fail over in a multi-zone scenario if their data is tied to a specific zone.
- Data availability in failover scenarios depends on the location and replication of the storage.
- Kubernetes can handle pod rescheduling in the event of node failures, but data recovery depends on the storage configuration.
- Failover to another zone or data center requires careful planning for data replication and application redirection.
- Lack of shared or replicated storage can hinder recovery.
- HM leverages Kubernetes capabilities for HA, but it's important to consider HM-specific configurations.
HM backups and DR
HM multi-data center environments use the following to achieve a robust data protection strategy:
- Volume snapshots: Disk-based backups, typically local to the data center.
- Barman: Object storage-based backups, which can be stored in a global object store.
Recommendations
Keep in mind the following recommendations regarding HM backups:
- We recommend a global object storage solution for storing Barman backups. This approach helps ensure backups remain available across data centers (see the check after this list).
- Follow the 3-2-1 backup strategy (three copies of data, two different media, one offsite copy) by using both volume snapshots and Barman backups.
- Consider replicating backup metadata across Kubernetes clusters to simplify restores in disaster recovery scenarios.
- Automate both Barman and volume snapshot backups where possible.
- Develop a disaster recovery plan for HM in addition to your PostgreSQL databases.
- When working with on-premises deployments, consider network configurations for connecting multiple Red Hat OpenShift instances.
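A practical sanity check for the global object store recommendation is to confirm from the secondary data center that the Barman backup objects are reachable. This sketch assumes an S3-compatible store and a placeholder bucket and path layout; substitute your actual endpoint, credentials, and prefix.

```shell
# Example only: from a host in the secondary data center, list the Barman backup
# objects in the global object store. Bucket name and prefix are placeholders.
aws s3 ls s3://<backup-bucket>/<cluster-name>/ --recursive --human-readable | head -n 20
```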
Considerations
Keep in mind the following considerations regarding HM backups:
- Restoring backups when HM is unavailable in a secondary data center isn't trivial.
- Backup metadata is stored within the Kubernetes cluster where the backup is initiated. If that cluster is lost, access to the metadata is also lost, complicating restores.
- Restoring backups in a secondary data center without the primary HM may require manual steps.
- A primary HM instance is required for managing backups and restores.
- Automated backups can be sent to only one location. The system doesn't automatically replicate backup definitions across locations.
- You can set up Barman or volume snapshots for automatic backups but not both concurrently through the HM console.
- If the primary HM is unavailable, restoring backups may require CLI interaction and manual configuration.
- Recovering data in a secondary data center requires access to the backup data and the ability to initiate a restore, even without the primary HM.
- Backup metadata is local to each Kubernetes cluster, so backups taken in one cluster aren't automatically visible in another.
Managing backups and restores
The HM console provides tools for configuring and managing backups and restores. You can also use the API to initiate backups to specific locations. You can use kubectl to view backups in a Kubernetes cluster.
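For example, here's a minimal sketch of viewing backups with kubectl. It assumes each backup is represented as a Backup custom resource in the namespace that hosts the database cluster, which is how the underlying EDB Postgres for Kubernetes operator exposes them; the exact resource and namespace names depend on your deployment.

```shell
# Example only: list backup resources in the namespace hosting the database cluster.
# Namespace and backup names are placeholders.
kubectl get backups -n <database-namespace>

# Inspect a single backup's phase, method, and timestamps.
kubectl describe backup <backup-name> -n <database-namespace>
```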
For information on how to configure backups and restores of your HM cluster, see HM backup and recovery.