Disaster recovery overview

This documentation covers the current Innovation Release of EDB Postgres AI. See also:

Hybrid Manager dual release strategy
Documentation for the current Long-term support release

A disaster can make Hybrid Manager (HM) unusable. Examples include the unavailability of the underlying Cloud Service Provider (CSP) region or a data center outage that takes hardware offline.

A disaster recovery (DR) option lets you restore your databases to a point in time from your available backups. It involves recovering both the control plane (the HM console and HM components) and the data plane (your HM-managed Postgres clusters) after a catastrophic failure.

Unlike High Availability (HA), which provides real-time failover for Postgres databases, DR is a manual restoration process that you initiate when the original HM environment is unrecoverable.

Recovery scenarios

Restore HM to the original location: You have two data centers (DC1, DC2), and HM runs in DC1. You need to restore HM from object storage to DC1.
Restore HM to an alternative location: You have two data centers (DC1, DC2), and HM runs in DC1. You need to restore HM from object storage to DC2 because DC1 is down for an indeterminate period of time. The alternative location (DC2) must have access to the same (or replicated) object storage used by DC1.

Core components

DR relies on three components:

Velero — An open-source tool included with your HM installation that backs up Kubernetes resources (Definitions, Secrets, and ConfigMaps). It captures the configuration and state of your HM installation, as well as the definition for Postgres database clusters created with it.
Object storage — An S3-compatible bucket (AWS S3, GCS, etc.) that holds the Velero backups for the HM installation and its Postgres database clusters.
HM console — Use it to re-provision Postgres database clusters and bring back their data.

Scope: the two-layer restore strategy

To understand DR, you must distinguish between the two layers of data being recovered:

Layer	Content	Recovery method
Control plane	HM console and its applications, general settings, user accounts, and database definitions.	Velero restore: reinstalls Kubernetes definitions into the new HM installation.
Data plane	Actual Postgres tables, rows, and indexes from database clusters created with HM.	HM console restore: uses point-in-time recovery (PITR) to roll data forward.

The backups for both data layers are stored in an S3-compatible bucket managed with Velero. You define the location and frequency of the backups during the initial HM configuration and installation.

Data recovery considerations

Migrations: DR procedures don’t restore active migration functionality. While the HM-internal transporter-db and migration-db databases are recovered, preserving migration history for audit and record-keeping, migrations that were active and impacted by the disaster will need to be re-created.
Query diagnostics and recommendations: Data displayed in the HM console > Query Diagnostics and Recommendations tabs of a cluster's view is not recoverable after a disaster. This data is intentionally excluded from backups as it is transient. After HM is restored, it will automatically begin generating new data that reflects the current state of your Postgres clusters.
Monitoring: Historical metrics (Thanos data)—displayed in the HM console > Monitoring tab of a cluster's view—are recoverable. Metrics will reappear automatically following the disaster recovery procedure.

Workflow: from outage to recovery

When a disaster occurs, the recovery follows this sequence:

Infrastructure preparation — Deploy a fresh Kubernetes cluster in a healthy region or data center, and deploy HM to it.
Environment restore — Use Velero to pull the HM configuration and Postgres database cluster definitions from your bucket. This brings back the HM console and the records of your previous databases.
Database re-provisioning — Log in to the restored HM console. Databases will appear as Deleted or Missing. Use the Restore function in the HM console to spin up new Postgres database clusters that point to the original data in object storage.

Data freshness

While Velero restores the cluster definitions as of the last backup, the re-provisioning process replays the Postgres write-ahead log (WAL) to perform a point-in-time recovery (PITR). This lets you recover data that is much more recent than your last Velero backup.
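
To make the difference between the two recovery points concrete, the following minimal sketch compares them for a single outage. The timestamps and the function name are hypothetical, and the sketch assumes WAL was archived to object storage until shortly before the outage. It illustrates the arithmetic only and isn't how HM computes anything internally.

```python
from datetime import datetime, timedelta


def data_loss_window(outage: datetime, recovery_point: datetime) -> timedelta:
    """Return how much recent history is lost when restoring to recovery_point."""
    if recovery_point > outage:
        raise ValueError("recovery point cannot be later than the outage")
    return outage - recovery_point


# Hypothetical timeline for one outage (all values are assumptions).
outage = datetime(2025, 1, 10, 14, 0)
last_velero_backup = datetime(2025, 1, 10, 2, 0)   # control plane: cluster definitions
last_archived_wal = datetime(2025, 1, 10, 13, 55)  # data plane: Postgres data

definition_loss = data_loss_window(outage, last_velero_backup)
data_loss = data_loss_window(outage, last_archived_wal)

print(f"Cluster definitions may be up to {definition_loss} old")
print(f"Table data may be up to {data_loss} old")
```

In this example, a cluster created after the last Velero backup wouldn't have its definition restored, while the data for clusters that existed at backup time can be rolled forward to within minutes of the outage.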

Critical requirements for DR success

For this architecture to work, the following must be true before a disaster:

Offsite backups: Your S3-compatible bucket must be replicated to a different region or physical site. If the bucket is lost along with the data center, DR is impossible. See object storage for more information.
Consistent credentials: The new HM instance must have the same IAM permissions or secret-based access to the original object storage bucket so that it can read the existing backups.

Recovery vs. Resilience

Remember that DR doesn't provide zero downtime. The recovery time you can achieve, and therefore a realistic recovery time objective (RTO), is largely determined by how long it takes to provision new infrastructure and by the network speed available when pulling database backups and WAL files from object storage.
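
As a rough illustration of how these two factors combine, the sketch below adds a provisioning time to a transfer time derived from data volume and network throughput. The function name and all figures are hypothetical. The estimate deliberately ignores WAL replay time, the Velero restore itself, and human response time, so treat it as a lower bound rather than a prediction.

```python
def estimate_recovery_hours(
    provision_hours: float,
    restore_gb: float,
    throughput_gb_per_hour: float,
) -> float:
    """Rough lower bound on recovery time: provisioning plus data transfer."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    transfer_hours = restore_gb / throughput_gb_per_hour
    return provision_hours + transfer_hours


# Hypothetical figures: 2 hours to stand up Kubernetes and HM,
# 500 GB of backups and WAL to pull at 250 GB per hour.
recovery_hours = estimate_recovery_hours(
    provision_hours=2.0,
    restore_gb=500.0,
    throughput_gb_per_hour=250.0,
)
print(f"Estimated recovery time: at least {recovery_hours:.1f} hours")
```

If the estimate exceeds your RTO, the levers suggested by this model are faster infrastructure provisioning or more bandwidth between the recovery site and object storage.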

Next topic

Configure backups to support your RTO and RPO times.
