Disaster recovery overview v1.3.5

A disaster can make Hybrid Manager (HM) unusable. Examples include the underlying Cloud Service Provider (CSP) region becoming unavailable, or a data center outage that takes the hardware offline.

A disaster recovery (DR) option allows you to restore your databases to a point in time from your available backups. It involves recovering both the control plane (the HM console and HM components) and the data plane (your HM-managed Postgres clusters) after a catastrophic failure.

Unlike high availability (HA), which provides real-time failover for Postgres databases, DR is a manual restoration process that you initiate when the original HM environment is unrecoverable.

Recovery scenarios

  • Restore HM to the original location: You have two data centers (DC1, DC2), and HM runs in DC1. You need to restore HM from object storage to DC1.

  • Restore HM to an alternative location: You have two data centers (DC1, DC2), and HM runs in DC1. You need to restore HM from object storage to DC2 because DC1 is down for an unknown period of time. The alternative location (DC2) must have access to the same (or replicated) object storage used by DC1.

Core components

DR relies on three components:

  1. Velero: An open-source tool included with your HM installation that backs up Kubernetes resources (Definitions, Secrets, and ConfigMaps). It captures the configuration and state of your HM installation, as well as the definition for Postgres database clusters created with it.
  2. Object storage: An S3-compatible bucket (AWS S3, GCS, etc.) that holds the Velero backups for the HM installation and its Postgres database clusters.
  3. HM console: Use it to re-provision Postgres database clusters and bring back their data.
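
Because every later step depends on a usable Velero backup being present, it can help to confirm which backup is the newest completed one. The following is a minimal sketch, not part of the HM tooling. It assumes you have captured the output of `velero backup get -o json` as text, and it relies on the `status.phase` and `status.completionTimestamp` fields of Velero backup objects. The backup names in the sample are hypothetical.

```python
import json
from datetime import datetime, timezone


def latest_completed_backup(velero_json: str):
    """Return (name, completion_time) of the newest Completed backup, or None."""
    data = json.loads(velero_json)
    # A list response wraps backups in "items"; a single backup is a bare object.
    items = data["items"] if "items" in data else [data]
    completed = []
    for item in items:
        status = item.get("status", {})
        stamp = status.get("completionTimestamp")
        if status.get("phase") == "Completed" and stamp:
            # Kubernetes timestamps are UTC, for example 2024-05-01T02:00:00Z.
            when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
            completed.append((when.replace(tzinfo=timezone.utc), item["metadata"]["name"]))
    if not completed:
        return None
    when, name = max(completed)
    return name, when


# Hypothetical sample data standing in for real CLI output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "hm-daily-1"},
         "status": {"phase": "Completed", "completionTimestamp": "2024-05-01T02:00:00Z"}},
        {"metadata": {"name": "hm-daily-2"},
         "status": {"phase": "Completed", "completionTimestamp": "2024-05-02T02:00:00Z"}},
        {"metadata": {"name": "hm-daily-3"},
         "status": {"phase": "PartiallyFailed", "completionTimestamp": "2024-05-03T02:00:00Z"}},
    ]
})

print(latest_completed_backup(SAMPLE)[0])  # prints hm-daily-2
```

The sketch skips backups that aren't in the Completed phase, so a newer but partially failed backup isn't picked as the restore source.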

Scope: the two-layer restore strategy

To understand DR, you must distinguish between the two layers of data being recovered:

| Layer | Content | Recovery method |
|---|---|---|
| Control plane | HM console and its applications, general settings, user accounts, and database definitions. | Velero restore, reinstalls Kubernetes definitions into the new HM installation. |
| Data plane | Actual Postgres tables, rows, and indexes from database clusters created with HM. | HM console restore, leverages Point-in-Time Recovery (PITR) to roll data forward. |

The backups for both data layers are stored in an S3-compatible bucket managed with Velero. You define the location and frequency of the backups during the initial HM configuration and installation.

Note

DR procedures don’t restore active migration functionality. While the transporter-db and migration-db databases are recovered, preserving migration history for audit and record-keeping, migrations that were active and impacted by the disaster will need to be re-created.

Workflow: from outage to recovery

When a disaster occurs, the recovery follows this sequence:

  1. Infrastructure preparation: Deploy a fresh Kubernetes cluster in a healthy region or data center, and deploy HM to it.

  2. Environment restore: Use Velero to pull the HM configuration and Postgres database cluster definitions from your bucket. This brings back the HM console and the records of your previous databases.

  3. Database re-provisioning: Log in to the restored HM console. Databases will appear as Deleted or Missing. Use the Restore function in the HM console to spin up new Postgres database clusters that point to the original data in object storage.

    Data freshness

    While Velero restores the cluster definition as it was at the time of the last backup, the re-provisioning process uses Postgres write-ahead log (WAL) files to perform a Point-in-Time Recovery. This allows you to recover data that is much more recent than your last Velero snapshot.
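
The difference between the two recovery points can be illustrated with a small worked example. This is a sketch under stated assumptions: the timestamps are invented, the function name is hypothetical, and it assumes that the most recent WAL file archived to object storage marks the latest point to which data can be rolled forward.

```python
from datetime import datetime, timezone


def recovery_gaps(disaster_at, last_velero_backup, last_archived_wal):
    """Return how far behind each layer is at the moment of the disaster."""
    return {
        # Definitions and settings come back as of the last Velero backup.
        "control_plane_gap": disaster_at - last_velero_backup,
        # Table data can be rolled forward to the last archived WAL file.
        "data_plane_gap": disaster_at - last_archived_wal,
    }


# Invented timestamps for illustration only.
disaster = datetime(2024, 5, 2, 14, 30, tzinfo=timezone.utc)
velero_backup = datetime(2024, 5, 2, 2, 0, tzinfo=timezone.utc)
last_wal = datetime(2024, 5, 2, 14, 25, tzinfo=timezone.utc)

gaps = recovery_gaps(disaster, velero_backup, last_wal)
print(gaps["control_plane_gap"])  # prints 12:30:00
print(gaps["data_plane_gap"])     # prints 0:05:00
```

In this example, configuration changes made in the 12.5 hours after the Velero backup aren't restored, while table data is recoverable to within about five minutes of the outage.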

Critical requirements for DR success

For this architecture to work, the following must be true before a disaster:

  • Offsite backups: Your S3-compatible bucket must be replicated to a different region or physical site. If the bucket is lost along with the data center, DR is impossible. See object storage for more information.
  • Consistent credentials: The new HM instance must have the same IAM permissions or access secrets for the original object storage bucket so that it can "see" the old backups.
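
One way to test the credentials requirement ahead of time is to check, from the recovery site, that objects under the backup prefix can be listed. The sketch below is illustrative rather than an HM feature. It assumes an S3 client object with a boto3-style `list_objects_v2` method, and it assumes the backups sit under a `backups/` prefix, which you should adjust to match your bucket. The `FakeS3` class, bucket name, and object key are hypothetical stand-ins so the example runs without credentials.

```python
def backups_visible(s3_client, bucket: str, prefix: str = "backups/") -> bool:
    """Return True if at least one object can be listed under the prefix."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    # S3 omits the "Contents" key when nothing matches the prefix.
    return len(response.get("Contents", [])) > 0


class FakeS3:
    """Stand-in for a real S3 client so the sketch runs without credentials."""

    def __init__(self, keys):
        self.keys = list(keys)

    def list_objects_v2(self, Bucket, Prefix, MaxKeys):
        matches = [k for k in self.keys if k.startswith(Prefix)][:MaxKeys]
        return {"Contents": [{"Key": k} for k in matches]} if matches else {}


client = FakeS3(["backups/hm-daily-2/velero-backup.json"])
print(backups_visible(client, "hm-dr-bucket"))       # prints True
print(backups_visible(FakeS3([]), "hm-dr-bucket"))   # prints False
```

With a real client, a permissions problem typically surfaces as an exception from the listing call instead of a False result, which is itself a sign that the credentials at the recovery site don't match the original ones.
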

Recovery vs. resilience

Remember that DR doesn't provide "zero downtime." Your recovery time objective (RTO), that is, how long you are offline, is largely determined by the time it takes to provision new infrastructure and the speed of your network when pulling database logs.
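
A rough estimate makes these factors concrete. The following is a back-of-the-envelope sketch, not a sizing formula from HM. The function name, the separate allowance for the Velero restore, and all of the numbers are assumptions for illustration, and the estimate ignores the time Postgres spends replaying WAL.

```python
def estimate_rto_minutes(provision_min, velero_restore_min, restore_gib, throughput_mib_s):
    """Estimate downtime as provisioning + Velero restore + data transfer time."""
    # Convert GiB to MiB, then divide by sustained throughput to get seconds.
    transfer_seconds = (restore_gib * 1024) / throughput_mib_s
    return provision_min + velero_restore_min + transfer_seconds / 60


# Assumed values: 45 minutes to provision Kubernetes and HM, 10 minutes for the
# Velero restore, and 200 GiB of backup and WAL data pulled at 100 MiB/s.
estimate = estimate_rto_minutes(45, 10, 200, 100)
print(round(estimate, 1))  # prints 89.1
```

Halving the network throughput in this example adds about 34 minutes, which shows why the link between the recovery site and object storage matters as much as provisioning time.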

Next topic

Configure backups to support your RTO and RPO times.