Hybrid Manager disaster recovery v1.3.5
The Disaster Recovery (DR) procedure is defined as the series of manual steps that you need to take to recover your HM installation and your HM-managed Postgres clusters.
Warning
You must regularly test and update your organization's DR procedure for it to remain valid.
Before you start
Before starting the DR procedure, ensure you are familiar with the prerequisites and required tools listed below.
Prerequisites
A new HM instance deployed and running. It must be running the same version as the old instance that failed or became unavailable.
The container images used to build the clusters in the old, unavailable HM instance must be available to the new one.
Required tools
Ensure the following tools are available on your workstation environment or Bastion host:
- jq (command-line JSON processor)
- yq (command-line YAML processor)
1. Make backups available in the new HM instance
This first step makes the backups of the unavailable HM instance ("old backups") reachable from the new HM instance by copying them to the linked storage (new bucket) of the new HM instance.
Obtain the bucket names, the backup ID, and the region of the new bucket. Store these as environment variables to be used in the commands throughout this guide:
Important
These variables are session-specific. If you open a new terminal tab or your session times out, the variables will be lost, and subsequent commands will fail. To avoid re-typing these values, you could save these export commands into a small shell script (e.g., dr-env.sh). You can then source that file in any new terminal window to instantly reload your environment.

export OLD_BUCKET=<old_bucket>
export OLD_BACKUP_ID=<old_environment_internal_backup_id>
export NEW_BUCKET=<new_bucket>
export NEW_REGION=<region_of_the_new_bucket>
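For example, a minimal dr-env.sh along those lines could look like the following sketch (the filename is only a suggestion; replace the placeholders with your own values):

# dr-env.sh -- re-export the DR environment variables in a fresh terminal session
# Usage: source ./dr-env.sh
export OLD_BUCKET=<old_bucket>
export OLD_BACKUP_ID=<old_environment_internal_backup_id>
export NEW_BUCKET=<new_bucket>
export NEW_REGION=<region_of_the_new_bucket>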
How do I obtain the old bucket values?
To obtain the old bucket values:
- Go to the console/dashboard of your CSP > buckets.
- Find and select the bucket linked to the backups of your old HM instance.
- Browse through to the edb-internal-backups folder. Inside that folder, you will find a subfolder named with the backup ID, e.g., 4be7a1c8c9f0.
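Alternatively, if you have CLI access to the old bucket, you can list the edb-internal-backups prefix to find the backup-ID subfolder. An AWS example (assuming the AWS CLI is configured with access to the old bucket):

# The subfolder name under edb-internal-backups/ is the backup ID
aws s3 ls s3://<old_bucket>/edb-internal-backups/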
EKS Example
This is an example for setting the environment variables for an HM instance deployed on EKS:
export OLD_BUCKET=eks-1105143903-2511-edb-postgres
export OLD_BACKUP_ID=a7462dbc7106
export NEW_BUCKET=eks-1105155418-2511-edb-postgres
export NEW_REGION=eu-west-3
To copy the data from the old bucket to the new bucket, you first need to locate and note the names of the source and target folders. You need to copy the following folders and their content:
- Internal EDB backups folder — The internal backups folder in the old bucket, edb-internal-backups/<random-string>, is different in the new HM instance, as it will have a different <random-string>.
- Postgres clusters backups folder — customer-pg-backups.
- Folders corresponding to any defined custom storage locations — If you utilize Managed Storage Locations in the HM console (e.g., for offloading Postgres queries), you must ensure the corresponding folders are copied from the old S3-compatible bucket to the new one. While the definitions are restored via Velero, the actual data inside those custom folders must be manually migrated to the new target bucket.
Copy the old backups to the new bucket using your preferred tools. Here are some examples using cloud service provider CLIs to move data between buckets:
For AWS (S3):

aws s3 cp --recursive s3://${OLD_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID} s3://${NEW_BUCKET}/edb-internal-backups
aws s3 cp --recursive s3://${OLD_BUCKET}/customer-pg-backups s3://${NEW_BUCKET}/customer-pg-backups
For Google Cloud:

gcloud storage cp gs://${OLD_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID} gs://${NEW_BUCKET}/edb-internal-backups --recursive
gcloud storage cp gs://${OLD_BUCKET}/customer-pg-backups gs://${NEW_BUCKET}/customer-pg-backups --recursive
If you have configured additional Managed Storage Locations, use the same method to copy those folders.
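Optionally, sanity-check the copy before continuing, for example by comparing object counts for a copied prefix. An AWS example (the same idea works with gcloud storage ls):

# The two counts should match for each copied folder
aws s3 ls --recursive s3://${OLD_BUCKET}/customer-pg-backups | wc -l
aws s3 ls --recursive s3://${NEW_BUCKET}/customer-pg-backups | wc -l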
Load the backups you just copied into your new HM instance by creating a new BackupStorageLocation custom resource and applying it to the new HM instance. Use the manifest that matches your cloud provider:
For AWS (S3):

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    appliance.enterprisedb.com/s3-prefixes: edb-internal-backups/velero
  labels:
    appliance.enterprisedb.com/storage-credentials: bound
  name: recovery
  namespace: velero
spec:
  accessMode: ReadOnly
  config:
    insecureSkipTLSVerify: "false"
    region: ${NEW_REGION}
    s3ForcePathStyle: "true"
  default: false
  objectStorage:
    bucket: ${NEW_BUCKET}
    prefix: edb-internal-backups/velero
  provider: aws
EOF
For Google Cloud:

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    appliance.enterprisedb.com/s3-prefixes: edb-internal-backups/velero
  labels:
    appliance.enterprisedb.com/storage-credentials: bound
  name: recovery
  namespace: velero
spec:
  accessMode: ReadOnly
  credential:
    key: gcp
    name: gcs-credentials
  default: false
  objectStorage:
    bucket: ${NEW_BUCKET}
    prefix: edb-internal-backups/velero
  provider: gcp
EOF
Confirm that the new storage location is available:
velero get backup-locations
If the status is not Available, check the Velero pod logs for permission errors on the S3 bucket.

Confirm that the backups are available as well:
velero get backups --selector velero.io/storage-location=recovery
Choose the backup you want to restore from. You can have multiple backups available, so choose the one that best suits your needs, e.g. the most recent backup before the disaster happened. Note the Velero backup name, as well as the date and time (UTC), as both are required for a restore, for example:
NAME                                      STATUS     ERRORS  WARNINGS  CREATED                        EXPIRES  STORAGE LOCATION  SELECTOR
velero-backup-kube-state-20241216154403   Completed  0       0         2024-12-16 16:44:03 +0100 CET  5d       recovery          <none>
Note
The timestamp value is referred to as the recovery date in the instructions that follow.
(Optional) If you were using HM to manage AI workloads, e.g. with the GenAI Builder, also copy the object store files and CORS configuration from the old bucket to the new one:
export OLD_BUCKET_DATALAKE=<old bucket>
export NEW_BUCKET_DATALAKE=<new bucket>
For AWS:

# Copy data lake objects from old bucket to new bucket
aws s3 cp --recursive s3://${OLD_BUCKET_DATALAKE}/ s3://${NEW_BUCKET_DATALAKE}/

# Copy CORS configuration from old bucket to new bucket
aws s3api get-bucket-cors --bucket ${OLD_BUCKET_DATALAKE} --output json > cors-config.json
aws s3api put-bucket-cors --bucket ${NEW_BUCKET_DATALAKE} --cors-configuration file://cors-config.json
For Google Cloud:

# Copy data lake objects from old bucket to new bucket
gcloud storage cp "gs://${OLD_BUCKET_DATALAKE}/**" gs://${NEW_BUCKET_DATALAKE}/ --recursive

# Copy CORS configuration from old bucket to new bucket
gcloud storage buckets describe gs://${OLD_BUCKET_DATALAKE} --format="json" | jq .cors_config > cors-config.json
gcloud storage buckets update gs://${NEW_BUCKET_DATALAKE} --cors-file=cors-config.json
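To confirm that the CORS configuration landed on the new data lake bucket, you can read it back. An AWS example using the same s3api call as above:

# Should print the CORS rules exported from the old bucket
aws s3api get-bucket-cors --bucket ${NEW_BUCKET_DATALAKE}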
2. Recovery steps
Restore HM-internal databases
After the old backups are available in the new bucket, you can restore the HM-internal databases. These are back-end services used by HM and are required to fully restore the HM instance. Depending on the HM version you are using and on the installation scenario you have deployed, the list of databases may vary.
To simplify this process, run the following script with your kubeconfig pointing to your new HM installation:
Script details
This patch script takes care of:
- Saving the HM-internal database cluster manifests to YAML files, while generating two directories:
  - One directory, old-cluster-configs, with the current state of the database clusters in the new HM installation (default configuration after installation).
  - Another one, called new-cluster-configs, with the same files, where the script performs the patches required so that the HM-internal databases start using the data from the backups.
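The script itself is not reproduced here. Purely as an illustration of the first point above, saving the current manifests into the two directories could look roughly like the sketch below; it deliberately omits the recovery patches applied to new-cluster-configs and is not a substitute for the actual script:

# Illustrative sketch only -- it does NOT apply the recovery patches
mkdir -p old-cluster-configs new-cluster-configs
for CLUSTER in $(kubectl get clusters.postgresql.k8s.enterprisedb.io -A -o json \
  | jq -rc '.items[].metadata | select((.name | test("^p-") | not) and (.name != "stats-collector-db")) | "\(.namespace)/\(.name)"')
do
  NAMESPACE=${CLUSTER%%/*}
  NAME=${CLUSTER##*/}
  kubectl -n "$NAMESPACE" get clusters.postgresql.k8s.enterprisedb.io "$NAME" -o yaml \
    > "old-cluster-configs/${NAMESPACE}-${NAME}.yaml"
  # The real script copies each manifest into new-cluster-configs/ and patches it so the
  # cluster bootstraps from the restored backups
  cp "old-cluster-configs/${NAMESPACE}-${NAME}.yaml" "new-cluster-configs/${NAMESPACE}-${NAME}.yaml"
done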
Suspend the reconciliation of all HM-internal database clusters, so that you can safely remove the existing custom resources (CRs) of the database clusters without the operator recreating them by default:
HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')

for CLUSTER in $(kubectl get clusters.postgresql.k8s.enterprisedb.io -A -o json | jq -rc '.items[].metadata | select((.name | test("^p-") | not) and (.name != "stats-collector-db")) | {namespace: .namespace}' | uniq)
do
  NAMESPACE=$(echo "${CLUSTER}" | jq -rc '.namespace')
  INDEX=$(kubectl get hybridcontrolplane ${HCP_CR} -o json | jq '.status.components | to_entries[] | select(.value.name=='\"${NAMESPACE}\"') | .key')
  kubectl patch hybridcontrolplane ${HCP_CR} --subresource=status --type=json -p "[{\"op\": \"replace\", \"path\": \"/status/components/$INDEX/suspended\", \"value\": true}]"
done
Verify that the components have been suspended correctly:
HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')
kubectl hcp status -n edbpgai-bootstrap "${HCP_CR}"
Delete the HM-internal database clusters that were created during installation of the new HM instance to make room for the HM-internal database clusters that will be recovered from the backup:
for CONFIG in $(find new-cluster-configs -type f)
do
  kubectl delete -f $CONFIG
done
Clean the backup area that was created during the installation of the new HM instance to avoid confusion with the old backups that you want to restore:
For AWS (S3):

for CONFIG in $(find new-cluster-configs -type f)
do
  NAME=$(yq '.metadata.name' $CONFIG)
  NAMESPACE=$(yq '.metadata.namespace' $CONFIG)
  # Try to get PREFIX from the cluster config first, falling back to the ObjectStore if it fails
  PREFIX=$(yq '.spec.backup.barmanObjectStore.destinationPath | downcase' $CONFIG 2>/dev/null)
  if [ -z "$PREFIX" ] || [ "$PREFIX" = "null" ]; then
    # destinationPath doesn't exist in the cluster config (after the CNPG-I barman plugin migration),
    # so get it from the ObjectStore resource instead
    PREFIX=$(kubectl -n $NAMESPACE get objectstores.barmancloud.cnpg.io $NAME -o yaml | yq '.spec.configuration.destinationPath | downcase')
  fi
  aws s3 rm --recursive ${PREFIX}/${NAME}
done
For Google Cloud:

for CONFIG in $(find new-cluster-configs -type f)
do
  NAME=$(yq '.metadata.name' $CONFIG)
  NAMESPACE=$(yq '.metadata.namespace' $CONFIG)
  # Try to get PREFIX from the cluster config first, falling back to the ObjectStore if it fails
  PREFIX=$(yq '.spec.backup.barmanObjectStore.destinationPath | downcase' $CONFIG 2>/dev/null)
  if [ -z "$PREFIX" ] || [ "$PREFIX" = "null" ]; then
    # destinationPath doesn't exist in the cluster config (after the CNPG-I barman plugin migration),
    # so get it from the ObjectStore resource instead
    PREFIX=$(kubectl -n $NAMESPACE get objectstores.barmancloud.cnpg.io $NAME -o yaml | yq '.spec.configuration.destinationPath | downcase')
  fi
  gsutil -m rm -r ${PREFIX}/${NAME}
done
Apply the YAML files so that all the HM-internal database clusters are re-created with the backup data:
for CONFIG in $(find new-cluster-configs -type f)
do
  kubectl apply -f $CONFIG
done
You can monitor the restore progress using kubectl get clusters -A.
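For example, you can watch the HM-internal clusters until they all report a healthy status (the full resource name below matches the one used in the earlier commands):

# Watch the restore of the HM-internal database clusters
kubectl get clusters.postgresql.k8s.enterprisedb.io -A -w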
Restart HM services
After all HM-internal database clusters are successfully restored and reporting a healthy state, perform this one-time restart of the management server to refresh the HM console:
kubectl delete pods $(kubectl get pods -n upm-beaco-ff-base | grep '^accm-server' | awk '{print $1}') -n upm-beaco-ff-base
Wait for the new pod to reach the Running state. At this point, the HM console is available, though it won't yet show your HM-managed Postgres clusters.
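To check on the replacement pod, you can re-run the same filter used in the delete command above, for example:

# Repeat until the new accm-server pod shows STATUS "Running"
kubectl get pods -n upm-beaco-ff-base | grep '^accm-server'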
Configure the Velero plugin
The Velero plugin handles the transformation of Kubernetes resources during the restore. Most importantly, it ensures Postgres clusters are restored in a state that allows you to manually trigger their data recovery.
List the available backups and note the Name and Timestamp of your preferred recovery point:
velero get backups -o json --selector velero.io/storage-location=recovery \
  | jq -rc '(["Name", "Timestamp"]), (.items // [.] | .[] | [.metadata.name, .metadata.creationTimestamp]) | @tsv' \
  | column -t -s "$(printf '\t')"
Export the environment variables:
export BACKUP_TIMESTAMP=<recovery date in YYYY-MM-DDTHH:MM:SSZ format>
export BACKUP_NAME=<selected name>

# These environment variables should already be available in your terminal
export OLD_BUCKET=<old bucket name>
export NEW_BUCKET=<new bucket name>
Note
The BACKUP_TIMESTAMP must be the exact ISO timestamp (e.g., 2024-12-16T15:44:03Z) found in the previous step.

Create and apply a ConfigMap to configure the Velero plugin:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: velero-plugin-for-edbpgai
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    enterprisedb.io/edbpgai-plugin: RestoreItemAction
data:
  # configure disaster recovery mode, so restored items are transformed as needed
  drMode: "true"
  # configure a date corresponding to the Velero backup date. Note the format!
  drDate: "${BACKUP_TIMESTAMP}"
  # old and new buckets for internal custom storage locations
  oldBucket: ${OLD_BUCKET}
  newBucket: ${NEW_BUCKET}
EOF
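Optionally, read the ConfigMap back to double-check the values the plugin will use:

# Verify the plugin configuration that was just applied
kubectl get configmap velero-plugin-for-edbpgai -n velero -o yaml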
Restore resources
Restore Managed Storage Locations by applying the following Velero restore. This includes the default managed-devspatcher location as well as any additional custom-defined locations.

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-1-storagelocations
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - storagelocations.biganimal.enterprisedb.com
  includeClusterResources: true
  labelSelector:
    matchLabels:
      biganimal.enterprisedb.io/reserved-by-biganimal: "false"
EOF
Configure and apply the following Velero restore resource manifest to restore the cluster wrappers:
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-2-clusterwrappers
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - clusterwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
      - clusterwrappers.beacon.enterprisedb.com
EOF
Monitor the restore progress. You must wait until clusterwrappers is restored first, because the following custom resources (CRs) depend on it. If the corresponding clusterwrapper isn't found, HM could delete the other CRs.

velero get restore restore-2-clusterwrappers
After the cluster wrappers are restored, configure and apply the following Velero resource manifest to restore the backup wrappers:
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-3-backupwrappers
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - backupwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
      - backupwrappers.beacon.enterprisedb.com
EOF
Configure and apply the following Velero resource manifest to restore Griptape, Lakekeeper and Dex secrets:
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-4-required-secrets
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedNamespaces:
    - upm-griptape
    - upm-lakekeeper
    - upm-dex
  includedResources:
    - secrets
  includeClusterResources: false
EOF
(Optional) If you are running AI workloads, configure and apply the following Velero restore resource manifest to restore kserve resources:
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-5-kservecrs
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - clusterservingruntimes.serving.kserve.io
    - inferenceservices.serving.kserve.io
EOF
Monitor all restores and wait for them to be completed:
velero get restores
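If any restore reports errors or warnings, you can inspect it in more depth with Velero's describe command, for example:

# Show per-resource details for a single restore (replace the name as needed)
velero restore describe restore-2-clusterwrappers --details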
3. Restore Postgres clusters
The cluster metadata has been restored, but the HM-managed Postgres clusters must be manually re-provisioned to link back to your data.
In the HM console, navigate to the databases section. You will see your original clusters listed with a status of Deleted.
Select the desired cluster and locate the Restore button. Follow the prompts to create a new cluster. During this process, the system will use your previous backups to populate the new instance.
After provisioning is complete, verify that the data matches your original state.
You can apply the same procedure to restore any Postgres clusters you had configured on a secondary location.
Note
AI components (such as the GenAI Builder UI in the Launchpad section) will automatically reappear in the HM console once the restore is initiated. Due to the large size of container images and profiles, synchronization may take some time.
4. Validate the restore
The restoration procedure is now complete. To ensure a successful recovery, we recommend checking for data integrity. Log in to the newly provisioned Postgres cluster and run a few test queries to confirm your data is current and accessible.
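As a minimal example, assuming you can reach the restored cluster with psql (connection details are available from the HM console; the connection string and table name below are placeholders):

# Quick spot checks against the restored cluster
psql "<connection string from the HM console>" -c "SELECT now();"
psql "<connection string from the HM console>" -c "SELECT count(*) FROM <a_known_table>;"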
Tip
If you are performing this as part of a DR drill, internally document the total "Time to Restore" (TTR) for both the database and AI layers to help refine your recovery objectives (RTO).