Hybrid Manager disaster recovery v1.3.5

The Disaster Recovery (DR) procedure is the series of manual steps you need to take to recover your HM installation and your HM-managed Postgres clusters.

Warning

You must regularly test and update your organization's DR procedure for it to remain valid.

Before you start

Before starting the DR procedure, make sure you are familiar with the following prerequisites and required tools.

Prerequisites

  • A new HM instance deployed and running. It must be running the same version as the old instance that failed or became unavailable.

  • The container images used to build the clusters in the old, unavailable HM instance must be available to the new instance.

Required tools

Ensure the following tools are available on your workstation or bastion host (you can verify them with the quick check after this list):

  • Velero CLI

  • jq (Command-line JSON processor)

  • yq (Command-line YAML processor)
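
A quick way to verify the tools are installed (a minimal check; the exact version requirements depend on your HM release):

velero version --client-only
jq --version
yq --version
# kubectl is also used throughout this guide
kubectl version --client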

1. Make backups available in the new HM instance

The first step makes the backups of the unavailable HM instance (“old backups”) reachable from the new HM instance by copying them to the linked storage of the new HM instance (the “new bucket”).

  1. Obtain the bucket names, the backup ID, and the region of the new bucket. Store these as environment variables to be used in the commands throughout this guide:

    Important

    These variables are session-specific. If you open a new terminal tab or your session times out, the variables will be lost and subsequent commands will fail. To avoid re-typing these values, you can save these export commands in a small shell script (e.g., dr-env.sh) and then source that file in any new terminal window to instantly reload your environment, as shown in the example below.

    export OLD_BUCKET=<old_bucket>
    export OLD_BACKUP_ID=<old_environment_internal_backup_id>
    export NEW_BUCKET=<new_bucket>
    export NEW_REGION=<region_of_the_new_bucket>
    How do I obtain the old bucket values?

    To obtain the old bucket values:

    1. Go to your cloud service provider (CSP) console or dashboard and open the buckets section.
    2. Find and select the bucket linked to the backups of your old HM instance.
    3. Browse to the edb-internal-backups folder. Inside it you will find a subfolder named after the backup ID, e.g. 4be7a1c8c9f0.

    EKS Example

    This is an example for setting the environment variables for an HM instance deployed on EKS:

    export OLD_BUCKET=eks-1105143903-2511-edb-postgres
    export OLD_BACKUP_ID=a7462dbc7106
    export NEW_BUCKET=eks-1105155418-2511-edb-postgres
    export NEW_REGION=eu-west-3
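
    If you saved the exports to dr-env.sh as suggested above, you can reload them in any new terminal with:

    source ./dr-env.sh
    # confirm the variables are set
    echo "${OLD_BUCKET} ${OLD_BACKUP_ID} ${NEW_BUCKET} ${NEW_REGION}"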

  2. To copy the data from the old bucket to the new bucket, you first need to locate and note the names of the source and target folders. You need to copy the following folders and their content:

    • Internal EDB backups folder: the internal backups folder in the old bucket, edb-internal-backups/<random-string>, differs from the one in the new HM instance, which has its own, different <random-string>.

    • Postgres cluster backups folder: customer-pg-backups.

    • Folders for any defined custom storage locations: if you use Managed Storage Locations in the HM console (e.g., for offloading Postgres queries), make sure the corresponding folders are copied from the old S3-compatible bucket to the new one. The location definitions are restored via Velero, but the actual data inside those custom folders must be migrated manually to the new target bucket.

  3. Copy the old backups to the new bucket using your preferred tools. Here are some examples using cloud service provider CLIs to move data between buckets:
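
    For instance, on AWS you could use the AWS CLI along these lines (a sketch only; the destination folder names, in particular the internal backup folder, are assumptions you should verify for your environment before copying):

    # Internal EDB backups (assumption: the old backup ID is kept as the folder
    # name in the new bucket so the restore can find it; verify the expected path)
    aws s3 sync "s3://${OLD_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID}/" \
                "s3://${NEW_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID}/"

    # Postgres cluster backups
    aws s3 sync "s3://${OLD_BUCKET}/customer-pg-backups/" \
                "s3://${NEW_BUCKET}/customer-pg-backups/"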

  4. Load the backups you just copied into your new HM instance by creating a new backup storage location custom resource and applying it to the new HM instance:
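
    The exact manifest depends on your HM version and cloud provider. As an illustration only, a Velero BackupStorageLocation named recovery (the name the later commands in this guide select on), pointing at the copied internal backups, could look like this; the provider, prefix, and config values are assumptions to adapt:

    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1
    kind: BackupStorageLocation
    metadata:
      name: recovery
      namespace: velero
    spec:
      provider: aws
      accessMode: ReadOnly
      objectStorage:
        bucket: ${NEW_BUCKET}
        prefix: edb-internal-backups/${OLD_BACKUP_ID}
      config:
        region: ${NEW_REGION}
    EOF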

  5. Confirm that the new storage location is available:

    velero get backup-locations

    If the status is not Available, check the Velero pod logs for permission errors on the S3 bucket.

  6. Confirm that the backups are available as well:

    velero get backups --selector velero.io/storage-location=recovery
  7. Choose the backup you want to restore from. You can have multiple backups available, so choose the one that best suits your needs, e.g. the most recent backup before the disaster happened. Note the Velero backup name, as well as the date and time (UTC), as both are required for a restore, for example:

    NAME                                      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
    velero-backup-kube-state-20241216154403   Completed   0        0          2024-12-16 16:44:03 +0100 CET   5d        recovery           <none>
    Note

    The timestamp value is referred to as the recovery date in the instructions that follow.

  8. (Optional) If you were using HM to manage AI workloads, e.g. with the GenAI Builder, also copy the object store files and CORS configuration from the old bucket to the new one:

    export OLD_BUCKET_DATALAKE=<old_datalake_bucket>
    export NEW_BUCKET_DATALAKE=<new_datalake_bucket>
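
    For example, with the AWS CLI (a sketch; adjust to your CSP and verify that the copied CORS rules still make sense for the new environment):

    # Copy the object store files used by the AI workloads
    aws s3 sync "s3://${OLD_BUCKET_DATALAKE}/" "s3://${NEW_BUCKET_DATALAKE}/"

    # Copy the CORS configuration from the old bucket to the new one
    aws s3api get-bucket-cors --bucket "${OLD_BUCKET_DATALAKE}" > cors.json
    aws s3api put-bucket-cors --bucket "${NEW_BUCKET_DATALAKE}" --cors-configuration file://cors.json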

2. Recovery steps

Restore HM-internal databases

After the old backups are available in the new bucket, you can restore the HM-internal databases. These are back-end services used by HM and are required to fully restore the HM instance. Depending on the HM version you are using and on the installation scenario you have deployed, the list of databases may vary.

  1. To simplify this process, run the following script with your kubeconfig pointing to your new HM installation:

    patch-clusters.sh

    Script details

    This patch script takes care of:

    • Saving the HM-internal database cluster manifests to YAML files, generating two directories:
      • old-cluster-configs, containing the current state of the database clusters in the new HM installation (the default configuration after installation).
      • new-cluster-configs, containing the same files after the script applies the patches required for the HM-internal databases to start using the data from the backups.
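
    To review what the script changed before going further, you can compare the two directories, for example:

    diff -ru old-cluster-configs new-cluster-configs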

  2. Suspend the reconciliation of all HM-internal database clusters so that you can safely remove their existing custom resources (CRs) without the operator recreating them by default:

    HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')
    for CLUSTER in $(kubectl get clusters.postgresql.k8s.enterprisedb.io -A -o json | jq -rc '.items[].metadata | select((.name | test("^p-") | not) and (.name != "stats-collector-db")) | {namespace: .namespace}' | uniq)
    do
     NAMESPACE=$(echo "${CLUSTER}" | jq -rc '.namespace')
     INDEX=$(kubectl get hybridcontrolplane ${HCP_CR} -o json | jq '.status.components | to_entries[] | select(.value.name=='\"${NAMESPACE}\"') | .key')
     kubectl patch hybridcontrolplane ${HCP_CR} --subresource=status --type=json -p "[{\"op\": \"replace\", \"path\": \"/status/components/$INDEX/suspended\", \"value\": true}]"
    done
  3. Verify that the components have been suspended correctly:

    HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')
    kubectl hcp status -n edbpgai-bootstrap "${HCP_CR}"
  4. Delete the HM-internal database clusters that were created during installation of the new HM instance to make room for the HM-internal database clusters that will be recovered from the backup:

    for CONFIG in $(find new-cluster-configs -type f)
    do
        kubectl delete -f $CONFIG
    done
  5. Clean the backup area that was created during the installation of the new HM instance to avoid confusion with the old backups that you want to restore:
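
    The exact objects to remove depend on your environment. As a hedged sketch, assuming the fresh installation created its own folder under edb-internal-backups in the new bucket, you could first list the folders and only then delete the one belonging to the new installation (verify the path carefully, and do not remove the folder you copied the old backups into):

    # List the internal backup folders in the new bucket
    aws s3 ls "s3://${NEW_BUCKET}/edb-internal-backups/"

    # Remove the folder created by the fresh installation
    # (replace <new_backup_id> with the folder you verified above)
    aws s3 rm "s3://${NEW_BUCKET}/edb-internal-backups/<new_backup_id>/" --recursive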

  6. Apply the YAML files so that all the HM-internal database clusters are re-created with the backup data:

    for CONFIG in $(find new-cluster-configs -type f)
    do
        kubectl apply -f $CONFIG
    done

    You can monitor the restore progress using kubectl get clusters -A.
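
    For example, to watch the clusters until they all report a healthy state:

    kubectl get clusters.postgresql.k8s.enterprisedb.io -A -w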

Restart HM services

After all HM-internal database clusters are successfully restored and reporting a healthy state, perform this one-time restart of the management server to refresh the HM console:

kubectl delete pods $(kubectl get pods -n upm-beaco-ff-base | grep '^accm-server' | awk '{print $1}') -n upm-beaco-ff-base

Wait for the new pod to reach the Running state. At this point, the HM console is available, though it won't yet show your HM-managed Postgres clusters.
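
For example, you can watch the replacement pod come up with:

kubectl get pods -n upm-beaco-ff-base -w | grep '^accm-server'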

Configure the Velero plugin

The Velero plugin handles the transformation of Kubernetes resources during the restore. Most importantly, it ensures Postgres clusters are restored in a state that allows you to manually trigger their data recovery.

  1. List the available backups and note the Name and Timestamp of your preferred recovery point:

    velero get backups -o json --selector velero.io/storage-location=recovery \
    | jq -rc '(["Name", "Timestamp"]), (.items // [.] | .[] | [.metadata.name, .metadata.creationTimestamp]) | @tsv' \
    | column -t -s "$(printf '\t')"
  2. Export the environment variables:

    export BACKUP_TIMESTAMP=<recovery date in YYYY-MM-DDTHH:MM:SSZ format>
    export BACKUP_NAME=<selected name>
    # These environment variables should already be available in your terminal
    export OLD_BUCKET=<old bucket name>
    export NEW_BUCKET=<new bucket name>
    Note

    The BACKUP_TIMESTAMP must be the exact ISO timestamp (e.g., 2024-12-16T15:44:03Z) found in the previous step.
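
    If you only noted the local-time CREATED value from the velero output, you can convert it to the required UTC format, for example with GNU date (an assumption; BSD/macOS date uses different flags):

    export BACKUP_TIMESTAMP=$(date -u -d "2024-12-16 16:44:03 +0100" +"%Y-%m-%dT%H:%M:%SZ")
    echo "${BACKUP_TIMESTAMP}"   # prints 2024-12-16T15:44:03Z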

  3. Create and apply a ConfigMap to configure the Velero plugin:

    kubectl apply -f - <<EOF
    apiVersion: v1  
    kind: ConfigMap  
    metadata:  
      name: velero-plugin-for-edbpgai  
      namespace: velero  
      labels:  
        velero.io/plugin-config: ""  
        enterprisedb.io/edbpgai-plugin: RestoreItemAction  
    data:  
      # configure disaster recovery mode, so restored items are transformed as needed  
      drMode: "true"  
      # configure a date corresponding to the velero backup date. Note the format!  
      drDate: "${BACKUP_TIMESTAMP}"  
      # old and new buckets for internal custom storage locations  
      oldBucket: ${OLD_BUCKET}  
      newBucket: ${NEW_BUCKET}
    EOF

Restore resources

  1. Restore Managed Storage Locations by applying the following Velero restore. This includes the default managed-devspatcher location as well as any additional custom-defined locations.

    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1  
    kind: Restore  
    metadata:  
      name: restore-1-storagelocations  
      namespace: velero  
    spec:  
      backupName: "${BACKUP_NAME}" 
      includedResources:  
       - storagelocations.biganimal.enterprisedb.com  
      includeClusterResources: true  
      labelSelector:  
        matchLabels:  
          biganimal.enterprisedb.io/reserved-by-biganimal: "false"
    EOF
  2. Configure and apply the following Velero restore resource manifest to restore the cluster wrappers:

    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1  
    kind: Restore  
    metadata:  
      name: restore-2-clusterwrappers  
      namespace: velero  
    spec:  
      backupName: "${BACKUP_NAME}" 
      includedResources:  
       - clusterwrappers.beacon.enterprisedb.com  
      restoreStatus:  
        includedResources:  
         - clusterwrappers.beacon.enterprisedb.com
    EOF
  3. Monitor the restore progress. You must wait until the cluster wrappers are restored before continuing, because the custom resources (CRs) restored in the following steps depend on them. If the corresponding clusterwrapper isn't found, HM could delete the other CRs.

    velero get restore restore-2-clusterwrappers
  4. After the cluster wrappers are restored, configure and apply the following Velero resource manifest to restore the backup wrappers:

    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1  
    kind: Restore  
    metadata:  
      name: restore-3-backupwrappers  
      namespace: velero  
    spec:   
      backupName: "${BACKUP_NAME}" 
      includedResources:  
       - backupwrappers.beacon.enterprisedb.com  
      restoreStatus:  
        includedResources:  
         - backupwrappers.beacon.enterprisedb.com
    EOF
  5. Configure and apply the following Velero resource manifest to restore Griptape, Lakekeeper and Dex secrets:

    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: restore-4-required-secrets
      namespace: velero
    spec:
      backupName: "${BACKUP_NAME}"
      includedNamespaces:
      - upm-griptape
      - upm-lakekeeper
      - upm-dex
      includedResources:
      - secrets
      includeClusterResources: false
    EOF
  6. (Optional) If you are running AI workloads, configure and apply the following Velero restore resource manifest to restore kserve resources:

    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: restore-5-kservecrs
      namespace: velero
    spec:
      backupName: "${BACKUP_NAME}"
      includedResources:
      - clusterservingruntimes.serving.kserve.io
      - inferenceservices.serving.kserve.io
    EOF
  7. Monitor all restores and wait for them to be completed:

    velero get restores

3. Restore Postgres clusters

The cluster metadata has been restored, but the HM-managed Postgres clusters must be manually re-provisioned to link back to your data.

  1. In the HM console, navigate to the databases section. You will see your original clusters listed with a status of Deleted.

  2. Select the desired cluster and locate the Restore button. Follow the prompts to create a new cluster. During this process, the system will use your previous backups to populate the new instance.

  3. After provisioning is complete, verify that the data matches your original state.

You can apply the same procedure to restore any Postgres clusters you had configured on a secondary location.

Note

AI components (such as the GenAI Builder UI in the Launchpad section) will automatically reappear in the HM console once the restore is initiated. Due to the large size of container images and profiles, synchronization may take some time.

4. Validate the restore

The restoration procedure is now complete. To ensure a successful recovery, we recommend checking for data integrity. Log in to the newly provisioned Postgres cluster and run a few test queries to confirm your data is current and accessible.
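
For example, using psql (the connection details and table name below are placeholders for your own environment):

psql "host=<restored_cluster_host> user=<db_user> dbname=<db_name> sslmode=require" \
  -c "SELECT now(), count(*) FROM <an_important_table>;"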

Tip

If you are performing this as part of a DR drill, internally document the total "Time to Restore" (TTR) for both the database and AI layers to help refine your recovery time objective (RTO).