Monitoring with Postflight checks (Innovation Release)

Check the results of the Postflight checks to confirm that your HM deployment is healthy and ready.

Role focus: Infrastructure Engineer / Platform Administrator

Prerequisites

  • You have completed and verified the installation of HM (Phase 5: Installing Hybrid Manager). The deployment is running, and you want to either check its state or troubleshoot failed checks.

Outcomes

  • Confirmation that the deployed Hybrid Manager (HM) platform is healthy and ready for day-2 configuration.

The Postflight custom resource provides automated health monitoring for your deployed HM stack. It periodically runs a set of built-in checks and reports an overall health status.

Note

Unlike preflight checks, postflight checks are purely observational: they don't block any operations or trigger automated remediation. They provide visibility for troubleshooting.

What postflight checks do

The operator uses a Postflight resource to run five built-in health checks at a configurable interval. Checks only begin after the HybridControlPlane reaches the deployed phase. Each check verifies a different aspect of the running platform:

Check            What it verifies
Pods             All HM pods are running and ready.
Databases        All internal EDB Postgres for Kubernetes clusters are ready and their latest backups completed.
Velero backups   The latest Velero backup completed successfully.
Nodes            All Kubernetes cluster nodes are ready.
Certificates     All cert-manager certificates are valid and ready.

If all checks pass, the overall phase is Healthy. If any check fails, the phase is Unhealthy.
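Because the overall state collapses to a single Healthy/Unhealthy phase, it's easy to gate scripts or CI steps on it. The following is a minimal sketch; in practice you would fetch the phase from the cluster first, for example with kubectl's jsonpath output (the resource name edbpgai-postflight matches the example output later on this page):

```shell
# Sketch of a health gate. In a live cluster, fetch the phase first:
#   phase=$(kubectl get postflight edbpgai-postflight -o jsonpath='{.status.phase}')
# Any phase other than Healthy fails the gate.
gate_on_phase() {
  if [ "$1" = "Healthy" ]; then
    echo "postflight OK"
  else
    echo "postflight phase is ${1:-unknown}" >&2
    return 1
  fi
}

gate_on_phase Healthy    # prints "postflight OK"
```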

Checking the postflight status

Quick overview

Check whether your installation is healthy:

kubectl get postflight

The output should look like this:

NAME                  PHASE     LAST CHECK
edbpgai-postflight    Healthy   2025-01-15T10:30:00Z

Detailed results

If the phase is Unhealthy, obtain more detailed results:

kubectl get postflight <name> -o yaml

The status.checkResults section shows individual results:

status:
  phase: Unhealthy
  lastTransitionTime: "2025-01-15T10:30:00Z"
  checkResults:
    - name: PodsChecker
      description: "Check whether all pods belonging to the Hybrid Manager are ready."
      state: Passed
    - name: DBsChecker
      description: "Check whether all internal databases belonging to the Hybrid Manager are ready."
      state: Passed
    - name: VeleroChecker
      description: "Check whether the latest Velero Backup has been completed."
      state: Failed
      messages:
        - "the latest velero backup is in a unhealthy phase: PartiallyFailed"
    - name: NodesChecker
      description: "Check whether all nodes are ready."
      state: Passed
    - name: CertificatesChecker
      description: "Check whether all certificates belonging to the Hybrid Manager are valid."
      state: Passed

Troubleshooting unhealthy checks

When the postflight phase is Unhealthy, inspect the status.checkResults section to identify which checks are failing and their associated messages.
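If you only want the names of the failing checks, you can filter the YAML output from `kubectl get postflight <name> -o yaml` instead of reading it by eye. This is a rough sketch: the awk filter remembers the last `- name:` entry seen and prints it whenever the following `state:` line is Failed. The input here is a stubbed excerpt; in practice, pipe the real command output into the same filter.

```shell
# Print the names of failing checkers from postflight status YAML.
failed_checks() {
  awk '/- name:/ {name = $3} /state: Failed/ {print name}'
}

# Stubbed excerpt of `kubectl get postflight <name> -o yaml` output:
failed_checks <<'EOF'
status:
  checkResults:
    - name: PodsChecker
      state: Passed
    - name: VeleroChecker
      state: Failed
EOF
```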

Pods check is failing

One or more HM pods are not in a ready state. Investigate the failing pods:

kubectl get pods -A | grep -v Running
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>

Common causes include image pull failures, resource limits, and crashlooping containers.
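Note that `grep -v Running` hides pods that are Running but not ready (for example, 0/1 containers ready). A small filter that catches both cases is sketched below, stubbed with sample `kubectl get pods -A` output; completed Job pods will also appear, so treat the result as a starting point rather than a definitive list.

```shell
# Print namespace/name for pods that are either not fully ready (READY m/n
# with m != n) or not in the Running status.
unready_pods() {
  awk 'NR > 1 {split($3, r, "/"); if (r[1] != r[2] || $4 != "Running") print $1 "/" $2}'
}

# Stubbed sample of `kubectl get pods -A` output:
unready_pods <<'EOF'
NAMESPACE   NAME    READY   STATUS             RESTARTS   AGE
edbpgai     api-0   1/1     Running            0          3d
edbpgai     ui-0    0/1     CrashLoopBackOff   12         3d
EOF
```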

Databases check is failing

An EDB Postgres for Kubernetes database cluster is not ready, or its latest backup failed. Check the cluster and backup status:

kubectl get clusters.postgresql.k8s.enterprisedb.io -A
kubectl get backups.postgresql.k8s.enterprisedb.io -A --sort-by=.metadata.creationTimestamp

Common causes include storage issues, connection failures, and WAL archiving problems.

Velero backup check is failing

The latest Velero backup is in a failed or partially failed state:

kubectl get backups.velero.io -n velero --sort-by=.metadata.creationTimestamp
kubectl describe backup <backup-name> -n velero

Nodes check is failing

One or more cluster nodes are not ready. This affects the entire cluster, not just HM:

kubectl get nodes
kubectl describe node <node-name>

Common causes include disk pressure, memory pressure, network unreachable conditions, and kubelet issues.

Certificates check is failing

A cert-manager certificate has not been issued or has expired:

kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>
kubectl get certificaterequests -A

Common causes include issuer configuration problems, rate limiting (for ACME/Let's Encrypt), and DNS validation failures.
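To narrow the output of `kubectl get certificates -A` down to the problem certificates, you can filter on the READY column. A sketch, stubbed with sample output in cert-manager's default column layout (NAMESPACE, NAME, READY, SECRET, AGE); in practice, pipe the real command into the filter:

```shell
# Print namespace/name for certificates whose READY column is not True.
not_ready_certs() {
  awk 'NR > 1 && $3 != "True" {print $1 "/" $2}'
}

# Stubbed sample of `kubectl get certificates -A` output:
not_ready_certs <<'EOF'
NAMESPACE   NAME      READY   SECRET    AGE
default     web-tls   True    web-tls   10d
edbpgai     api-tls   False   api-tls   2h
EOF
```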

Configuring the check interval

You can also change the interval at which the postflight checks run.

The spec.interval field controls how often checks run. The operator respects this interval when the phase is Healthy, requeuing for the remaining duration after each successful run.

apiVersion: edbpgai.edb.com/v1alpha1
kind: Postflight
metadata:
  name: edbpgai-postflight
spec:
  hm:
    name: edbpgai
  interval: 30m

Note

When the phase transitions to Unhealthy, the status is updated immediately. The interval only affects the cadence of re-checks when the system is healthy.