Monitor Hybrid Manager cluster state

This how-to guide explains how to monitor the state of the Hybrid Manager (HM) cluster and its key components.

This is useful for:

Understanding current system health
Monitoring ongoing Hybrid Manager operations
Establishing baseline cluster state
Supporting troubleshooting workflows
Performing periodic system checks

Important: This guide covers monitoring Hybrid Manager platform components — not the managed Postgres clusters.

What to monitor

Monitor the following aspects of the Hybrid Manager environment:

Hybrid Manager UI/API availability and responsiveness
Operator health and reconciliation activity
Backup agent and Transporter job status
Storage location operator status
Beacon and telemetry activity
Log pipeline health (Loki / Fluent Bit)
Kubernetes cluster resource health (nodes, Pods, PVCs, Services)

Key tools and data sources

Use the following tools:

Grafana dashboards → metrics and high-level system views
Prometheus → raw metrics exploration
Loki (via Grafana Explore) → component logs
kubectl → Kubernetes component status and Events
Cloud provider dashboards → LoadBalancer health, storage health (optional)

Grafana dashboards to monitor

Recommended dashboards:

Kubernetes cluster dashboard → node health, resource usage, Pod health
Hybrid Manager platform dashboard → UI/API availability, operator metrics, Transporter metrics
Postgres operator dashboard → operator health, reconciliation metrics
Transporter dashboard → backup and data movement metrics
Beacon/Telemetry dashboard → data flow to observability stack

Work with your platform team to validate and tune these dashboards.

Key Kubernetes checks

Use kubectl to monitor component health:

kubectl get pods -A → look for Pending, CrashLoopBackOff, Error states
kubectl get nodes → validate node Ready status and resource pressure
kubectl get pvc -A → monitor PVC capacity and binding status
kubectl get svc -A → validate LoadBalancer and Service endpoints

Key log checks

Use Loki and/or kubectl logs:

Look for recent errors in API, UI, operator, Transporter, and Beacon logs.
Validate regular reconciliation activity in operator logs.
Validate successful backup and Transporter job completion.
Look for errors in Fluent Bit or Loki components (log pipeline health).

Recommended monitoring cadence

Ongoing → dashboard and alert monitoring
Daily checks → Pod health, node health, key component logs
Pre/post change → baseline and compare component state
Periodic validation → test full monitoring pipeline (logs, metrics, dashboards)

Summary checklist

Use Grafana dashboards to monitor overall Hybrid Manager platform state.
Use Prometheus for detailed metric validation.
Use Loki and kubectl logs for log-based troubleshooting.
Monitor Kubernetes node, Pod, PVC, and Service health.
Establish baseline state and monitor deviations.
Perform regular cluster state validation to detect emerging issues.

Could this page be better? Report a problem or suggest an addition!