Kubernetes troubleshooting best practices

Suggest edits

Hybrid Manager runs on Kubernetes, and effective troubleshooting of Kubernetes clusters is essential for diagnosing and resolving issues that impact Hybrid Manager components and EDB-managed Postgres services.

This page provides troubleshooting best practices and common patterns for investigating Kubernetes environments used with Hybrid Manager. It complements the actionable How to troubleshoot Kubernetes guide, which provides specific commands and procedures.

For background on how Kubernetes is used in Hybrid Manager, see Kubernetes in Hybrid Manager.

Understand the layers involved

When troubleshooting Hybrid Manager on Kubernetes, be aware of the layered architecture:

Cloud infrastructure layer (nodes, storage, networking)
Kubernetes cluster layer (control plane, worker nodes, network policies, Services, Ingress)
Hybrid Manager platform components (UI, API, operators, backup agents)
Managed Postgres clusters (StatefulSets, PVCs, Services)
External integrations (cloud storage, identity providers, observability stack)

Understanding where a failure occurs across these layers helps guide the troubleshooting process.

Start with Kubernetes Events and Pod status

Most Hybrid Manager-related issues surface first in Kubernetes Pod status or Events.

Best practices:

Check Pod status for all relevant namespaces, especially Hybrid Manager namespaces and Postgres namespaces.
Review recent Kubernetes Events — they often reveal resource scheduling failures, volume mount errors, network policy issues, or image pull problems.
Pay attention to Pods in Pending, CrashLoopBackOff, or Error states.

Use layered troubleshooting

Troubleshooting should proceed in logical layers:

Infrastructure → Node Ready status, node resource pressure, storage availability, network health
Control plane → API server health, etcd health (if self-managed), controller logs
Workloads → Hybrid Manager components and Postgres clusters → Pod status, Events, logs
Services → Kubernetes Services and Ingress behavior, LoadBalancer status, DNS resolution
External integrations → Storage access (IRSA, Workload Identity), IAM/IDP behavior, observability pipeline health

Following this layered flow reduces time spent chasing irrelevant causes.

Validate storage first for stateful issues

Many issues affecting Postgres clusters or Hybrid Manager operators involve Kubernetes storage:

PVC stuck Pending
PVC full
Volume mount errors
Volume expansion failures
Underlying cloud disk failures

Validate storage health and capacity early in any troubleshooting process involving databases.

Validate networking for connectivity issues

Network misconfiguration is a common source of application errors:

Validate Service endpoints and Pod readiness.
Validate Ingress or LoadBalancer behavior.
Validate DNS resolution inside the cluster.
Validate NetworkPolicies are not overly restrictive.
Check firewall rules and SecurityGroups for LoadBalancer services.

Leverage observability stack

Prometheus, Grafana, and Loki provide key insights:

Use dashboards to correlate resource usage spikes with incidents.
Review component-level metrics and logs.
Trace slow queries or network latency back to Kubernetes or infrastructure causes.

Observability data complements Kubernetes Events and CLI troubleshooting.

Correlate Postgres behavior with Kubernetes signals

When troubleshooting database issues:

Always cross-reference Postgres metrics and logs with Kubernetes Pod status and Events.
Many Postgres behaviors (failover, replication lag, connection limits) can be impacted by Kubernetes-level issues like node pressure, network disruptions, or volume latency.

Use this correlation to avoid false root causes.

Document and automate troubleshooting