Kubernetes troubleshooting best practices
Hybrid Manager runs on Kubernetes, and effective troubleshooting of Kubernetes clusters is essential for diagnosing and resolving issues that impact Hybrid Manager components and EDB-managed Postgres services.
This page provides troubleshooting best practices and common patterns for investigating Kubernetes environments used with Hybrid Manager. It complements the actionable How to troubleshoot Kubernetes guide, which provides specific commands and procedures.
For background on how Kubernetes is used in Hybrid Manager, see Kubernetes in Hybrid Manager.
Understand the layers involved
When troubleshooting Hybrid Manager on Kubernetes, be aware of the layered architecture:
- Cloud infrastructure layer (nodes, storage, networking)
- Kubernetes cluster layer (control plane, worker nodes, network policies, Services, Ingress)
- Hybrid Manager platform components (UI, API, operators, backup agents)
- Managed Postgres clusters (StatefulSets, PVCs, Services)
- External integrations (cloud storage, identity providers, observability stack)
Understanding where a failure occurs across these layers helps guide the troubleshooting process.
Start with Kubernetes Events and Pod status
Most Hybrid Manager-related issues surface first in Kubernetes Pod status or Events.
Best practices:
- Check Pod status for all relevant namespaces, especially Hybrid Manager namespaces and Postgres namespaces.
- Review recent Kubernetes Events — they often reveal resource scheduling failures, volume mount errors, network policy issues, or image pull problems.
- Pay attention to Pods in Pending, CrashLoopBackOff, or Error states.
Use layered troubleshooting
Troubleshooting should proceed in logical layers:
- Infrastructure → Node Ready status, node resource pressure, storage availability, network health
- Control plane → API server health, etcd health (if self-managed), controller logs
- Workloads → Hybrid Manager components and Postgres clusters → Pod status, Events, logs
- Services → Kubernetes Services and Ingress behavior, LoadBalancer status, DNS resolution
- External integrations → Storage access (IRSA, Workload Identity), IAM/IDP behavior, observability pipeline health
Following this layered flow reduces time spent chasing irrelevant causes.
Validate storage first for stateful issues
Many issues affecting Postgres clusters or Hybrid Manager operators involve Kubernetes storage:
- PVC stuck Pending
- PVC full
- Volume mount errors
- Volume expansion failures
- Underlying cloud disk failures
Validate storage health and capacity early in any troubleshooting process involving databases.
Validate networking for connectivity issues
Network misconfiguration is a common source of application errors:
- Validate Service endpoints and Pod readiness.
- Validate Ingress or LoadBalancer behavior.
- Validate DNS resolution inside the cluster.
- Validate NetworkPolicies are not overly restrictive.
- Check firewall rules and SecurityGroups for LoadBalancer services.
Leverage observability stack
Prometheus, Grafana, and Loki provide key insights:
- Use dashboards to correlate resource usage spikes with incidents.
- Review component-level metrics and logs.
- Trace slow queries or network latency back to Kubernetes or infrastructure causes.
Observability data complements Kubernetes Events and CLI troubleshooting.
Correlate Postgres behavior with Kubernetes signals
When troubleshooting database issues:
- Always cross-reference Postgres metrics and logs with Kubernetes Pod status and Events.
- Many Postgres behaviors (failover, replication lag, connection limits) can be impacted by Kubernetes-level issues like node pressure, network disruptions, or volume latency.
Use this correlation to avoid false root causes.
Document and automate troubleshooting
Best practices:
- Maintain up-to-date runbooks for Hybrid Manager troubleshooting on Kubernetes.
- Automate routine checks and alert enrichment.
- Regularly review past incidents to identify recurring Kubernetes issues.
Automation and good documentation reduce mean time to resolution (MTTR) during production incidents.
Summary checklist
- Understand layered troubleshooting model for Hybrid Manager on Kubernetes.
- Always start with Kubernetes Events and Pod status.
- Validate storage early for stateful workloads.
- Validate networking for connectivity and access issues.
- Leverage observability data alongside Kubernetes CLI troubleshooting.
- Correlate database behavior with Kubernetes signals.
- Maintain runbooks and automate common troubleshooting checks.
Related topics
- On this page
- Understand the layers involved
- Start with Kubernetes Events and Pod status
- Use layered troubleshooting
- Validate storage first for stateful issues
- Validate networking for connectivity issues
- Leverage observability stack
- Correlate Postgres behavior with Kubernetes signals
- Document and automate troubleshooting
- Summary checklist
- Related topics
Could this page be better? Report a problem or suggest an addition!