Troubleshooting on Kubernetes


Kubernetes is a powerful platform for running containerized applications, but its distributed nature can make troubleshooting challenging. This guide covers common troubleshooting patterns, tools, and techniques to help you diagnose and resolve issues in Kubernetes environments.

Common troubleshooting goals

When troubleshooting Kubernetes, you are often trying to answer questions such as:

Is the cluster healthy and responsive?
Are Pods scheduled and running correctly?
Are Services routing traffic as expected?
Are ConfigMaps and Secrets properly configured?
Are PersistentVolumes and PersistentVolumeClaims healthy?
Are applications accessible externally and internally?
Are resource limits and quotas causing problems?
Are nodes available and healthy?

Core troubleshooting techniques

Checking cluster health

Check node status:
kubectl get nodes
kubectl describe node <node-name>
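Check API server health through its readiness endpoint:
kubectl get --raw='/readyz?verbose'
Check component health (if using kubeadm or a self-managed control plane; componentstatuses has been deprecated since Kubernetes v1.19 and may report incomplete results):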
kubectl get componentstatuses
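
On clusters with many nodes, it can help to filter the node list down to the ones that need attention. The following is a minimal sketch, not a definitive health check: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes, where the second column is STATUS.

```shell
# List nodes whose STATUS column is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well.
# Assumes the default "kubectl get nodes" columns: NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 "\t" $2 }'
}
```

Calling not_ready_nodes prints one line per affected node (name and status), and prints nothing when every node is Ready.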

Investigating Pod issues

Check Pod status:
kubectl get pods -n <namespace>
Describe a Pod to view detailed events:
kubectl describe pod <pod-name> -n <namespace>
View Pod logs:
kubectl logs <pod-name> -n <namespace>
View logs from the previous container instance (if the Pod is restarting):
kubectl logs --previous <pod-name> -n <namespace>
Check resource usage (requires the Metrics Server to be installed in the cluster):
kubectl top pods -n <namespace>
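
The Pod checks above can be narrowed to the Pods that look unhealthy. This is a rough sketch under stated assumptions: unhealthy_pods is a hypothetical helper name, and the parsing relies on the default kubectl get pods columns (NAME, READY, STATUS).

```shell
# Print Pods that are not fully ready: STATUS is neither Running nor Completed,
# or the READY column (for example 1/2) shows fewer ready containers than total.
# $1: namespace (supplied by the caller)
unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers | awk '
    {
      split($2, ready, "/")                 # READY column, e.g. "1/2"
      if ($3 == "Completed") next           # finished Jobs are not a problem
      if ($3 != "Running" || ready[1] != ready[2])
        print $1 "\t" $3 "\t" $2
    }'
}
```

For example, unhealthy_pods <namespace> prints the name, status, and ready count of each suspicious Pod, which can then be passed to kubectl describe pod or kubectl logs.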

Investigating node issues

Check node resource usage:
kubectl top nodes
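Look for scheduling failures in Pod events (for example, FailedScheduling events that report "Insufficient cpu" or "Insufficient memory").
Check the Conditions section of kubectl describe node <node-name> for MemoryPressure, DiskPressure, or PIDPressure.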
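
Pressure conditions can also be read for all nodes at once with a JSONPath query. The sketch below is one possible approach, not the only one: nodes_under_pressure is a hypothetical function name, and the output format of the query (node name followed by type=status pairs) is a choice made here for easy filtering.

```shell
# Print the names of nodes that report MemoryPressure, DiskPressure,
# or PIDPressure with status "True".
nodes_under_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .status.conditions[*]}{.type}={.status}{";"}{end}{"\n"}{end}' \
    | awk '/(Memory|Disk|PID)Pressure=True/ { print $1 }'
}
```

An empty result means no node currently reports a pressure condition.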

Troubleshooting networking

Verify Service configuration:
kubectl get services -n <namespace>
kubectl describe service <service-name> -n <namespace>
Verify Endpoints:
kubectl get endpoints <service-name> -n <namespace>
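Debug DNS resolution from within a Pod (the container image must include nslookup, and cluster.local is the default cluster domain):
kubectl exec <pod-name> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local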
Inspect Ingress resources (if used):
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>
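
A Service with no ready endpoints is a frequent cause of unreachable applications, so the Endpoints check is worth scripting. The following is a minimal sketch: service_has_endpoints is a hypothetical helper, and the messages it prints are illustrative only.

```shell
# Report whether a Service currently has any ready endpoint addresses.
# $1: Service name, $2: namespace
# Returns 0 if at least one address is listed, 1 otherwise.
service_has_endpoints() {
  ips="$(kubectl get endpoints "$1" -n "$2" -o jsonpath='{.subsets[*].addresses[*].ip}')"
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $1: check the Service selector and Pod readiness"
    return 1
  fi
  echo "endpoints for $1: $ips"
}
```

If the function reports no endpoints, compare the selector shown by kubectl describe service with the labels on the Pods that are expected to back the Service.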

Troubleshooting storage

Check PVC status:
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
Check StorageClass:
kubectl get storageclass
kubectl describe storageclass <storageclass-name>
Inspect CSI driver Pods (typically in the kube-system namespace):
kubectl get pods -n kube-system | grep csi
kubectl logs <csi-pod-name> -n kube-system
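
To connect a stuck claim to its StorageClass in one step, the PVC list can be filtered to claims that are not Bound. This is a sketch under assumptions: pending_pvcs is a hypothetical name, and the three custom columns are chosen here for illustration.

```shell
# List PVCs that are not Bound, together with the StorageClass they request.
# $1: namespace
pending_pvcs() {
  kubectl get pvc -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,CLASS:.spec.storageClassName \
    | awk '$2 != "Bound" { print $1 "\t" $2 "\t" $3 }'
}
```

The third column can be passed to kubectl describe storageclass to confirm that the class exists and which provisioner it uses.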

Investigating configuration issues

Check ConfigMaps:
kubectl get configmaps -n <namespace>
kubectl describe configmap <configmap-name> -n <namespace>
Check Secrets:
kubectl get secrets -n <namespace>
kubectl describe secret <secret-name> -n <namespace>
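
kubectl describe secret lists keys and their sizes but not the values, which are stored base64-encoded. To inspect a single value, one option is the sketch below. The function name secret_value is hypothetical, and keys that contain dots (such as tls.crt) would need the dot escaped in the JSONPath expression, which this sketch does not handle.

```shell
# Print the decoded value of one key in a Secret.
# $1: Secret name, $2: key, $3: namespace
# Note: the decoded value is written to the terminal in plain text.
secret_value() {
  kubectl get secret "$1" -n "$3" -o jsonpath="{.data.$2}" | base64 --decode
}
```

Use this with care on shared terminals and in CI logs, since the output is the unprotected secret.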

Monitoring and performance tuning

Use kubectl top to monitor resource usage.
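Set resource requests and limits deliberately: CPU limits that are too low cause throttling, memory limits that are too low cause containers to be OOM-killed, and requests that understate real usage make Pods more likely to be evicted when a node is under pressure.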
Review node capacity and consider scaling the cluster if resource pressure is high.
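
When looking for the workloads behind resource pressure, sorting the kubectl top output is usually the quickest first step. A minimal sketch, assuming the Metrics Server is installed; top_memory_pods is a hypothetical helper name and the default of five rows is arbitrary.

```shell
# Show the Pods using the most memory in a namespace.
# $1: namespace, $2: number of Pods to show (defaults to 5)
top_memory_pods() {
  kubectl top pods -n "$1" --sort-by=memory --no-headers | head -n "${2:-5}"
}
```

Replacing --sort-by=memory with --sort-by=cpu gives the equivalent view for CPU usage.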

Common symptoms and troubleshooting patterns

Symptom	Common causes	Suggested checks
Pods stuck in `Pending`	Insufficient resources, unschedulable nodes, missing PVCs	`kubectl describe pod`, `kubectl describe node`
Pods in `CrashLoopBackOff`	Application errors, configuration issues, failed dependencies	`kubectl logs`, `kubectl describe pod`
Services not reachable	Missing or misconfigured Endpoints, Ingress issues, firewall rules	`kubectl describe service`, `kubectl get endpoints`
PVC stuck in `Pending`	StorageClass issues, CSI driver problems	`kubectl describe pvc`, `kubectl describe storageclass`
High resource usage	Misconfigured limits, noisy neighbors, capacity limits	`kubectl top pods`, `kubectl top nodes`
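
Part of the table above can be encoded as a small lookup for use in scripts or runbooks. This is only an illustrative sketch: suggest_checks is a hypothetical function, and it covers just the status-based rows of the table.

```shell
# Map a resource kind and status to the checks suggested in the table.
# $1: resource kind ("pod" or "pvc"), $2: status as shown by kubectl
suggest_checks() {
  case "$1:$2" in
    pod:Pending)          echo "kubectl describe pod; kubectl describe node" ;;
    pod:CrashLoopBackOff) echo "kubectl logs; kubectl describe pod" ;;
    pvc:Pending)          echo "kubectl describe pvc; kubectl describe storageclass" ;;
    *)                    echo "no suggestion for $1 in state $2" ;;
  esac
}
```

For example, suggest_checks pod CrashLoopBackOff prints the two commands listed for that symptom.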

Useful tools and resources

kubectl: Primary CLI for Kubernetes troubleshooting.
k9s: Terminal UI to explore and troubleshoot Kubernetes resources.
Lens: Desktop Kubernetes dashboard with troubleshooting capabilities.
Cloud provider monitoring tools (e.g., AWS CloudWatch, GCP Monitoring, Azure Monitor).
Prometheus / Grafana: Advanced metrics collection and visualization.

Summary

Troubleshooting Kubernetes involves inspecting multiple layers of the system:

Nodes and cluster components
Workload Pods and their resource usage
Networking, Services, and Ingress
Storage and PersistentVolumeClaims
Configuration via ConfigMaps and Secrets

Develop a systematic approach using Kubernetes-native tools (kubectl, k9s), observability platforms (Prometheus, Grafana), and your cloud provider’s services to effectively diagnose and resolve issues in your Kubernetes environments.

