Troubleshooting on Kubernetes
Kubernetes is a powerful platform for running containerized applications, but its distributed nature can make troubleshooting challenging. This guide covers common troubleshooting patterns, tools, and techniques to help you diagnose and resolve issues in Kubernetes environments.
Common troubleshooting goals
When troubleshooting Kubernetes, you are often trying to answer questions such as:
- Is the cluster healthy and responsive?
- Are Pods scheduled and running correctly?
- Are Services routing traffic as expected?
- Are ConfigMaps and Secrets properly configured?
- Are PersistentVolumes and PersistentVolumeClaims healthy?
- Are applications accessible externally and internally?
- Are resource limits and quotas causing problems?
- Are nodes available and healthy?
Core troubleshooting techniques
Checking cluster health
- Check node status:
kubectl get nodeskubectl describe node <node-name>- Check component health (if using kubeadm or self-managed control plane):
kubectl get componentstatuses
Investigating Pod issues
- Check Pod status:
kubectl get pods -n <namespace>- Describe a Pod to view detailed events:
kubectl describe pod <pod-name> -n <namespace>- View Pod logs:
kubectl logs <pod-name> -n <namespace>kubectl logs --previous <pod-name> -n <namespace>(if restarting)- Check resource usage:
kubectl top pods -n <namespace>
Investigating node issues
- Check node resource usage:
kubectl top nodes- Look for scheduling issues in Pod Events (e.g., Insufficient CPU/Memory).
- Investigate potential disk pressure or memory pressure conditions.
Troubleshooting networking
- Verify Service configuration:
kubectl get services -n <namespace>kubectl describe service <service-name> -n <namespace>- Verify Endpoints:
kubectl get endpoints <service-name> -n <namespace>- Debug DNS resolution:
- From within a Pod:
nslookup <service-name>.<namespace>.svc.cluster.local - Inspect Ingress resources (if used):
kubectl get ingress -n <namespace>kubectl describe ingress <ingress-name> -n <namespace>
Troubleshooting storage
- Check PVC status:
kubectl get pvc -n <namespace>kubectl describe pvc <pvc-name> -n <namespace>- Check StorageClass:
kubectl get storageclasskubectl describe storageclass <storageclass-name>- Inspect CSI driver Pods (typically in
kube-systemnamespace): kubectl get pods -n kube-system | grep csikubectl logs <csi-pod-name> -n kube-system
Investigating configuration issues
- Check ConfigMaps:
kubectl get configmaps -n <namespace>kubectl describe configmap <configmap-name> -n <namespace>- Check Secrets:
kubectl get secrets -n <namespace>kubectl describe secret <secret-name> -n <namespace>
Monitoring and performance tuning
- Use
kubectl topto monitor resource usage. - Set resource requests and limits carefully to prevent evictions and throttling.
- Review node capacity and consider scaling the cluster if resource pressure is high.
Common symptoms and troubleshooting patterns
| Symptom | Common causes | Suggested checks |
|---|---|---|
Pods stuck in Pending | Insufficient resources, unschedulable nodes, missing PVCs | kubectl describe pod, kubectl describe node |
Pods in CrashLoopBackOff | Application errors, configuration issues, failed dependencies | kubectl logs, kubectl describe pod |
| Services not reachable | Missing or misconfigured Endpoints, Ingress issues, firewall rules | kubectl describe service, kubectl get endpoints |
PVC stuck in Pending | StorageClass issues, CSI driver problems | kubectl describe pvc, kubectl describe storageclass |
| High resource usage | Misconfigured limits, noisy neighbors, capacity limits | kubectl top pods, kubectl top nodes |
Useful tools and resources
- kubectl: Primary CLI for Kubernetes troubleshooting.
- k9s: Terminal UI to explore and troubleshoot Kubernetes resources.
- Lens: Desktop Kubernetes dashboard with troubleshooting capabilities.
- Cloud provider monitoring tools (e.g., AWS CloudWatch, GCP Monitoring, Azure Monitor).
- Prometheus / Grafana: Advanced metrics collection and visualization.
Summary
Troubleshooting Kubernetes involves inspecting multiple layers of the system:
- Nodes and cluster components
- Workload Pods and their resource usage
- Networking, Services, and Ingress
- Storage and PersistentVolumeClaims
- Configuration via ConfigMaps and Secrets
Develop a systematic approach using Kubernetes-native tools (kubectl, k9s), observability platforms (Prometheus, Grafana), and your cloud provider’s services to effectively diagnose and resolve issues in your Kubernetes environments.
Related topics
Could this page be better? Report a problem or suggest an addition!