Troubleshooting on Kubernetes

Kubernetes is a powerful platform for running containerized applications, but its distributed nature can make troubleshooting challenging. This guide covers common troubleshooting patterns, tools, and techniques to help you diagnose and resolve issues in Kubernetes environments.

Common troubleshooting goals

When troubleshooting Kubernetes, you are often trying to answer questions such as:

  • Is the cluster healthy and responsive?
  • Are Pods scheduled and running correctly?
  • Are Services routing traffic as expected?
  • Are ConfigMaps and Secrets properly configured?
  • Are PersistentVolumes and PersistentVolumeClaims healthy?
  • Are applications accessible externally and internally?
  • Are resource limits and quotas causing problems?
  • Are nodes available and healthy?

Core troubleshooting techniques

Checking cluster health

  • Check node status:
  • kubectl get nodes
  • kubectl describe node <node-name>
  • Check component health (if using kubeadm or self-managed control plane):
  • kubectl get componentstatuses

Investigating Pod issues

  • Check Pod status:
  • kubectl get pods -n <namespace>
  • Describe a Pod to view detailed events:
  • kubectl describe pod <pod-name> -n <namespace>
  • View Pod logs:
  • kubectl logs <pod-name> -n <namespace>
  • kubectl logs --previous <pod-name> -n <namespace> (if restarting)
  • Check resource usage:
  • kubectl top pods -n <namespace>

Investigating node issues

  • Check node resource usage:
  • kubectl top nodes
  • Look for scheduling issues in Pod Events (e.g., Insufficient CPU/Memory).
  • Investigate potential disk pressure or memory pressure conditions.

Troubleshooting networking

  • Verify Service configuration:
  • kubectl get services -n <namespace>
  • kubectl describe service <service-name> -n <namespace>
  • Verify Endpoints:
  • kubectl get endpoints <service-name> -n <namespace>
  • Debug DNS resolution:
  • From within a Pod: nslookup <service-name>.<namespace>.svc.cluster.local
  • Inspect Ingress resources (if used):
  • kubectl get ingress -n <namespace>
  • kubectl describe ingress <ingress-name> -n <namespace>

Troubleshooting storage

  • Check PVC status:
  • kubectl get pvc -n <namespace>
  • kubectl describe pvc <pvc-name> -n <namespace>
  • Check StorageClass:
  • kubectl get storageclass
  • kubectl describe storageclass <storageclass-name>
  • Inspect CSI driver Pods (typically in kube-system namespace):
  • kubectl get pods -n kube-system | grep csi
  • kubectl logs <csi-pod-name> -n kube-system

Investigating configuration issues

  • Check ConfigMaps:
  • kubectl get configmaps -n <namespace>
  • kubectl describe configmap <configmap-name> -n <namespace>
  • Check Secrets:
  • kubectl get secrets -n <namespace>
  • kubectl describe secret <secret-name> -n <namespace>

Monitoring and performance tuning

  • Use kubectl top to monitor resource usage.
  • Set resource requests and limits carefully to prevent evictions and throttling.
  • Review node capacity and consider scaling the cluster if resource pressure is high.

Common symptoms and troubleshooting patterns

SymptomCommon causesSuggested checks
Pods stuck in PendingInsufficient resources, unschedulable nodes, missing PVCskubectl describe pod, kubectl describe node
Pods in CrashLoopBackOffApplication errors, configuration issues, failed dependencieskubectl logs, kubectl describe pod
Services not reachableMissing or misconfigured Endpoints, Ingress issues, firewall ruleskubectl describe service, kubectl get endpoints
PVC stuck in PendingStorageClass issues, CSI driver problemskubectl describe pvc, kubectl describe storageclass
High resource usageMisconfigured limits, noisy neighbors, capacity limitskubectl top pods, kubectl top nodes

Useful tools and resources

  • kubectl: Primary CLI for Kubernetes troubleshooting.
  • k9s: Terminal UI to explore and troubleshoot Kubernetes resources.
  • Lens: Desktop Kubernetes dashboard with troubleshooting capabilities.
  • Cloud provider monitoring tools (e.g., AWS CloudWatch, GCP Monitoring, Azure Monitor).
  • Prometheus / Grafana: Advanced metrics collection and visualization.

Summary

Troubleshooting Kubernetes involves inspecting multiple layers of the system:

  • Nodes and cluster components
  • Workload Pods and their resource usage
  • Networking, Services, and Ingress
  • Storage and PersistentVolumeClaims
  • Configuration via ConfigMaps and Secrets

Develop a systematic approach using Kubernetes-native tools (kubectl, k9s), observability platforms (Prometheus, Grafana), and your cloud provider’s services to effectively diagnose and resolve issues in your Kubernetes environments.


Could this page be better? Report a problem or suggest an addition!