Troubleshooting a Kubernetes Cluster: A How-To Guide

Introduction

Kubernetes has become the go-to orchestration platform for containerized applications because of its scalability and flexibility. However, managing a Kubernetes cluster isn’t always smooth, and diagnosing issues in one can be tricky. To help you sort out these issues and keep your cluster healthy, we’ll walk through some straightforward steps.

In this blog post, we will explore guidelines for establishing a robust troubleshooting workflow in a Kubernetes cluster.

How does Kubernetes troubleshooting work?

Kubernetes troubleshooting is a lot like solving a puzzle in your cluster. When something goes wrong, troubleshooting helps you figure out what’s broken and how to fix it. You start by checking for clues such as error messages or unusual behavior. Then you use tools like logs and metrics to dig deeper into what’s happening. Once you’ve narrowed down the culprit, you follow a step-by-step process to find the root cause. You identify the pieces, put them together, and eventually find the fix that gets your Kubernetes cluster and applications running smoothly again.

Troubleshooting Common Kubernetes Errors:

Here are simple steps to troubleshoot common Kubernetes errors.

1. Monitor Cluster Health

Implement monitoring tools to track the overall health of your cluster, including metrics like CPU usage, memory usage and application-specific metrics.

Use kubectl to check the current state of your cluster. For example, run kubectl get nodes to see the status of your nodes and kubectl get pods -n <namespace> to check the pods in a specific namespace.
If a control plane component is not running, view its logs for error messages with kubectl logs -n kube-system <pod-name>.

Node NotReady Error:

    $ kubectl get nodes
    NAME  STATUS    ROLES   AGE  VERSION
    cpu2  Ready     <none>  33d  v1.27.5
    cpu3  NotReady  <none>  33d  v1.27.5
    cpu1  Ready     <none>  33d  v1.27.5

Run kubectl describe node <node-name> to look for error messages.
Ensure the node has enough resources (CPU, memory) available.
Analyse system logs on the failed node to identify the cause of the issue.
Confirm that the node can communicate with the cluster and other nodes.
Check if the node has any taints that prevent pods from scheduling.
If the kubelet crashes or stops on a node, it cannot communicate with the API server, and the node goes into the NotReady state.
Sometimes, a simple node reboot can resolve the issue.
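
On a larger cluster, it can help to narrow the node listing down to the problem nodes. The following is a minimal sketch, assuming the default column layout of kubectl get nodes --no-headers (name in column 1, status in column 2); the function name not_ready_nodes is hypothetical.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  # A cordoned but healthy node shows "Ready,SchedulingDisabled",
  # so match on the prefix rather than the exact string.
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints can then be passed to kubectl describe node for a closer look.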

2. Inspect Pod Status

Check the status of individual pods in the cluster using kubectl get pods --all-namespaces. Here are some common pod error statuses.

The Pending status means the pod has been accepted by the cluster but cannot start yet, typically because it has not been scheduled onto a node or is waiting for the necessary resources to become available.
The CrashLoopBackOff status means a container repeatedly fails to start or crashes shortly after starting, so Kubernetes keeps restarting it with an increasing back-off delay. It typically indicates a failure in the application.
The ImagePullBackOff status means Kubernetes is unable to fetch the specified container image and is backing off between retries, often due to image unavailability or authentication issues.
The Terminating status means the pod is in the process of being gracefully terminated, and its containers are shutting down.
The OOMKilled status means one or more containers within the pod have been terminated due to out-of-memory (OOM) errors.
The ErrImagePull status means an attempt to pull the container image failed, often due to image unavailability or authentication issues. After repeated failures the status changes to ImagePullBackOff.
The Unknown status means Kubernetes cannot determine the pod’s current state, typically because it has lost contact with the node running the pod.
The CreateContainerConfigError status means Kubernetes hit an error while configuring a container within the pod.
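
To spot these statuses quickly across all namespaces, the pod listing can be filtered. This is a minimal sketch, assuming the default columns of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name unhealthy_pods is hypothetical.

```shell
# List pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods --all-namespaces --no-headers` on stdin.
unhealthy_pods() {
  # Column 1 is the namespace, column 2 the pod name, column 4 the status.
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```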

How to Fix It (Troubleshooting Steps):

    NAME          READY   STATUS    RESTARTS   AGE
    sample-pod    0/1     Pending   0          3m11s

To resolve a “Pending” state, consider the following steps:

Inspect the pod’s resource requests and limits defined in its YAML configuration. Verify that at least one node has enough free resources to schedule the pod. You can check node resource allocation with kubectl describe node <node-name>.
Check taints and tolerations, which restrict which pods can be scheduled on specific nodes. Run kubectl describe pod <pod-name> or kubectl get events to review scheduling events; kubectl logs only becomes useful once a container has started.
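
The scheduler usually records why a pod is still Pending in the pod’s events. The following sketch gathers the description and related events in one step; the function name pending_pod_report and the default namespace fallback are assumptions for illustration.

```shell
# Show why a pod is stuck in Pending: its description plus related events.
# $1 = pod name, $2 = namespace (defaults to "default").
pending_pod_report() {
  pod="$1"
  ns="${2:-default}"
  kubectl describe pod "$pod" -n "$ns"
  # Events that reference this pod, oldest first.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.metadata.creationTimestamp
}

# Usage (requires cluster access):
#   pending_pod_report sample-pod
```

Messages such as insufficient CPU or memory, or untolerated taints, typically show up in the events at the end of the output.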

				
    NAME        READY   STATUS             RESTARTS   AGE
    test-pod    0/1     CrashLoopBackOff   2          5m11s

To resolve a “CrashLoopBackOff” state, consider the following steps:

Start by examining the pod’s logs. Use kubectl logs <pod-name> to gain insight into the specific errors causing the crashes. Verify whether the pod has sufficient resources allocated, including CPU and memory.
Verify that the container image and application code are error-free and compatible with the environment.
Check the security context of your pod, and ensure that the permissions and security settings are correctly defined.
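
When a container is restarting, the current instance may not have logged anything useful yet, so the logs of the previous instance are often more informative. A minimal sketch using the --previous flag of kubectl logs; the function name crashed_container_logs and the 50-line tail are arbitrary choices for illustration.

```shell
# Print the last lines logged by the previous (crashed) container instance.
# $1 = pod name, $2 = namespace (defaults to "default").
crashed_container_logs() {
  kubectl logs "$1" -n "${2:-default}" --previous --tail=50
}

# Usage (requires cluster access):
#   crashed_container_logs test-pod
```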

				
    NAME      READY   STATUS             RESTARTS   AGE
    nginxx    0/1     ImagePullBackOff   0          6m11s

To resolve an “ImagePullBackOff” state, consider the following steps:

If the image is in a private registry, verify that you have the necessary credentials (username and password or token) configured in Kubernetes secrets.
Use kubectl describe secret <secret-name> to confirm the secret details. Ensure that the image specified in your pod’s YAML file exists in the specified container registry.
Review the imagePullPolicy in your pod definition. If it is set to IfNotPresent or Never, Kubernetes won’t attempt to pull a new image when one is already present on the node. Consider changing this policy to Always if you want to ensure the latest image is always pulled. If you are using a specific tag or version of an image, verify that it’s available.
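
To confirm what the pod is actually configured to pull, you can print its image references and pull secret names directly. This is a sketch, assuming standard pod spec fields; the function name pod_image_config is hypothetical.

```shell
# Print the image references (line 1) and imagePullSecrets names (line 2) of a pod.
# $1 = pod name, $2 = namespace (defaults to "default").
pod_image_config() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.spec.containers[*].image}{"\n"}{.spec.imagePullSecrets[*].name}{"\n"}'
}

# Usage (requires cluster access):
#   pod_image_config nginxx
```

An empty second line means the pod spec lists no pull secrets, so any credentials would have to come from its service account or from the node.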

				
    NAME           READY   STATUS        RESTARTS   AGE
    demo-5h2ac     1/1     Terminating   0          1h

To resolve a “Terminating” state, consider the following steps:

A quick workaround is to force-delete the pod using kubectl delete pod <pod-name> --grace-period=0 --force. Use it with care, because the containers may keep running on the node for a while after the pod object is removed.
If the termination issue is caused by the application itself, dig into the application logs with kubectl logs <pod-name> and into the code to identify and fix the underlying problem.

				
    NAME   READY   STATUS         RESTARTS   AGE
    test   0/1     ErrImagePull   0          9m11s

To resolve an “ErrImagePull” state, consider the following steps:

Start by examining the pod’s YAML configuration file. Ensure that the image field specifies the correct image name, including the registry, repository, and tag, and that image pull credentials are referenced if the registry requires authentication.
If you are using a secret for image authentication, ensure that it is correctly associated with the service account used by the pod.
Confirm that the container image specified in the pod configuration exists in the image registry (e.g., Docker Hub, Google Container Registry, or a private registry).

				
    NAME    READY   STATUS      RESTARTS   AGE
    my-pod  0/1     OOMKilled   0          12m8s

To resolve an “OOMKilled” state, consider the following steps:

Start by determining why the pod ran out of memory. Check the application logs for clues. If the pod’s resource limits were too low, increase them to a level that allows the application to run without running out of memory. You can modify the pod’s YAML file or, if the pod is managed by a controller, edit the owning resource, for example with kubectl edit deployment <deployment-name>.
You can restart the OOM-killed pod manually using the kubectl delete and kubectl apply commands. Be sure to use the updated configuration with resource adjustments.
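
If the pod is managed by a Deployment (an assumption; standalone pods need the delete-and-apply route instead), the memory limit can also be raised from the command line with kubectl set resources. A minimal sketch; the function name raise_memory_limit and the example values are hypothetical.

```shell
# Raise the memory limit of one container in a Deployment.
# $1 = deployment name, $2 = container name, $3 = new limit (for example 512Mi).
raise_memory_limit() {
  kubectl set resources deployment "$1" -c "$2" --limits="memory=$3"
}

# Usage (requires cluster access; names and value are placeholders):
#   raise_memory_limit my-app app 512Mi
```

Changing the pod template this way triggers a rolling update, so the replacement pods start with the new limit.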

				
    NAME            READY   STATUS    RESTARTS   AGE
    test-pod-7h5a4  0/1     Unknown   0          9m10s

To resolve an “Unknown” state, consider the following steps:

Check the status of the nodes in your cluster using kubectl get nodes to ensure they are healthy and not in a “NotReady” state.
Run kubectl describe pod <pod-name> -n <namespace> to gather information about the pod’s recent events.

				
    NAME           READY   STATUS                       RESTARTS   AGE
    sample-pod1    0/1     CreateContainerConfigError   0          1m7s

To resolve a “CreateContainerConfigError” state, consider the following steps:

Review the pod’s YAML configuration file to ensure that all settings are correct. Verify the resource requests and limits, and check that volume mounts, environment variables (including any referenced ConfigMaps or Secrets), command/args, and the security context are correctly defined.
Use kubectl describe pod <pod-name> and kubectl logs <pod-name> to get detailed information on the specific issue.

3. Check Networking Issues

Network problems are common. Verify that container network plugins like Calico or Flannel are correctly deployed. Confirm that network policies align with your application requirements to control traffic. Regularly monitor network performance and resolve any issues promptly to maintain cluster stability. Use tools like ping, nslookup, or curl within pods to test network connectivity.

Here are some common errors for each service type and how to resolve them.

ClusterIP Service

1. Service Not Reachable

Error: Pods can’t reach the ClusterIP service.

Check that the ClusterIP service and pods are in the same namespace. Ensure the service’s selector matches the labels of the pods, and verify that the pods are running and healthy.
Check the spec.ports section of the ClusterIP service YAML. Ensure the service’s targetPort matches the port the container actually listens on.
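
A selector that matches no pods shows up as a service with no endpoints. The following sketch filters the endpoints listing for that case, assuming the default columns of kubectl get endpoints --no-headers (NAME, ENDPOINTS, AGE); the function name services_without_endpoints is hypothetical.

```shell
# Print the names of services that have no backing endpoints.
# Reads `kubectl get endpoints --no-headers` on stdin.
services_without_endpoints() {
  # kubectl prints "<none>" in the ENDPOINTS column when no pod matches.
  awk '$2 == "<none>" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get endpoints -n <namespace> --no-headers | services_without_endpoints
```

For any service it prints, compare the service’s selector with the pod labels shown by kubectl get pods --show-labels.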

NodePort Service

2. NodePort External Access:

Error: The NodePort service is not accessible from outside the cluster, or the port is already in use.

Ensure your cluster nodes are in a public network or have proper firewall rules to allow external traffic. Verify that the service is running correctly using kubectl get svc.
Check the labels and selectors in the service definition to ensure they match the labels on the pods. And double-check the service configuration, ensuring that the type field is set to NodePort.
Check the service’s endpoints using kubectl describe svc <service-name> to see if the pods are correctly associated.

LoadBalancer Service

3. Pending LoadBalancer IP:

				
    $ kubectl get svc -n argocd argocd-svc
    NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
    argocd-svc   LoadBalancer   10.105.37.173   <pending>     80:30047/TCP   110s

Error: The LoadBalancer service is sometimes stuck in a “Pending” state.

Ensure that your Kubernetes cluster has integration with a cloud provider that supports LoadBalancer services.
Check for cloud resource quotas, as they may limit LoadBalancer creation.
If you run a private Kubernetes cluster without a cloud provider integration, a load balancer implementation such as MetalLB would be a better fit.
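
To find every LoadBalancer service that is still waiting for an address, the service listing can be filtered. A minimal sketch, assuming the default columns of kubectl get svc --all-namespaces --no-headers (NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE); the function name pending_load_balancers is hypothetical.

```shell
# Print namespace/name of LoadBalancer services without an external IP yet.
# Reads `kubectl get svc --all-namespaces --no-headers` on stdin.
pending_load_balancers() {
  # kubectl shows "<pending>" in EXTERNAL-IP until an address is assigned.
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}

# Usage (requires cluster access):
#   kubectl get svc --all-namespaces --no-headers | pending_load_balancers
```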

Ingress Controller

4. Ingress-related Errors:

Error: The Ingress resource is created, but it doesn’t route traffic as expected.

Verify that you have an Ingress controller installed and running in your cluster (e.g., NGINX Ingress Controller, Traefik, HAProxy, etc.).
Ensure that DNS resolution is working correctly. Ingress rules route traffic based on hostnames, so each hostname must resolve to the address of the Ingress controller.
Check for syntax errors, typos, and incorrect indentation in the YAML file. Ensure that the Ingress resource is correctly defined in your cluster.

5. Kubelet Issue:

Error: The kubelet service shows a “Not Running” status.

Check if the kubelet service is running with systemctl status kubelet. If it’s not running, start it using systemctl start kubelet.
Check the kubelet logs for errors with journalctl -u kubelet. You can try restarting the kubelet service with systemctl restart kubelet. Sometimes this resolves the error.
The kubelet configuration file is typically located in /etc/kubernetes/kubelet.conf or /var/lib/kubelet/config.yaml depending on your setup. Check the configuration file for any errors or misconfigurations.
Check if swap is disabled on your node, for example with swapon --show, which prints nothing when no swap is active. If swap is enabled, you can turn it off with sudo swapoff -a.
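
On Linux, /proc/swaps lists the active swap devices, so the swap check can be scripted. This is a sketch, assuming the usual /proc/swaps layout of one header line followed by one line per device; the function name swap_active is hypothetical.

```shell
# Succeed (exit 0) if swap is active, fail (exit 1) otherwise.
# Reads /proc/swaps-style text on stdin.
swap_active() {
  # Any non-empty line after the header means a swap device is in use.
  awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }'
}

# Usage on a node:
#   swap_active < /proc/swaps && echo "swap is enabled"
```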

6. Storage Issues:

Error: PersistentVolume (PV) and PersistentVolumeClaim (PVC) mismatches. When a PVC does not bind to a PV, or the PVC’s capacity is insufficient, your application may not have the required storage.

Check the PVC status using kubectl get pvc to ensure it is in the “Bound” state.
Verify that the storage requested in the PVC does not exceed the capacity of the available PVs. Create or resize a PV to match the PVC’s requirements.
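
To pick out the claims that are not bound, the PVC listing can be filtered on its STATUS column. A minimal sketch, assuming the default output of kubectl get pvc --no-headers (name in column 1, status in column 2); the function name unbound_pvcs is hypothetical.

```shell
# Print the name and status of every PVC that is not Bound.
# Reads `kubectl get pvc --no-headers` on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 ": " $2 }'
}

# Usage (requires cluster access):
#   kubectl get pvc -n <namespace> --no-headers | unbound_pvcs
```

Running kubectl describe pvc on a listed claim usually shows the reason in its events.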

Error: Incorrect storage classes can cause PVC provisioning to fail.

Verify that the desired storage class exists using kubectl get storageclass.
Ensure the corresponding provisioner is correctly configured. Use the correct storage class in your PVC.

Error: If the application containers cannot mount the volumes correctly, they may not function as expected.

Review your pod’s configuration to ensure the volume mount paths are correct.