Common Issues with Kubernetes Deployments and How to Fix Them
Posted by ODSC Community, March 10, 2022
Kubernetes Deployments are the cornerstone of managing applications in a Kubernetes cluster at scale. They enable users to control how Pods get created, modified, and scaled within a K8s cluster. Moreover, Deployments allow users to properly manage application rollouts and updates while facilitating easy rollbacks when required.
These deployments must be handled with care, as they are a crucial step in the software delivery process. However, as with any application, Kubernetes deployments can also run into errors. In this post, we will look at common errors relating to Kubernetes deployments and how to fix them.
Invalid Container Image Configuration
This is the most common error that can occur when dealing with Kubernetes deployments. While there can be multiple causes for this error, the simplest is the user specifying an incorrect container image or tag that is not available in the container repository. Another reason for the unavailability of a container is using invalid credentials to authenticate against the repository. Since most production Kubernetes deployments use a private registry to maintain their container images, they require authentication before allowing access to images.
We can easily identify this error by looking at the Pod status. If it shows ErrImagePull or ImagePullBackOff, there was an error while pulling the image. We can then use the kubectl describe command to drill down into the error; the event log will display a message saying “Failed to pull image.”
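The status check and drill-down described above can be run against your cluster roughly as follows (the Pod name my-app-pod is a placeholder for your own workload):

```shell
# List Pods and check the STATUS column for ErrImagePull / ImagePullBackOff
kubectl get pods

# Inspect the Events section at the bottom of the output for
# a "Failed to pull image" message with the failing image reference
kubectl describe pod my-app-pod
```

The Events section of the describe output usually includes the exact image reference that failed, which is the quickest way to spot a typo in the registry, repository, or tag.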
Fixing Container Image Configurations
The simplest way to check whether we are pointing to a valid image is to pull the image in your local development environment using the docker pull command. If the image can be pulled, it is available in the repo, so you can limit your troubleshooting to checking whether the proper image registry is configured and the proper credentials are provided for authentication. The recommended method for specifying private-registry credentials in Kubernetes is adding an imagePullSecrets section to the manifest file. So have a look at the Secrets to verify that the correct credentials are configured there.
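As a sketch of what that looks like, a Pod manifest referencing a private registry via imagePullSecrets might resemble the following (all names and the registry URL here are illustrative, not from the original article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app                  # illustrative Pod name
spec:
  imagePullSecrets:
    - name: regcred                  # Secret of type kubernetes.io/dockerconfigjson
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2   # hypothetical private image
```

The referenced Secret can be created with the standard `kubectl create secret docker-registry` command; it must exist in the same namespace as the Pod before the Pod is created.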
If pulling with the specified tag fails, try pulling the image without an explicit tag. If that succeeds, the specified tag does not exist in the repository. This can be caused by a simple issue like a typo or a wrong tag, or by an image build error in your CI/CD pipeline leading to incorrect tagging or a build failure.
If you still cannot pull the image even without an explicit tag, the container image does not exist in the repository. First, check whether you are pointing to the correct registry in your YAML manifest. If so, check whether the image is available in the repo. If it is unavailable, that points to an outright build failure in your CI/CD pipeline, leading to the image never being pushed to the repository. An important thing to note when debugging these errors is that Kubernetes does not distinguish between different types of image-pulling issues: it reports the ErrImagePull status regardless of whether the cause is a missing image, a missing tag, or incorrect permissions.
Node Availability Issues
When deploying applications in Kubernetes, your containers get created on nodes. This can be a random node in the cluster or a specific node targeted using affinity or taints and tolerations. A node can be a physical machine, a virtual server, or a managed compute instance. When a node shuts down or crashes, its state changes to NotReady, indicating an issue with the node. Multiple issues can put a Kubernetes node into the NotReady state, such as the following:
- Lack of resources on the node to run the Pod. The node does not have adequate CPU, memory, or disk space to accommodate the resources requested by the Pod. This can be caused by the node already running a large number of Pods, or by non-Kubernetes processes running on the node and straining its resources.
- A kube-proxy problem causing networking issues. Kube-proxy runs on each node and manages network traffic between the node and resources both inside and outside the cluster. If this service terminates, all networking to and from the node is blocked, leading to networking issues within the cluster.
- A kubelet issue causing a communication breakdown between the control plane and the node. Kubelet runs on each node and makes the node part of the cluster. If this service crashes or becomes unavailable, the control plane can no longer recognize the node as part of the cluster.
Fixing Node Availability Issues
Run the kubectl get nodes command to check whether a node is facing an issue. It will list all the nodes in the cluster along with their states. Nodes showing the Ready state can run Pods, while nodes showing the NotReady state have a problem.
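In practice, the two commands used throughout the rest of this section look like this (the node name worker-node-1 is a placeholder):

```shell
# List all nodes and their STATUS (Ready / NotReady)
kubectl get nodes

# Drill into one node: the Conditions and Allocated resources
# sections cover the checks described below
kubectl describe node worker-node-1
```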
Lack of Resources
The easiest way to identify whether the node lacks resources is to run the describe command against the node and look at the Conditions section. The MemoryPressure, DiskPressure, and PIDPressure conditions indicate whether the node is running low on memory, disk space, or process IDs. If any of these conditions shows a status of True, that is where the node lacks resources. After identifying the root cause, users can add additional compute, memory, or disk capacity as needed. If the utilization does not match the Pods running on the node, it indicates either a Pod that is hogging resources or a background process outside of Kubernetes consuming most of the available resources. The latter can be fixed by stopping or limiting the resource usage of non-Kubernetes processes. If the culprit is a Pod, implementing requests and limits can prevent over-utilization of resources.
Additionally, the Allocated resources section indicates the resource requests and limits on the node; the limits will exceed one hundred percent if resources are overcommitted. This matters when managing resources for Pods and containers: the requests and limits specified in the manifest directly impact where Pods can be scheduled, and Kubernetes deployments will run into difficulties if no node can accommodate the resource request. Proper resource management is a preventive measure that stops nodes from running out of resources.
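A minimal sketch of the requests and limits discussed above, in a Pod manifest (the Pod name and values are illustrative; size them to your own workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app                # illustrative name
spec:
  containers:
    - name: web
      image: nginx:1.25
      resources:
        requests:              # what the scheduler reserves on a node
          cpu: "250m"
          memory: "256Mi"
        limits:                # hard cap; exceeding memory gets the container OOM-killed
          cpu: "500m"
          memory: "512Mi"
```

The scheduler only considers requests when placing the Pod, so a Pod whose requests fit no node will stay Pending; limits bound what the container may consume once running.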
Kubelet Issues
Again, we can use the describe node command and look at the Conditions section to identify kubelet issues. If all the conditions show up as Unknown, it points to an issue with kubelet. Unless the entire node is inaccessible, this can be easily fixed by logging into the node and running the systemctl status kubelet command to check the status of the service.
If the service state is “active (exited),” kubelet has exited due to some error; simply restarting the service will likely fix the issue. An “inactive (dead)” status indicates that kubelet crashed. Dig into the logs via journalctl to identify the reason behind the crash and fix it.
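The inspect, restart, and log-digging steps above, run on the affected node itself, look roughly like this (assuming kubelet runs as a systemd unit, which is the common setup):

```shell
# Check whether kubelet is active, exited, or dead
systemctl status kubelet

# Restart the service; often sufficient for the "active (exited)" case
sudo systemctl restart kubelet

# Dig into recent kubelet logs to find the cause of a crash
journalctl -u kubelet --since "1 hour ago" --no-pager
```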
Kube-Proxy and Connectivity Issues
First, ensure the node can reach the control plane and check whether the kube-proxy service is running on the node. The describe node command will reveal network issues: the NetworkUnavailable flag in the Conditions section will be set to True, indicating a connectivity issue.
Look at the events and logs of the failing kube-proxy Pod to drill down and identify the issue. Furthermore, check the status of the kube-proxy DaemonSet, as it is responsible for running kube-proxy on all nodes. Additionally, inspect the underlying network, including routing rules, security groups, and firewalls, and ensure that no network configuration is blocking communication between the node and the control plane.
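These checks can be run as follows; this assumes the standard setup where kube-proxy lives in the kube-system namespace with the k8s-app=kube-proxy label, which may differ on managed distributions:

```shell
# Verify the DaemonSet has the desired number of Pods ready on all nodes
kubectl get daemonset kube-proxy -n kube-system

# Pull recent logs from the kube-proxy Pods to find errors
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
```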
Container Startup Issues
Even if all the deployment configurations are correct, a bug in the containerized application can cause a deployment to fail. The most common scenario is the application crashing at startup, indicated by the CrashLoopBackOff error. This error means that even though Kubernetes is trying to launch the Pod, one of its containers keeps crashing. The issue can be frustrating because it cannot be caught at deployment time, leading to a false sense of a successful deployment.
Fixing Container Startup Issues
As with most troubleshooting, we can use the describe pod command to look at the Pod configuration and events. Look at the Containers section and check the state of the container. If it shows as Terminated and the Reason shows as Error, there is an issue with the underlying application. The Exit Code is a good place to start, as it indicates how the process exited.
Next, look at the application logs via the kubectl logs command pointed at the crashing Pod. However, logs will be unavailable if the container was freshly created or is not configured to write logs to stdout. If the container is fresh, use kubectl logs with the --previous flag to look at the logs of the previous container. In the worst case, if the application writes no logs at all, try running the container locally to see whether it starts without issue, and check whether external factors like environment variables or storage volumes are misconfigured in the cluster.
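The log-inspection sequence above looks like this in practice (my-app-pod is a placeholder for the crashing Pod's name):

```shell
# Find the container State, Reason, and Exit Code in the Containers section
kubectl describe pod my-app-pod

# Logs from the current container, if it lived long enough to write any
kubectl logs my-app-pod

# Logs from the previous, crashed container instance
kubectl logs my-app-pod --previous
```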
Missing or Misconfigured Configuration Files
Kubernetes provides ConfigMaps and Secrets as the recommended methods to pass data to applications. In Kubernetes deployments, users often forget to create the ConfigMaps and Secrets beforehand and reference a non-existent configuration file or an incorrect ConfigMap or Secret. This leads to Pods failing to get created. In most cases, Pods will be stuck in the ContainerCreating state or run into a RunContainerError.
Fixing Missing or Misconfigured Configuration Files
Using the pod describe command and looking at the event log will reveal what configuration reference is failing. Then start by looking at the YAML manifests and see if the ConfigMaps and Secrets are referenced in the correct locations. Further, check if they are created and available within the cluster.
How the Pod accesses these configuration files dictates how the issue can be fixed, as there are multiple ways to consume ConfigMaps and Secrets within a Pod. For example, if a ConfigMap is referenced as an environment variable and is unavailable in the targeted namespace, creating the ConfigMap will resolve the issue. When mounting a Secret as a volume, the Pod's event log will show a FailedMount event if the Secret is unavailable in the namespace; again, creating the Secret fixes the issue. When dealing with any configuration file, prevention is a better option than fixing issues after the fact, as it reduces undue deployment headaches.
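Both consumption patterns described above can be sketched in one manifest; all names here (app-config, app-credentials, the image) are illustrative, and both the ConfigMap and the Secret must exist in the Pod's namespace before the Pod starts:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: info
---
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
    - name: app
      image: registry.example.com/team/app:1.0   # hypothetical image
      env:
        - name: LOG_LEVEL
          valueFrom:
            configMapKeyRef:          # fails if app-config is missing
              name: app-config
              key: LOG_LEVEL
      volumeMounts:
        - name: creds
          mountPath: /etc/creds
          readOnly: true
  volumes:
    - name: creds
      secret:
        secretName: app-credentials   # FailedMount event if this Secret is missing
```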
Persistent Volume Mount Failures
The next common issue users face is referencing incorrect volumes in the deployment manifest. This is one of the simplest mistakes: an incorrect reference will lead to Kubernetes deployment issues regardless of whether you access volumes via PersistentVolumeClaims or by directly mounting a persistent disk. The Pod will be stuck in the ContainerCreating state when an unavailable volume is referenced in the YAML manifest. We can use the describe pod command and look at the Pod events to pinpoint this error. The events will show that the Pod was scheduled successfully, followed by FailedMount errors indicating a failure to mount the referenced volume.
Fixing Persistent Volume Mount Failures
The event log of the Pod will clearly indicate which volume caused the mount failure and the properties of the volume. These properties will indicate the volume configurations like the disk name and the zone, depending on the storage backend used for the volume. The user can simply create a volume that matches these properties and retry the deployment to fix the mount failures.
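For reference, the PersistentVolumeClaim pattern from this section might be sketched as follows; the names and storage size are illustrative, and the claimName in the Pod must match an existing PVC in the same namespace, or the Pod hangs in ContainerCreating with FailedMount events:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: data-pod
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-claim   # must match the PVC name exactly
```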
Many issues can arise while deploying applications in Kubernetes. This article has covered the common issues applicable to any Kubernetes deployment and the steps to mitigate them. Most of these issues can be fixed easily once users know how to pinpoint the problem and remedy it.
About The Author –
Hazel Raoult is a freelance marketing writer and works with PRmention. She has 6+ years of experience in writing about business, entrepreneurship, technology and all things SaaS. Hazel loves to split her time between writing, editing, and hanging out with her family.