Max Veit

We in the Cloud Engagement Hub have a lot of experience with Kubernetes. Typical topics we are involved in are migrating and modernizing workloads as well as building new cloud-native workloads and running them in production afterwards. During many engagements with different clients, we have run into a lot of different issues and discovered best practices along the way. In this post, I want to share some of these best practices, specifically those related to running applications in a highly available fashion in a Kubernetes cluster. The post first walks through different failure causes, with brief explanations of how Kubernetes reacts and which issues arise. Afterwards, mitigation strategies for each of them are described.

For additional insight, read our earlier blog posts on Autoscaling in Kubernetes.

A node fails (physically)

This leads to the immediate shutdown / failure of all pods on that particular node. What happens afterwards is:

A) Kubernetes tries to re-schedule all pods that were on this node onto other nodes. It is important that the remaining nodes have sufficient capacity to host the re-scheduled pods.

B) If a pod cannot be re-scheduled due to missing resources, Kubernetes will wait until more resources become available or evict other pods based on their priorities.

C) The applications fail immediately; no graceful shutdown is possible. To guarantee that no requests are lost, the application initiating the request has to perform retries. This potentially includes a mobile app or web browser, since a request can be lost if, for example, the node crashes while an ingress pod on it is processing that request.

A node is drained, for example to install updates (Kubernetes or OS)

Case A) from the previous scenario applies here in the same way. Further, there are two other cases:

D) The services receive a SIGTERM signal and have the option to shut down gracefully. This allows a service to finish its current requests, making the shutdown transparent to any calling service.

E) By default, no new pod is instantiated before the old pod is terminated. This leaves your ReplicaSet with fewer pods for a short time.

An application (in a pod) fails

A single pod failure is noticed through the liveness and readiness probes. This requires you to set appropriate timings for them, to ensure that the outage is detected as fast as possible.

F) Once the liveness probe has failed often enough, Kubernetes kills and restarts the container; in the meantime, a failing readiness probe removes the pod from service routing.

Pods are re-scheduled due to changing load on the worker nodes / added worker nodes

For example, during a load spike or because a new application is being provisioned, your services might be re-scheduled to different nodes. This is the same as case D), where a graceful shutdown can apply if implemented. Furthermore, case E) applies in this scenario.

Strategies to tackle all of these scenarios are discussed in the remainder of this post.

For the sake of simplicity, and to keep this blog post from getting too long, we assume that you are operating a multi-master, multi-datacenter Kubernetes cluster. It is further assumed that all upfront load balancing and availability probing of services is in place. This blog post solely looks at the availability challenge inside a Kubernetes cluster.

ReplicaSets

Applicable cases: A)

ReplicaSets are the very basis of high availability in Kubernetes, nowadays handled mostly transparently through the Deployment definition. It is important to have more than one instance of each of your services to withstand unplanned outages and failures. Typically, each service should run at least three instances, sized so that one can fail and the remaining two can still cover the full workload.
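As a minimal sketch (the names and image are hypothetical), a Deployment with three replicas lets Kubernetes manage the underlying ReplicaSet and keep three instances running at all times:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3                  # one instance may fail, two remain to carry the full load
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:1.0   # hypothetical image
        ports:
        - containerPort: 8080
```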

PodDisruptionBudgets

Applicable cases: E)

For example, when you are shutting down worker nodes, or when the load of various containers changes, pods might be re-scheduled to other worker nodes. This can lead to pods being shut down without a replacement being available. It is especially problematic when you only have one instance of a pod, but it is also an issue during load situations or when, by coincidence, multiple of your pods are scheduled on the same node. To resolve this, PodDisruptionBudgets (PDBs) should be used. With a PDB you define how many pods have to be available at minimum at all times. If evicting a pod during such a voluntary disruption would drop the number of available pods below that minimum, the eviction is blocked until enough replacement pods are up and ready.

PDBs have no effect when an application or a whole node fails, e.g. because the application crashed or the node had a hardware failure.
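As a sketch, reusing the hypothetical my-service labels from above, a PDB that keeps at least two pods available during voluntary disruptions could look like this (maxUnavailable can be used instead of minAvailable if that fits your scaling better):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2              # voluntary disruptions (e.g. node drains) may never drop below two ready pods
  selector:
    matchLabels:
      app: my-service
```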

Pod(Anti)Affinity

Applicable cases: A)

When pods are scheduled by a Deployment without additional settings, they might land on any worker node. You might have the scaling set to two instances and still end up with both instances on the same worker node, which becomes inconvenient during worker node drains or failures. Therefore, you can implement pod affinity and anti-affinity rules, e.g. to spread your pods evenly across two different datacenters, to prevent pods from being scheduled together on the same node, or to enforce that pods are scheduled together on the same node.
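A sketch of such an anti-affinity rule, again assuming the hypothetical my-service labels, that keeps replicas off the same node (swap the topology key for topology.kubernetes.io/zone to spread across datacenters/zones instead):

```yaml
# excerpt of the Deployment's pod template spec
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-service
        topologyKey: kubernetes.io/hostname   # never place two my-service pods on the same node
  containers:
  - name: my-service
    image: registry.example.com/my-service:1.0
```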

Container Lifecycle

Applicable cases: D)

Frameworks like Spring Boot do not, by default, shut everything down gracefully. That means a request that is currently being processed is not guaranteed to finish before the relevant thread is killed. Basically, there are three strategies to solve this:

1. The failed call is simply retried by the caller, and this is a non-issue for you, since you don’t care that some requests are not finished.

2. You implement a graceful shutdown in the application that, for example, waits a certain amount of time for in-flight threads to finish before they are forcefully killed.

3. You utilize Kubernetes container lifecycle management to implement a “preStop” hook that, for example, executes a sleep command for a number of seconds (see the sketch at the end of this section).

Which of these strategies applies best to your case depends on the specific problem you are trying to solve. In most cases, I suggest either building the services so that they are resilient against failing calls to other services and simply perform a retry, or implementing a graceful shutdown within the application itself. Implementing the “preStop” hook in Kubernetes is either a best guess (you don’t know how long the wait actually has to be) or forces you to put more logic into your application, or at least your container, again.

Keep in mind that only option 1) would be helpful in case of a complete node failure (hardware).
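For option 3, a minimal sketch of such a “preStop” hook might look as follows; the sleep duration and grace period are best-guess assumptions, not measured values:

```yaml
# excerpt of the pod template spec
spec:
  terminationGracePeriodSeconds: 60     # must cover the preStop sleep plus the application's shutdown time
  containers:
  - name: my-service
    image: registry.example.com/my-service:1.0
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 15"]   # keep serving while the pod is being removed from routing
```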

Pod Priority and Preemption

Applicable cases: B)

Pod priority and preemption go hand in hand. Basically, you can assign priorities to pods. If resources are limited and a pod cannot be scheduled due to insufficient capacity, Kubernetes tries to preempt (evict) a lower-priority pod so that the original pod can be scheduled. If evicting such a pod would not free up enough resources to schedule the new pod, it is not evicted.

In general, you should always aim to have sufficient resources available in your cluster for all pods to be scheduled. Priorities become interesting, however, in case of a datacenter failure: if the remaining resources cannot run all applications, you might want to prioritize some services over others.
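As a sketch, a PriorityClass is created once and then referenced from the pod template of the workloads that should win in such a situation (the class name and value are assumptions):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 100000                    # higher value = higher priority
globalDefault: false
description: "Workloads that should keep running even when cluster capacity is reduced."
---
# excerpt of the pod template spec of the prioritized Deployment
spec:
  priorityClassName: business-critical
  containers:
  - name: my-service
    image: registry.example.com/my-service:1.0
```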

Retries

Applicable cases: C)

In a cloud-native container world, applications have to be built for failure, as there are many more moving parts than in traditional monolithic setups. For example, the pod you just tried to call from your pod might just have been evicted or re-scheduled, making your request fail. A simple solution is to implement retries: when a request made by one container fails, it is simply retried a certain number of times, typically with increasing wait times between tries. Service mesh frameworks like Istio can make this behavior application-agnostic by performing the retries outside of the actual application in a sidecar container, at least for HTTP requests.
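As a sketch of the service mesh variant, an Istio VirtualService can add retries for HTTP calls to a service without touching the application itself (the host name and retry conditions are assumptions to be adapted):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3                            # retry a failed request up to three times
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure     # which failures should trigger a retry
```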

Liveness, Readiness and Startup Probes

Applicable cases: F)

Liveness and readiness probes basically have two purposes:

  1. Detect that the container is up and running initially
  2. Detect that a previously running container is no longer available for processing

The purpose of the three probe types is as follows:

Startup probe: especially useful for applications with a long start-up time. This probe is only relevant during the initial start-up of the container. Once it succeeds, the liveness probe takes over. If the container does not start within the time given in the probe definition, it is killed and restarted. The liveness and readiness probes are deactivated until the startup probe has succeeded.

Liveness probe: determines if the container is generally healthy. If this probe fails often enough to hit the defined threshold, the container is restarted. It should be configured per application, to detect failures fast but not produce false negatives during load spikes.

Readiness probe: determines if the container is ready to process requests. If this probe fails, the container is marked as not ready and removed from service routing. Failure of this probe does not lead to restarts of the container. It can also be used to make a container temporarily unavailable on purpose, e.g. for a long-running process during which no new requests should be accepted.
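Putting the three probe types together, a container definition might look like the following sketch; the paths, port and timings are assumptions that have to be tuned per application:

```yaml
# excerpt of the pod template spec
containers:
- name: my-service
  image: registry.example.com/my-service:1.0
  startupProbe:
    httpGet:
      path: /health/started
      port: 8080
    periodSeconds: 10
    failureThreshold: 30        # allows up to 30 x 10s = 5 minutes for a slow start
  livenessProbe:
    httpGet:
      path: /health/live
      port: 8080
    periodSeconds: 10
    failureThreshold: 3         # restart the container after three consecutive failures
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 8080
    periodSeconds: 5
    failureThreshold: 3         # remove the pod from service routing, no restart
```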

Resource Requests and Limits

Applicable cases: A), B)

With requests and limits, you define which resources (typically CPU and memory) are guaranteed (requests) and which resources might be provided beyond that (limits). In a production environment, all pods and containers should have resource requests set, to ensure that no single misbehaving container consumes too many resources and affects others. With requests and limits set, a container with a memory leak, for example, might simply be terminated every few hours, which is annoying but keeps your cluster up and running. With no request (or limit) set, the container might consume all of the memory on its node until it crashes, affecting other services.

In general, the request defaults to the limit if no request is set. Also, only the request counts towards the scheduling budget of the node and towards quotas set on namespace level. Keep in mind that a pod is very likely to be terminated if it exceeds its memory limit, and it will likely be evicted if it uses more than its request and the node comes under resource pressure, for example because more resources are needed to schedule another pod.

In most environments, resource requests and limits for memory should be the same, while CPU can be overcommitted: exceeding the memory limit leads to termination, while excess CPU usage is simply throttled.
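A sketch of such a configuration, with memory request and limit set to the same value and CPU overcommitted (the numbers are assumptions and should be derived from observed usage):

```yaml
# excerpt of the pod template spec
containers:
- name: my-service
  image: registry.example.com/my-service:1.0
  resources:
    requests:
      memory: "512Mi"           # same as the limit, so memory is never overcommitted
      cpu: "250m"
    limits:
      memory: "512Mi"           # exceeding this gets the container OOM-killed
      cpu: "1"                  # CPU above the request is throttled, not killed
```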

Container resiliency and high availability in Kubernetes is a complex topic. All of the cases above have to be considered, and fine-tuning the various components is needed. For fast recovery, liveness and readiness probes should trigger restarts and re-routing as fast as possible without affecting pods that are actually still working. The same applies to retries, where there is a fine line between resilient behavior and overloading your already failing services with too many retries.