We experienced a production incident in which a Kubernetes HTTP liveness probe, paired with a Pod CPU limit, created a feedback loop that could keep the Pod down indefinitely in a CrashLoopBackOff state. We resolved the instability by changing only the Deployment configuration, with no changes to application code. We’re sharing this story because the same pattern is likely a common risk for anyone deploying services behind automated health checks, whether on Kubernetes or on other platforms.
These days, almost everything is a distributed system. Distributed systems are full of feedback loops. In our case, even a single backend process serving an internal HTTP API effectively became a feedback loop because it was monitored by a Kubernetes HTTP liveness probe, which kills and restarts the process if it fails to answer a health check HTTP request in a timely manner.
Feedback loops are great, and the intention of this liveness probe is to automatically improve the stability of the service by restarting when it goes down. That’s a great use case for a feedback loop! But as engineers know, feedback loops have a dark side: they can become unstable and can force the system away from stability.
Here’s what happened in our case:
- We’d roll out the Deployment for an internal stateful HTTP server, and it would work just fine for many days.
- Occasionally, due to a burst of demand, this server would become busy.
- A CPU limit on the Pod was sized adequately for steady-state operation, but would throttle the process during larger transients.
- As the process was busy and being throttled, a few consecutive Kubernetes HTTP liveness probes would hit their default 1-second timeout, causing kubelet to restart the Pod.
- In the time that it took the Pod to restart, the consumers of this service would build up an even larger backlog of work.
- When the Pod restarted, it was immediately deluged with requests as it tried to catch up with the backlog.
- Because of the much higher than steady-state load, the HTTP liveness probes against the Pod would often fail again in this catch-up period, triggering yet another restart from kubelet.
- Kubernetes, on seeing repeated restarts due to liveness failures, would put the Pod into CrashLoopBackOff, an exponentially increasing series of delays (10s, 20s, 40s, …) that would keep the Pod down for even longer between restarts.
- This increasing back-off delay led to an even greater backlog of work, and so an even smaller chance that a liveness probe would succeed.
As a result, without human intervention, the service would stay down permanently because of the liveness probe, the very mechanism that was intended to make sure the service stayed up!
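To get a feel for how quickly the back-off delays in the list above add up, here is a minimal sketch of the doubling sequence. The 10-second starting delay comes from the sequence quoted above; the 300-second cap is taken from the Kubernetes documentation rather than from our incident, and the function name is just for illustration.

```python
def crash_loop_delays(restarts, base_s=10, cap_s=300):
    """Back-off delay (seconds) applied before each of the first `restarts` restarts.

    Assumes the delay starts at base_s, doubles after every restart,
    and is capped at cap_s (five minutes per the Kubernetes docs).
    """
    delays = []
    delay = base_s
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap_s)
    return delays


if __name__ == "__main__":
    delays = crash_loop_delays(7)
    print(delays)       # [10, 20, 40, 80, 160, 300, 300]
    print(sum(delays))  # 910 seconds of pure waiting across seven restarts
```

Every one of those waiting seconds is time in which consumers keep adding to the backlog.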
It’s interesting to note that this interaction was due to the combination of the liveness probe and the CPU limit. If we had only the liveness probe, and no CPU limit, it’s likely that any backlogs would be processed fast enough. And if we only had the CPU limit and no liveness probe, it’d manage the steady-state load just fine. Only when these two configurations were combined did we experience an unstable Deployment.
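The interaction can be illustrated with a deliberately crude toy model. This is not a description of how kubelet works internally, and every number in it is hypothetical: requests arrive at a constant rate, the server drains its queue at a fixed capacity (lower when a CPU limit throttles it), a probe is assumed to wait roughly `backlog / capacity` seconds, and the container is killed if the probe would still time out after `failureThreshold` probe periods.

```python
def simulate(arrival_rate, capacity, timeout_s, failure_threshold,
             initial_backlog, period_s=10.0, restart_s=5.0, max_cycles=4):
    """Toy model of the restart loop. Returns (backlog_history, recovered).

    arrival_rate / capacity are in requests per second (hypothetical units).
    restart_s is an assumed fixed time for the process to come back up.
    """
    backlog = float(initial_backlog)
    backoff_s = 10.0  # first CrashLoopBackOff delay
    history = []
    for _ in range(max_cycles):
        # While up, the server gets failure_threshold probe periods to catch up.
        grace_s = failure_threshold * period_s
        backlog = max(0.0, backlog - (capacity - arrival_rate) * grace_s)
        history.append(backlog)
        # A probe queued behind the backlog waits about backlog / capacity seconds.
        if backlog / capacity <= timeout_s:
            return history, True
        # Otherwise the container is killed; work piles up while it is down.
        backlog += arrival_rate * (restart_s + backoff_s)
        backoff_s = min(backoff_s * 2, 300.0)
    return history, False


if __name__ == "__main__":
    # CPU limit (capacity 110 req/s) plus default probe (1 s timeout, 3 failures)
    print(simulate(100, 110, 1, 3, initial_backlog=1000))
    # Same CPU limit, lenient probe (5 s timeout, 10 failures)
    print(simulate(100, 110, 5, 10, initial_backlog=1000))
    # Default probe, no CPU limit (capacity 200 req/s)
    print(simulate(100, 200, 1, 3, initial_backlog=1000))
```

With these made-up numbers, the first configuration never recovers: the backlog at the end of each attempt grows from 700 to 1900, 4100, and 8300 requests. Either change on its own lets the server drain the backlog within a single attempt.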
Stabilizing the System
Here are the three changes we made to resolve this instability:
- Removed CPU limits on all Pods. We still set CPU requests, but no limits. (There seems to be little upside and much downside to setting CPU limits: see this article about how CPU requests without limits provide CPU sharing fairness with burstability.)
- Longer timeout for liveness probes. The default timeoutSeconds is only 1 second, and we increased this to 5 seconds.
- Higher consecutive failure count for liveness probes. The default is to restart the Pod after three failures, but we raised failureThreshold to require 10 consecutive failed liveness probes.
With more lenient liveness probes, the process is far less likely to be inadvertently killed by kubelet unless it’s truly not making any progress. With no CPU limit, the backlog can be cleared faster after any restart or redeploy by using the node’s full available CPU.
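For concreteness, the three changes might look roughly like the container fragment below, built as a Python dictionary and printed as JSON (kubectl accepts JSON manifests as well as YAML). Only `timeoutSeconds: 5`, `failureThreshold: 10`, and the absence of a CPU limit come from our actual change; the container name, image, CPU request, health check path, and port are hypothetical placeholders.

```python
import json


def container_spec(name, image, cpu_request="500m",
                   health_path="/healthz", port=8080):
    """Container fragment of a Deployment's Pod template (illustrative only)."""
    return {
        "name": name,
        "image": image,
        "resources": {
            # CPU request only: deliberately no "limits" entry for CPU.
            "requests": {"cpu": cpu_request},
        },
        "livenessProbe": {
            "httpGet": {"path": health_path, "port": port},
            "timeoutSeconds": 5,     # Kubernetes default is 1
            "failureThreshold": 10,  # Kubernetes default is 3
        },
    }


if __name__ == "__main__":
    spec = container_spec("internal-api", "example.com/internal-api:1.0")
    print(json.dumps(spec, indent=2))
```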
timeoutSeconds is especially important to look at in the overload case. Under typical steady-state operation, the service would respond to its liveness check endpoint in perhaps a millisecond or two, so the default 1 second sure seemed like enough. But during an overload, the kubelet’s liveness probe HTTP request would be queued behind lots of other ordinary requests, and so would sometimes take a second or two to respond. Even if the HTTP server was genuinely making progress, the request queue could make it appear to the liveness probe that it wasn’t.
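A back-of-the-envelope version of that queueing effect, assuming a single first-in-first-out queue and a fixed per-request service time (both simplifications; a real server's concurrency model will differ). The queue depth and the throttling fraction below are illustrative numbers, not measurements from our incident.

```python
def probe_wait_s(queued_requests, service_time_s, cpu_fraction=1.0):
    """Approximate seconds a probe waits behind `queued_requests` in a FIFO queue.

    cpu_fraction models CPU throttling: 0.5 means the process only gets
    half of the CPU time it would need to run at full speed.
    """
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    return queued_requests * service_time_s / cpu_fraction


if __name__ == "__main__":
    default_timeout_s = 1.0
    for queued, fraction in [(0, 1.0), (1000, 1.0), (1000, 0.5)]:
        wait = probe_wait_s(queued, service_time_s=0.002, cpu_fraction=fraction)
        verdict = "fails" if wait > default_timeout_s else "passes"
        print(f"{queued} queued, cpu x{fraction}: ~{wait:.1f}s, probe {verdict}")
```

With a 2 ms service time, a queue of about a thousand requests is already enough to push the probe past the 1-second default, even though the server is making steady progress.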
It can be frustrating to tune all of these “analog” knobs and magic numbers (timeouts, failure counts, and resource requests and limits) just to get software deployed. Yet in our case those knobs were the entire fix: no changes to any application code were needed to stabilize this service!
Even if you’re not using Kubernetes directly, you may be vulnerable to this class of issue:
- Cloud load balancers do health checks, which could disqualify and remove a backend that is slow to respond.
  - If this happens, you’re vulnerable to the same undesirable feedback loop, where the remaining backends each get a larger share of requests, potentially getting overloaded themselves.
- In other cases, a PaaS may monitor your HTTP endpoint directly and have its own timeouts and restart policies, explicitly or implicitly.
  - If it doesn’t, that’s an extra reason to make sure you have an external monitoring and alerting system in place.
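The load balancer variant of the loop comes down to simple arithmetic. Here is a minimal sketch, assuming traffic is spread evenly across healthy backends and that a backend is removed as soon as its share exceeds what it can serve; the request rates are made up for illustration.

```python
def cascade(total_rps, capacity_rps, backends):
    """Per-backend load after each removal, assuming even load balancing.

    Stops once the remaining backends can cope, or when none are left.
    """
    shares = []
    healthy = backends
    while healthy > 0:
        share = total_rps / healthy
        shares.append(share)
        if share <= capacity_rps:
            break  # the remaining backends can absorb the load
        healthy -= 1  # an overloaded backend fails its health check
    return shares


if __name__ == "__main__":
    print(cascade(900, 250, 4))  # [225.0]: four backends are comfortable
    print(cascade(900, 250, 3))  # [300.0, 450.0, 900.0]: losing one takes down the rest
```

In this example, four backends handle the traffic with room to spare, but once a single slow backend is removed, each remaining one is pushed over capacity in turn.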
But there seems to be no avoiding the fact that designing reliable distributed systems requires a large number of manually-specified timeouts. For example, our Heii On-Call users have to make deliberate tradeoffs when they specify alerting timeouts. Whether you’re monitoring background processes, cron jobs, websites, or API endpoints, if the specified timeout is too short, the system doesn’t have time to fix and restabilize itself before an alert is sent. In this case, the engineer on-call will get paged far too often! (This may lead to a human feedback loop of your team starting to ignore all the alerts from your monitoring system…) Of course, if the timeout is too long, downtime that truly requires human intervention will be allowed to persist longer than necessary.
It all boils down to a classic statistical tradeoff of false-negative vs. false-positive error rates. Both types of errors are bad, but they usually aren’t equally bad. Errors are especially bad when a false positive (a liveness probe failure when the system is merely catching up on a backlog) triggers an undesirable, self-reinforcing feedback loop (restarting unnecessarily, and therefore increasing the backlog), as we experienced. Kubernetes liveness probes have a default timeout of just 1 second, and require just three consecutive failures to trigger a restart. These defaults may be too aggressive for many services to achieve stable operation, especially if CPU limits are applied that don’t allow for bursty workloads or for catching up with a backlog when a restart does happen.
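To put rough numbers on how aggressive the defaults are, the sketch below estimates how long a container can be unresponsive before kubelet restarts it. It assumes the probe's periodSeconds is left at the Kubernetes default of 10 (we did not mention changing it above) and that a probe counts as failed once its timeout expires.

```python
def restart_window_s(timeout_s, failure_threshold, period_s=10):
    """(earliest, latest) seconds from the start of an outage to a restart.

    Earliest: the outage begins just as a probe fires.
    Latest: the outage begins just after a probe succeeded, so nearly a
    full period passes before the first failing probe.
    """
    earliest = (failure_threshold - 1) * period_s + timeout_s
    latest = failure_threshold * period_s + timeout_s
    return earliest, latest


if __name__ == "__main__":
    print(restart_window_s(timeout_s=1, failure_threshold=3))   # (21, 31)
    print(restart_window_s(timeout_s=5, failure_threshold=10))  # (95, 105)
```

Under these assumptions the defaults give an overloaded process only about 20 to 30 seconds to catch up, while a 5-second timeout with 10 failures gives it well over a minute and a half.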
As a rule, we are removing all CPU limits and increasing failureThreshold values across all our services, and only using lower values if the situation absolutely requires it.