
How to achieve zero downtime on Kubernetes: a Day-2 architecture guide

Achieving zero-downtime deployments on Kubernetes requires more than running multiple pods. It demands a standardized architecture utilizing Pod Disruption Budgets (PDBs), precise liveness and readiness probes, pod anti-affinity, and graceful termination handling. At an enterprise scale, these configurations must be enforced via a centralized control plane to prevent catastrophic configuration drift.
March 27, 2026
Pierre Mavro
CTO & Co-founder

Key points:

  • Enforce redundancy globally: Running at least two replicas alongside a strict Pod Disruption Budget (PDB) is the non-negotiable baseline for surviving node failures and cluster maintenance.
  • Automate health diagnostics: Liveness and readiness probes dictate how Kubernetes routes traffic and self-heals broken pods during rolling updates.
  • Abstract the configuration toil: Managing these YAML configurations across thousands of clusters manually destroys engineering velocity. Centralized management platforms automate zero-downtime standards without expanding DevOps headcount.

Pulling a container image and deploying a pod is straightforward. Keeping that application highly available during node failures, traffic spikes, and infrastructure upgrades is a complex engineering challenge.

Kubernetes provides the native primitives required to achieve true zero-downtime deployments, but it does not apply them automatically. Engineering teams must explicitly define how the orchestrator handles traffic routing, health checks, and termination signals.

In this architectural guide, we define the strict Day-2 operational standards required to achieve zero downtime on Kubernetes, and how to scale these configurations across a global fleet.

The 1,000-cluster reality: standardizing zero downtime at scale

Configuring zero downtime for a single application is a routine technical task. Enforcing these configurations across thousands of microservices and hundreds of global clusters is a massive Day-2 operational liability.

Without an automated, centralized control plane, platform engineers must manually define and maintain Pod Disruption Budgets, affinity rules, and custom probes via disparate YAML files. This manual approach inevitably leads to configuration drift, dropped connections during scaling events, and prolonged outages during routine node drains. To survive at an enterprise scale, organizations must abstract this manual configuration away from developers, utilizing an agentic management platform to enforce standard zero-downtime rollouts automatically.

🚀 Real-world proof

Getsafe faced escalating costs, compliance hurdles, and critical downtimes on legacy infrastructure that halted their rapid scaling.

⭐ The result: By utilizing Qovery to abstract their Kubernetes deployments, Getsafe eliminated downtime during critical upgrades, reduced infrastructure costs, and achieved full regulatory compliance. Read the Getsafe case study.

1. Control your container image registries

In a production environment, relying on a public or unauthenticated image registry introduces an immediate single point of failure. If the external registry experiences an outage, or an image tag is overwritten, your pods stall in ImagePullBackOff, halting scaling events and rollbacks.

Enterprise platform teams must synchronize container images to private, dedicated registries hosted within their cloud provider account. A centralized control plane automates this process, ensuring that an unavailable external registry never impacts a live production workload.
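As an illustration, a Deployment can reference a private mirror in the team's own cloud account (the ECR URL, image tag, and Secret name below are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: enterprise-app
spec:
  template:
    spec:
      containers:
      - name: enterprise-app
        # hypothetical private mirror hosted in the team's own AWS account
        image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/enterprise-app:1.4.2
      imagePullSecrets:
      - name: private-registry-creds  # assumed Secret of type kubernetes.io/dockerconfigjson
```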

2. High availability through replicas

Relying on a single application instance guarantees downtime. A common misconception is that a single replica is safe because Kubernetes starts the new instance before shutting down the old one during rolling updates. While true for basic deployments, that protection does not extend to underlying infrastructure failures.

If a node crashes, or the cluster initiates a node drain (such as during an EKS upgrade), a single pod receives a SIGTERM signal and enters the Terminating state. The service stops sending traffic, resulting in immediate downtime while the scheduler waits to pull the image and attach disks on a new node. Running a minimum of two replicas is the absolute baseline for high availability.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: enterprise-app
spec:
  replicas: 2

3. Enforce pod disruption budgets (PDB)

A PodDisruptionBudget (PDB) limits how many pods an application can lose at once during voluntary disruptions, such as cluster maintenance or upgrades.

If an application runs three replicas, a PDB ensures that at least two pods remain active at all times, preventing the orchestrator from taking the entire service offline simultaneously.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: standard-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: enterprise-app

K8s Production Best Practices

Cut through the complexity. Get actionable configurations to slash cloud costs by 30-70%, prevent downtime, and lock down your cluster security.

Kubernetes Best Practices for Production

4. Configure rolling update strategies

Kubernetes Deployments support two strategies: Recreate (which shuts the application down entirely before starting the new version) and RollingUpdate.

To avoid downtime, RollingUpdate must be applied and tuned via the maxUnavailable and maxSurge parameters. These control the pace of the rollout, ensuring that enough existing pods remain active to handle traffic while new pods initialize.
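A minimal sketch of a tuned RollingUpdate strategy, reusing the enterprise-app Deployment from the earlier examples:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: enterprise-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never drop below the desired replica count
      maxSurge: 1        # add at most one extra pod while rolling
```

With maxUnavailable set to 0, old pods are only terminated after their replacements pass readiness checks, at the cost of briefly running one surplus pod.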

5. Automate deployment rollbacks

Kubernetes does not natively revert a failed deployment to its previous state. If an application crashes on boot, the rollout simply stalls.

At scale, platform teams must utilize centralized Day-2 platforms or deployment tools (like Helm or ArgoCD) configured with atomic rollbacks. If the health probes of the newly deployed pods fail to return a healthy status within the timeout period, the system must automatically terminate the new pods and restore traffic strictly to the previous stable version.

6. Master liveness and readiness probes

Probes are the diagnostic backbone of zero downtime.

  • Liveness probes dictate pod survival. If this probe fails, the kubelet kills the pod and restarts it with an exponential backoff.
  • Readiness probes dictate traffic routing. If this probe fails, the pod remains alive, but it is removed from the Service endpoints and immediately stops receiving requests.

A simple TCP check is insufficient for enterprise environments. Teams must configure custom HTTP endpoints within their applications to accurately reflect database connectivity and cache health.
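For example, a readiness probe can target a dedicated application endpoint (the /ready path and port below are assumptions; the endpoint itself should verify database and cache connectivity):

```yaml
readinessProbe:
  httpGet:
    path: /ready   # hypothetical endpoint checking database and cache health
    port: 8080
  periodSeconds: 5
  failureThreshold: 3  # removed from Service endpoints after ~15s of failures
```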

7. Tune the initial boot time delay

Heavy enterprise applications (like large Java Spring Boot services) can need significant CPU time to initialize before they are ready to accept traffic. If a liveness probe fires before the boot sequence completes, Kubernetes traps the pod in an infinite restart loop.

Use the initialDelaySeconds parameter to allow the application adequate time to boot before the kubelet begins polling for health. On Kubernetes 1.18 and later, a dedicated startupProbe is the more robust option, since it holds off the other probes until the first successful check.

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60

8. Handle graceful termination (SIGTERM)

Ignoring termination signals results in dropped user connections and corrupted database transactions. When Kubernetes terminates a pod, it sends a SIGTERM signal. The application must be programmed to intercept this signal, finish processing active HTTP requests, close database connections gracefully, and then exit.

If the application ignores the SIGTERM, Kubernetes waits for the terminationGracePeriodSeconds (defaulting to 30 seconds) before executing a hard SIGKILL.
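A common pattern, sketched below, pairs an extended grace period with a short preStop sleep so that endpoint removal can propagate through load balancers before the SIGTERM arrives (the durations are illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 60  # extend the 30s default for slow connection drains
  containers:
  - name: enterprise-app
    lifecycle:
      preStop:
        exec:
          # brief delay so traffic stops arriving before the SIGTERM is sent
          command: ["sh", "-c", "sleep 5"]
```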

9. Implement pod anti-affinity

Deploying 50 replicas provides no high availability if the scheduler places all 50 pods on the exact same physical node. Pod anti-affinity forces Kubernetes to distribute replicas across different nodes or availability zones.

  • Soft anti-affinity (preferredDuringSchedulingIgnoredDuringExecution) attempts to separate pods, but will co-locate them if node resources are exhausted. This keeps costs down because it never forces extra nodes to be provisioned.
  • Hard anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) strictly forbids pods from sharing a node, ensuring absolute isolation at the cost of requiring more underlying infrastructure.
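A sketch of soft anti-affinity for the enterprise-app Deployment used in the earlier examples:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: enterprise-app
        topologyKey: kubernetes.io/hostname  # spread across nodes; use topology.kubernetes.io/zone for AZ-level spread
```

Switching to requiredDuringSchedulingIgnoredDuringExecution (with no weight field) turns this into the hard variant described above.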

10. Define strict resource requests and limits

Failing to define CPU and memory requests and limits invites downtime. Without a memory limit, an application with a memory leak will eventually trigger an Out Of Memory (OOM) kill from the Linux kernel. Without CPU requests and limits, a single pod can monopolize node resources, starving critical system DaemonSets and leaving the node unresponsive.
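A per-container example (the values below are illustrative starting points, not recommendations; every workload must be profiled):

```yaml
resources:
  requests:
    cpu: 250m       # guides the scheduler's placement decisions
    memory: 256Mi
  limits:
    cpu: "1"        # capped via kernel CPU throttling
    memory: 512Mi   # exceeding this triggers an OOM kill of the container
```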

11. Configure horizontal pod autoscaling (HPA)

Autoscaling prevents downtime during severe traffic spikes by dynamically provisioning new replicas based on CPU utilization or custom metrics.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: enterprise-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: enterprise-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Autoscaling is not magic. It relies entirely on the successful configuration of probes, graceful terminations, and accurate resource requests. By abstracting these 11 configurations into a centralized platform engineering strategy, organizations can scale to thousands of clusters while guaranteeing zero downtime and eliminating manual YAML toil.


FAQs

Q: Why does a single Kubernetes replica cause downtime during node drains?

A: When a cluster initiates a node drain for maintenance or an upgrade, the active pod receives a SIGTERM signal and stops accepting traffic. If there is only one replica, the service experiences immediate downtime while the scheduler waits to provision a replacement pod on a new node. A minimum of two replicas is strictly required to maintain traffic routing during this hardware transition.

Q: What is the difference between a Liveness probe and a Readiness probe?

A: A liveness probe determines if a pod is healthy; if it fails, the kubelet kills and restarts the container. A readiness probe determines if the pod is capable of handling HTTP requests; if it fails, the pod remains alive, but the load balancer automatically stops routing user traffic to it until it recovers.

Q: How does Pod Anti-Affinity prevent Kubernetes outages?

A: If multiple replicas of an application are scheduled on the exact same physical node, a single hardware failure will take down all instances simultaneously. Pod anti-affinity rules force the Kubernetes scheduler to distribute replicas across different nodes or geographic availability zones, isolating the blast radius of hardware crashes.


