
Everything you need to know about Kubernetes autoscaling at fleet scale

When engineers configure pod autoscaling, they instinctively tie the Horizontal Pod Autoscaler (HPA) to CPU utilization. If the application is actually bound by memory or downstream database connections, the cluster sits idle while incoming requests time out. Diagnosing hundreds of outages reveals a clear pattern: effective elasticity requires scaling on custom application queues, not just default hardware thresholds.

The Qovery Team
MAR 30, 2026 · 10 MIN

Key Points

  • Identify infrastructure bottlenecks: Scaling horizontally on CPU is useless for memory-bound applications. Scale on the application's actual constraint instead.
  • Eliminate boot delay: Node provisioning on AWS EKS or GCP GKE takes minutes. Implement priority-class overprovisioning to ensure instant scaling availability.
  • Master custom metric scaling: Use the Prometheus Adapter to scale deployments based on high-fidelity business metrics like active task queues or connections.



Kubernetes is the industry standard for container orchestration, and dynamic autoscaling is its most critical feature for handling high-traffic enterprise workloads. Configuring an application to scale automatically is a fundamental task. Configuring it to scale efficiently without wasting cloud resources or causing API timeouts is a complex operational burden.

This guide explores the mechanics of pod autoscaling, how platform architects identify scaling bottlenecks, and advanced techniques for eliminating node boot delays across multi-cloud environments.

The 1,000-cluster reality: why standard autoscaling fails

Applying a basic HPA CPU threshold works in isolated development environments. In the enterprise reality, scaling across hundreds of clusters spanning AWS and GCP introduces severe timing constraints. Autoscaling is not instantaneous. If a traffic spike exhausts current node resources, Kubernetes must request a new worker node from the cloud provider.

On AWS EKS, this boot time averages two minutes. Once the node boots, pulling a large container image takes additional minutes. In production, a four-minute scaling delay means dropped requests and severe service degradation. Managing cluster state globally at this scale demands automation that scales predictively, not reactively.


The three dimensions of scaling

To manage infrastructure effectively, platform teams must balance three types of scaling mechanisms.

Horizontal scaling (scaling out)

The application handles increased workload by adding more pod instances. This is managed by the Horizontal Pod Autoscaler (HPA).
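As a baseline, a minimal CPU-based HPA looks like the sketch below; the deployment name and thresholds are illustrative, not from a real workload:

YAML|Basic CPU-based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75

This is the default pattern most teams start with, and exactly the one the rest of this article argues is insufficient on its own.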

Vertical scaling (scaling up)

The application scales by increasing the CPU or memory limits of the existing pods. This is managed by the Vertical Pod Autoscaler (VPA) and is essential for legacy applications that cannot run multiple instances in parallel.
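A minimal VPA manifest looks like the sketch below. Note that the VPA is an add-on (installed via its own operator), not part of core Kubernetes; the target name is illustrative:

YAML|Basic VPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: legacy-billing
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: legacy-billing
  updatePolicy:
    updateMode: "Auto"  # recreate pods with updated resource requests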

Multi-dimensional scaling

This combines horizontal and vertical scaling simultaneously. It requires careful governance to ensure the HPA and VPA do not conflict and drive runaway cloud costs. If both autoscalers react to the same CPU metric, each controller fights the other's adjustments and the cluster can enter a destructive scaling loop.
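One common way to avoid this conflict is to split the metrics between the two controllers: let the HPA own CPU and restrict the VPA to memory. A sketch, assuming the VPA add-on is installed (the `controlledResources` field is part of the VPA CRD's resource policy):

YAML|VPA restricted to memory to avoid HPA conflict
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: workload-engine
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: workload-engine
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]  # leave CPU scaling to the HPA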

Identifying application limits and bottlenecks

Before applying autoscaling policies, platform engineers must identify the application constraints. Setting an HPA based on CPU utilization will fail if the application is actually bound by memory or network I/O.

  • CPU: Calculation, data compression, and map-reduce tasks. Scaling horizontally on CPU is straightforward.
  • Memory: Data caching and in-memory computing. Memory generally scales vertically. To scale horizontally, the application must be re-architected to distribute workloads.
  • Disk I/O: Big data storage and database flushing. Local disk is performant, but distributed storage is required for horizontal scaling.
  • Network: External API calls and database connections.

Platform teams must conduct rigorous load testing to saturate the service and identify exactly which resource fails first.

Implementing custom metrics with Prometheus

Standard CPU metrics are often insufficient for complex operations. Consider a workload execution engine running infrastructure-as-code tasks. The engine itself consumes minimal CPU, but executing concurrent jobs consumes massive memory.

Because memory is the bottleneck, platform teams must scale based on the number of parallel tasks rather than CPU usage. To achieve this, engineers expose a custom metric in the application code and configure Prometheus to scrape it:

YAML|Prometheus ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: workload-engine
  namespace: production
  labels:
    app.kubernetes.io/instance: workload-engine
spec:
  endpoints:
    - interval: 30s
      port: metrics
      scrapeTimeout: 5s
  selector:
    matchLabels:
      app.kubernetes.io/instance: workload-engine

Using the Prometheus Adapter, the HPA can now scale the deployment based on the exact number of running tasks, rather than arbitrary CPU spikes.
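For the HPA to see this metric, the Prometheus Adapter needs a discovery rule mapping the Prometheus series to a pod-level custom metric. A sketch of the adapter's rules configuration, reusing the metric name from the examples here (verify the exact syntax against your adapter version):

YAML|Prometheus Adapter rule
rules:
  - seriesQuery: 'taskmanager_nb_running_tasks{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: { resource: "namespace" }
        pod: { resource: "pod" }
    name:
      matches: "taskmanager_nb_running_tasks"
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'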

YAML|Advanced HPA using custom Prometheus metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: workload-engine
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: workload-engine
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: taskmanager_nb_running_tasks
        target:
          type: AverageValue
          averageValue: 0.5


Eliminating boot delay via overprovisioning

To solve the Day-2 operations problem of node boot delays and image pull times, platform architects use an overprovisioning strategy.

By creating dummy pods with a negative priority class, Kubernetes forces the cloud provider to provision extra nodes in advance.

YAML|PriorityClass for overprovisioning
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: fleet-overprovisioning
value: -1
globalDefault: false

The team deploys lightweight pause containers using this negative priority. When real application pods need to scale up rapidly during a traffic spike, Kubernetes instantly evicts the dummy pods.

This allows the real application to schedule immediately on pre-warmed nodes where the heavy container images are already cached. On AWS, teams can also replace Auto Scaling Groups with Karpenter, which provisions right-sized nodes directly and considerably faster.
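A sketch of the dummy deployment, combining the registry pause image with the negative PriorityClass defined above; the replica count and resource requests are illustrative and should match the headroom you want pre-warmed:

YAML|Overprovisioning placeholder pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fleet-overprovisioning
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fleet-overprovisioning
  template:
    metadata:
      labels:
        app: fleet-overprovisioning
    spec:
      priorityClassName: fleet-overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"      # one vCPU of reserved headroom per replica
              memory: 2Gi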

Standardizing autoscaling fleets

Autoscaling is not magic. It requires precise configuration, custom metrics, and strategic overprovisioning to satisfy strict Kubernetes cost optimization goals. While configuring these YAML files manually is possible for a single deployment, enforcing these FinOps and scaling standards across thousands of clusters requires an Agentic Kubernetes Management Platform.

YAML|.qovery.yml
application:
  workload-engine:
    build_mode: docker
    auto_scaling:
      min_instances: 1
      max_instances: 30
      cpu_threshold: 75

By leveraging an intent-based abstraction like Qovery, platform teams can automate these advanced scaling strategies across the fleet. This ensures high availability and global cost efficiency without the manual toil.

FAQs

What are the three types of Kubernetes autoscaling?

Kubernetes utilizes three primary scaling dimensions: the Horizontal Pod Autoscaler (HPA) changes the number of pod replicas, the Vertical Pod Autoscaler (VPA) adjusts the CPU or memory allocation of existing pods, and the Cluster Autoscaler changes the number of underlying physical or virtual worker nodes.

Why does Kubernetes autoscaling experience boot delays?

Autoscaling is not instantaneous. If current nodes lack the required compute resources, the cluster must request a new virtual machine from the cloud provider, which averages two minutes on AWS EKS. Additionally, pulling multi-gigabyte container images onto the new node adds further delay before the application can serve traffic.

What is Kubernetes overprovisioning?

Overprovisioning is a Day-2 strategy to eliminate node boot delays. Platform engineers deploy dummy pods with a negative priority class. This forces the cluster to boot extra nodes in advance. When real workloads need to scale quickly, the cluster instantly evicts the dummy pods and schedules the real application on the pre-warmed nodes.

About the author
Qovery Team

The engineering, product, and developer experience team behind the Qovery platform.
