Everything I Wanted To Know About Kubernetes Autoscaling



Key Points
- Three Dimensions of Scaling: Effective autoscaling requires balancing Horizontal (Pod count), Vertical (Resource size), and Cluster (Node count) scaling simultaneously to avoid bottlenecks.
- The CPU vs. Memory Split: Scaling horizontally on CPU is straightforward, but memory-intensive applications often require an architectural redesign to distribute their workload efficiently.
- Boot Delay Realities: Autoscaling is not instantaneous; you must account for Node boot times (avg. 2 mins on AWS) and Image pull times (which can take minutes for multi-GB images).
Kubernetes is today the most well-known container scheduler, used by thousands of companies. Quickly and automatically scaling your application is standard nowadays; knowing how to do it well is another matter.
In this article, we'll cover how pod autoscaling works, how it can be used, and a specific Qovery internal use case.
There are three kinds of scaling:
- Horizontal scaling: Your application can have multiple instances. As soon as you need to support more workload, new instances will pop up to handle it. Scaling ends when the limit you've set has been reached or when no more nodes can be used to support your workload.
- Vertical scaling: Your application cannot run in parallel, so scaling means giving the existing instance more resources. You hit a wall when you reach the physical machine's limits. Combining multiple instances with vertical scaling is possible but rare.
- Multi-dimensional scaling: Less frequent, it combines horizontal and vertical scaling simultaneously. It's also more complex to manage because defining when to scale horizontally or vertically depends on many parameters.
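In Kubernetes terms, horizontal scaling maps to the built-in HorizontalPodAutoscaler, while vertical scaling is typically handled by the Vertical Pod Autoscaler addon. As a hedged sketch (the workload name and memory bounds are illustrative, and VPA must be installed separately in your cluster):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app            # hypothetical workload name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"    # VPA evicts pods and recreates them with new resource requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 256Mi
        maxAllowed:
          memory: 4Gi
```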
Know your application limits and bottleneck
Applying autoscaling on the CPU doesn't always work because your application's limits may not be (only) on the CPU side. First, identify where your application's limits are:
- CPU: Calculation, compression, map reduce...
- Memory: Cache data, store to then compute...
- Disk: Store big data which can't fit into memory, flush on disk...
- Network: Request external data (database, images, videos...), API calls...
If you have written your application, you should know where the bottleneck will happen at first. If you don't, you will have to load test your application to find resources where the contention will occur. For an API REST application, you can use existing tools to perform HTTP calls to try to saturate the service and see on which resource your application struggles.
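As a minimal sketch of such a load test (pure Python stdlib; the URL, request count, and concurrency are placeholders, and a real campaign would use a dedicated tool), you can saturate an HTTP endpoint and collect latency percentiles like this:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def load_test(url, requests=100, concurrency=10):
    """Fire `requests` GET requests with `concurrency` workers; return latency stats."""
    def hit(_):
        start = time.perf_counter()
        with urlopen(url) as resp:
            resp.read()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(hit, range(requests)))
    return {
        "p50": latencies[len(latencies) // 2],
        "p99": latencies[min(int(len(latencies) * 0.99), len(latencies) - 1)],
        "max": latencies[-1],
    }
```

Run it against a production-like environment at increasing concurrency until the tail latency degrades; the resource that saturates first (CPU, memory, disk, or network) is the one your autoscaling metric should track.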
It's essential to load test in the same conditions as production—thanks to Qovery, cloning an environment instantly is easy! Once you have results, consider these rules for autoscaling:
- CPU: Scaling horizontally on CPU is generally one of the easiest ways to scale.
- Memory: Memory can only scale vertically by design. If your application is memory-bound, re-architecting it to distribute work across several instances is the way to scale.
- Disk: Local disk is the most performant, but storing on a shared drive is preferable if you care less about performance and more about data availability across nodes.
- Network: Scaling horizontally is common, but defining the metric (connection number, latency, throughput) may not be easy.
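For the common CPU case, the built-in HorizontalPodAutoscaler is enough; a minimal sketch (the deployment name, replica bounds, and 70% threshold are illustrative, not prescriptive):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU usage exceeds 70%
```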
Qovery use case
At Qovery, take the Qovery Engine as an example. The engine itself doesn't consume much CPU or memory, but when it runs Terraform, each Terraform process can consume 500+ MB. To size pod instances correctly, we limit parallel Terraform runs within a single instance to avoid Out Of Memory (OOM) issues.
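Bounding parallelism inside one instance can be as simple as a semaphore around the expensive subprocess. A hedged Python sketch (the real Engine is written in Rust, and `run_terraform` is a stand-in for the actual invocation):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_RUNS = 4  # sized so peak memory (runs x ~500 MB) stays under the pod limit
run_slots = threading.BoundedSemaphore(MAX_PARALLEL_RUNS)

def run_terraform(workspace: str) -> str:
    """Placeholder for the memory-hungry subprocess (e.g. `terraform apply`)."""
    return f"applied {workspace}"

def run_bounded(workspace: str) -> str:
    # Blocks while MAX_PARALLEL_RUNS jobs are already in flight,
    # capping the worst-case memory of this instance.
    with run_slots:
        return run_terraform(workspace)

with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(run_bounded, [f"ws-{i}" for i in range(10)]))
```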
Because memory is the bottleneck, we distribute the workload horizontally across several Engines. We created a custom metric based on the number of requests an Engine executes in parallel: if an Engine is busy performing tasks, we scale up to handle new ones.
We implemented a metric for this in the Engine application:

```rust
lazy_static! {
    // Gauge exposed to Prometheus: how many tasks this Engine is running right now
    static ref METRICS_NB_RUNNING_TASKS: IntGauge = register_int_gauge!(
        "taskmanager_nb_running_tasks",
        "Number of tasks currently running"
    )
    .unwrap();
}
```

Prometheus is configured to scrape Qovery metrics every 30s:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: qovery-engine
  name: qovery-engine
  namespace: qovery-prod
spec:
  endpoints:
    - interval: 30s
      port: metrics
      scrapeTimeout: 5s
  namespaceSelector:
    matchNames:
      - qovery-prod
  selector:
    matchLabels:
      app.io/instance: qovery-engine
```

Then, with the Prometheus Adapter, we act on the Pod autoscaler:
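For the autoscaler to see `taskmanager_nb_running_tasks`, the adapter needs a rule translating the scraped series into the custom metrics API. A hypothetical rule (prometheus-adapter Helm values; the averaging window is an assumption):

```yaml
rules:
  custom:
    - seriesQuery: 'taskmanager_nb_running_tasks{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: { resource: "namespace" }
          pod: { resource: "pod" }
      name:
        matches: "taskmanager_nb_running_tasks"
        as: "taskmanager_nb_running_tasks"
      metricsQuery: 'sum(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The HorizontalPodAutoscaler below then targets this metric by name.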
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: qovery-engine
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qovery-engine
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: taskmanager_nb_running_tasks
        target:
          type: AverageValue
          averageValue: 0.5
```

Enhance autoscaling pod boot time
Autoscaling pods can take some time for several reasons:
- Boot node: If resources are full, Kubernetes creates a new node (average 2 min on AWS).
- Boot pod (pull image): Pulling multi-GB images can take minutes.
- Application boot delay: Varying by application (e.g., JVM-based).
We eliminate points 1 and 2 with overprovisioning pods: low-priority placeholder pods that keep spare nodes and pre-pulled images in reserve. We use a PriorityClass with a value of -1:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: qovery-engine-overprovisioning
value: -1
globalDefault: false
```

The deployment for these preemptible pods ensures real Engine pods can replace them instantly, because the images are already pulled and the node resources are already allocated:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qovery-engine-overprovisioning
spec:
  replicas: {{ .Values.overprovisioning.replicas }}
  template:
    spec:
      priorityClassName: qovery-engine-overprovisioning
      containers:
        - name: qovery-engine-overprovisioning
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["/bin/sh", "-c", "tail -f /dev/null"]
```

Conclusion
Autoscaling is not magic. Kubernetes helps, but the most important thing is knowing your application's limits and bottlenecks. Taking the time to test, validate, and regularly load test your app is crucial to success.
