
How to reduce AI infrastructure costs with Kubernetes GPU partitioning

Stop wasting expensive AI compute. Learn how to reduce infrastructure costs using Kubernetes GPU partitioning (NVIDIA MIG) and automated scheduling.
Mélanie Dallé
Senior Marketing Manager

Key points:

  • The GPU Waste Problem: Kubernetes natively assigns whole physical GPUs to single pods. For lightweight workloads like AI inference, this results in massive financial waste since you pay for 100% of the GPU but only utilize a fraction of it.
  • The Orchestration Barrier: While NVIDIA’s Multi-Instance GPU (MIG) technology hardware-partitions GPUs, teaching Kubernetes to recognize and schedule these slices requires a complex, fragile stack of DaemonSets, node labels, taints, and affinity rules.
  • Automated Abstraction: As a comprehensive Kubernetes management platform, Qovery bridges this gap by translating complex K8s YAML and MIG profiles into a simple developer interface. This maximizes GPU density and ROI without burdening platform teams with endless configuration toil.

Kubernetes was built for CPU and memory, assuming resources are easily divisible. GPUs break this model entirely. When a pod requests a GPU, Kubernetes natively assigns the entire physical card to that container, even if your AI inference workload only uses 15% of it.

This architectural mismatch creates a compounding financial drain:

  • Massive waste: You pay for 100% of an A100 card, driving effective compute costs to $15-$25 per hour for low-utilization tasks.
  • The hardware is ready: NVIDIA’s Multi-Instance GPU (MIG) technology solved physical partitioning years ago.
  • The software is lacking: Kubernetes does not natively understand GPU sub-allocation.

This guide explores how to teach Kubernetes to schedule and isolate GPU slices, the operational toll this takes on platform engineers, and how modern automation makes high-density GPU scheduling effortless.

Teaching Kubernetes to Count

Reducing GPU waste requires work at two distinct layers: the hardware layer, where physical partitioning creates the slices, and the orchestration layer, where Kubernetes learns to schedule containers onto those slices. NVIDIA has largely solved the hardware side; the orchestration side is where the real work begins.

Hardware Creates the Slices

NVIDIA's Multi-Instance GPU technology, available on A100, H100, H200, and B200 GPUs, partitions a single physical card into up to seven isolated instances. Each MIG slice receives dedicated streaming multiprocessors, memory, and cache. The isolation is enforced at the hardware level, meaning a workload running on one slice cannot access the memory or compute of another.

MIG profiles follow the format `[compute]g.[memory]gb`. An A100 with 40GB of memory can be partitioned into seven `1g.5gb` slices (each with 1/7th of compute and 5GB of memory) or two `3g.20gb` slices (each with 3/7ths of compute and 20GB of memory). The partitioning itself is configured at the node level using `nvidia-smi` or the `nvidia-mig-parted` tool for declarative configuration across a fleet.
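As a sketch, a declarative `nvidia-mig-parted` configuration that offers both a seven-slice inference layout and a two-slice training layout for every GPU on a node could look like the following (profile names and counts are illustrative; supported combinations depend on your GPU model and driver version):

```yaml
# config.yaml for nvidia-mig-parted (illustrative sketch)
version: v1
mig-configs:
  # Split every GPU on the node into seven small inference slices
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7   # seven slices per A100 40GB
  # Alternative layout: two larger partitions for training jobs
  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2  # two slices per A100 40GB
```

The chosen layout is then applied per node with a command along the lines of `nvidia-mig-parted apply -f config.yaml -c all-1g.5gb`.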

Kubernetes Management Makes Them Usable

Three components must be installed, configured, and maintained to bridge the gap between hardware slices and schedulable Kubernetes resources.

The NVIDIA device plugin runs as a DaemonSet on every GPU node. When configured for MIG, it discovers the available slices and registers them with the kubelet as extended resources. Instead of advertising a single `nvidia.com/gpu`, the node exposes resources like `nvidia.com/mig-1g.5gb: 7`, telling the scheduler that seven small GPU slices are available. 
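Once the device plugin is running, those slices appear in the node's allocatable resources. A trimmed, illustrative excerpt of what `kubectl get node <name> -o yaml` might report for a MIG-enabled A100 node:

```yaml
# Illustrative excerpt of a MIG-enabled node object (trimmed)
status:
  allocatable:
    cpu: "64"
    memory: 512Gi
    nvidia.com/mig-1g.5gb: "7"   # seven schedulable GPU slices instead of one whole GPU
```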

NVIDIA’s GPU Feature Discovery, another DaemonSet, automatically generates labels for each node based on its GPU hardware and MIG configuration. These labels allow the scheduler to differentiate between nodes with different partition profiles. Without accurate labels, the scheduler cannot distinguish a node offering seven small inference slices from one offering two large training partitions.
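The exact label set depends on the GPU Feature Discovery version and MIG strategy in use, but on a node partitioned into `1g.5gb` slices it looks roughly like this illustrative excerpt:

```yaml
# Illustrative node labels generated by GPU Feature Discovery (exact keys vary by version and strategy)
metadata:
  labels:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    nvidia.com/mig.capable: "true"
    nvidia.com/mig-1g.5gb.count: "7"   # advertises how many slices of this profile the node offers
```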

Platform engineers must install these DaemonSets, keep them versioned alongside the GPU driver, and ensure they restart correctly when nodes cycle.

Operators also need to configure taints on MIG nodes so that only GPU-aware workloads land there. Tolerations on pod specs allow specific containers to schedule on tainted nodes. Node affinity rules can direct heavy training jobs to full GPUs while routing inference workloads to MIG slices.
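As a sketch, a dedicated MIG node might carry a taint like the one below; the key is a team convention rather than anything Kubernetes or NVIDIA mandates:

```yaml
# Illustrative taint on a MIG-enabled node; only pods that tolerate it can schedule there
spec:
  taints:
    - key: nvidia.com/gpu
      value: "present"
      effect: NoSchedule
```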

The Management Toil

Each of these components requires independent maintenance. Changing a MIG profile requires draining the node, reconfiguring the GPU, and rescheduling workloads. The device plugin and GPU Feature Discovery DaemonSets must be updated when NVIDIA releases new driver versions. Node labels and taints must stay consistent across the fleet as nodes are added, removed, or replaced.

The YAML that developers must write to target a specific MIG profile is also nontrivial. A pod requesting a `1g.5gb` slice needs the correct resource limit, a toleration for the GPU node taint, and a node affinity rule to ensure it lands on a MIG-enabled node.
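Below is a hedged sketch of such a pod spec, assuming the taint and labels from the earlier examples (the image name is hypothetical, and the exact label and taint keys vary with your cluster setup):

```yaml
# Illustrative pod requesting a single 1g.5gb MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
    - name: model
      image: my-registry/inference:latest    # hypothetical image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1           # request one hardware slice, not a whole GPU
  tolerations:
    - key: nvidia.com/gpu                    # matches the node taint shown earlier
      operator: Equal
      value: "present"
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/mig.capable  # label from GPU Feature Discovery
                operator: In
                values: ["true"]
```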

This configuration is error-prone and requires knowledge that most application developers do not have. Every deployment that targets a GPU slice needs precise configuration, and a mistake in any field can result in failed scheduling or misplaced workloads.

Struggling with complex GPU scheduling YAML?

See how Qovery abstracts Kubernetes GPU partitioning so developers can request hardware slices with zero K8s configuration toil.

Container Management Strategies for Density

With GPU partitioning in place, three operational strategies determine how effectively your cluster utilizes its hardware:

  • Bin Packing: The default K8s scheduler spreads pods out, which wastes expensive GPU nodes. Bin packing reverses this: it packs as many workloads as possible onto a single GPU node (e.g., 21 inference services on just three A100s) before allocating a new one.
  • Hardware Isolation: MIG provides hardware-level memory and compute isolation between slices. A crash or memory leak in slice 0 cannot destabilize a neighboring container in slice 1, eliminating noisy neighbor problems in multi-tenant clusters.
  • Resource Quotas: To prevent resource hoarding, Kubernetes ResourceQuota objects cap the total number of MIG slices a specific namespace (e.g., ML inference vs. training) can request, ensuring fair access across teams (see the sketch after this list).
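As a sketch, a ResourceQuota capping a hypothetical `ml-inference` namespace at fourteen `1g.5gb` slices (two A100s' worth) could look like this:

```yaml
# Illustrative quota limiting MIG slice consumption in one namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-slice-quota
  namespace: ml-inference
spec:
  hard:
    requests.nvidia.com/mig-1g.5gb: "14"   # at most 14 small slices across all pods in the namespace
```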

Qovery: The Scheduler You Do Not Have to Write

The architecture described above works, but it demands significant setup and ongoing maintenance from platform operators. For most mid-size teams, that configuration toil is the primary barrier to adopting GPU partitioning.

Qovery is a Kubernetes management tool that translates hardware capabilities into developer experience. For GPU workloads, this means converting the complexity of MIG profiles, node affinity, and tolerations into a resource selection that developers can use without Kubernetes scheduling expertise.

Instead of writing the YAML block above to request a `mig-1g.5gb` slice with the correct tolerations and affinity rules, developers select a simplified resource class through Qovery's interface. Qovery automatically injects the correct `nodeAffinity`, `tolerations`, and resource limits into the Kubernetes pod spec, ensuring the container lands on the right partition every time.

Developers describe what their workload needs, and Qovery generates the scheduling configuration that Kubernetes requires. The device plugin discovery, node labeling, and taint management still happen at the cluster level, but developers never interact with these layers directly. The platform abstracts the complexity while preserving the underlying infrastructure control that platform teams need.

Qovery also handles the Day 2 operations that make GPU management sustainable at scale. Infrastructure provisioning through the platform includes automated cluster configuration and environment management, so teams can deploy GPU workloads to ephemeral environments for testing without manually configuring MIG profiles on test clusters. 

For organizations managing Kubernetes across multiple environments, the platform becomes an interface for operators to trust and developers to adopt. Qovery handles the ongoing orchestration between application deployments and GPU hardware. Teams get maximum density from their GPU investment without writing or maintaining the scheduling logic themselves.

Conclusion

GPU partitioning with NVIDIA MIG creates the potential for massive infrastructure savings, reducing per-workload GPU costs by up to 85%. But the barrier isn't the hardware; it's the complex Kubernetes orchestration required to expose partitions, enforce isolation, and maintain quotas across a production cluster.

Qovery provides the Kubernetes management platform that makes GPU partitioning practical. By abstracting complex scheduling configurations into simple resource selections, Qovery injects the correct K8s manifests automatically and enforces governance. Your organization gets maximum GPU density with zero scheduling complexity, allowing developers to stay focused on building AI applications.
