Kubernetes management in 2026: mastering Day-2 ops with agentic control

The cluster coming up is the easy part. What catches teams off guard is what happens six months later: certificates expire without a single alert, node pools run at 40% over-provisioned because nobody revisited the initial resource requests, and a manual kubectl patch applied during a 2am incident is now permanent state. Agentic control planes enforce declared state continuously. Monitoring tools just report the problem.
April 23, 2026
Mélanie Dallé
Senior Marketing Manager

Key points

  • Proprietary CRDs are the actual lock-in vector: OpenShift Routes, Rancher-specific controllers, and similar vendor resources are what make migrations painful, not Kubernetes itself. Build on vanilla EKS, GKE, or AKS and you retain full workload portability.
  • Drift happens between audits, not during them: Every manual kubectl change made during an incident becomes permanent state by default. Agentic GitOps enforcement detects and reverts those changes automatically, usually within seconds.
  • FinOps at scale is an automation problem, not a visibility one: Dashboards showing where the waste is do not fix it. Karpenter-driven right-sizing and scheduled non-production fleet hibernation do.

Why Day-1 is the wrong thing to optimise for

Writing Terraform to spin up a Kubernetes cluster feels like progress. The API server responds, pods schedule correctly, and everything looks fine. That feeling lasts about six months.

Day-2 is where Kubernetes actually costs you. Not in a dramatic, obvious way. It is a slow accumulation: one certificate that expires because the renewal reminder landed in a busy sprint, one node pool that nobody right-sized after the initial deploy, one replica count changed manually at 3am that never made it back to Git. None of these feel serious on their own. Together, they compound into the kind of operational debt that produces outages on Friday afternoons.

For teams running a handful of clusters, this is a management problem. For teams running dozens or hundreds, it is a systematic failure waiting to happen. The operational surface area grows faster than the team does, and manual processes do not scale to meet it.

This is the real challenge of Kubernetes management in 2026. Not getting clusters running. Keeping them healthy, compliant, and cost-efficient at fleet scale, without burning out the platform team doing it.

The 1,000-cluster reality

There is a common mistake that surfaces when teams first try to scale their Kubernetes operations: they take whatever bash scripts and manual procedures worked for two clusters and apply them to twenty. It works, barely. Then they try it at fifty and it starts breaking. By the time they hit 100 clusters, the scripts are unmaintainable and the team is spending more time on infrastructure management than on anything that actually moves the product forward.

RBAC synchronisation is a good illustration. Keeping role bindings consistent across two clusters is a weekend project. Keeping them consistent across 1,000 clusters manually is not a weekend project. It is an operational liability.

# What manual RBAC drift looks like when you finally audit it
$ kubectl get clusterrolebindings -o json \
  | jq '.items[] | select(.roleRef.name=="cluster-admin") | {name: .metadata.name, subjects: .subjects}'

# At 1,000 clusters, you are running this query centrally
# or you are not running it at all, which is the more common answer

The platform teams that manage large fleets without burning out have one thing in common: they automated the governance layer early, before scale made it mandatory. Agentic automation keeps operational overhead flat as cluster count grows. Without it, every new cluster added to the fleet adds proportional toil to the team running it.

The shift: from proprietary monoliths to modular freedom

There has been a clear move away from heavy proprietary distributions toward modular, agentic platforms built on vanilla Kubernetes. It is not ideological. It is financial.

The concrete problem with proprietary distributions is the exit cost. If you expose a web service in Red Hat OpenShift, you are typically forced to use their Route CRD instead of standard Kubernetes networking primitives:

kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: frontend-route
spec:
  host: api.internal.corp
  to:
    kind: Service
    name: frontend-service
    weight: 100
  port:
    targetPort: 8080
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

The day your organisation decides to move to standard AWS EKS, every one of those proprietary objects needs to be rewritten as a standard Ingress resource. On a fleet of 10 clusters that is a sprint. On a fleet of 100 it is a multi-quarter project, and finance will want to know why engineering is not shipping features. That is what vendor lock-in actually looks like in practice. Not a philosophical argument about open source. A very expensive migration project that could have been avoided.
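For comparison, here is what the portable equivalent looks like: a standard Ingress resource expressing the same routing and TLS intent as the Route above. This is an illustrative sketch (the nginx ingress class and annotation are assumptions; the exact TLS redirect mechanism depends on which ingress controller you run):

```yaml
# Standard Kubernetes equivalent of the Route above — portable across EKS, GKE, AKS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend-ingress
  annotations:
    # Redirect behaviour is controller-specific; nginx shown as an example
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.internal.corp
      secretName: frontend-tls
  rules:
    - host: api.internal.corp
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 8080
```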

Platforms like Qovery take an intent-based approach. You declare the outcome:

application:
  name: frontend-service
  ports:
    - external_port: 443
      internal_port: 8080
      protocol: HTTP

The platform generates the correct, standard Kubernetes primitives underneath. If you ever leave, your infrastructure stays intact. No rewriting, no migration debt.

For a deeper look at how the leading options compare on this dimension, the 10 best Kubernetes management tools for enterprise fleets breakdown is worth reading before you commit to anything.

The three foundations of cluster excellence

At fleet scale, successful Kubernetes operations depend on getting three things right. Teams that skip any of these tend to find out the hard way, usually during an incident.

1. Security via agentic enforcement

Static RBAC rules reflect your security intent at the moment you wrote them. Clusters change constantly. Engineers add permissions during incidents. Service accounts accumulate privileges over time. The principle of least privilege does not enforce itself, and annual RBAC audits do not catch what happened last Tuesday.

Agentic security enforcement means continuous audit, not periodic review. AI-driven systems that watch live network traffic and log patterns can detect privilege escalation attempts or unexpected lateral movement before they become a breach. For organisations with SOC 2 or HIPAA requirements, this matters because compliance evidence is generated automatically rather than assembled manually before each audit cycle.

# Detect service accounts with cluster-admin across namespaces
$ kubectl get clusterrolebindings -o json | jq '
  .items[] |
  select(.roleRef.name == "cluster-admin") |
  {
    binding: .metadata.name,
    subjects: [.subjects[]? | {kind, name, namespace}]
  }'

# On a single cluster this takes 30 seconds.
# On 200 clusters, you either automate it or you skip it.
# Most teams skip it.

2. Reliability through immutable GitOps

The reliability argument for GitOps is simple: if the desired state lives in a version-controlled repository, every divergence from that state is detectable and reversible. If the desired state lives in someone's memory of what they applied last month, it is not.

The part that actually makes a difference is enforcement. Not just syncing from Git, but actively overwriting manual changes the moment they are detected. An engineer scales a replica set during debugging and forgets to update the manifest. The agentic control plane reverts it within seconds, and the incident appears in the audit log.

# Flux Kustomization with hard enforcement
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production-fleet
  namespace: flux-system
spec:
  interval: 5m
  path: ./clusters/production
  prune: true    # Removes resources that were deleted from Git
  force: true    # Recreates resources when immutable fields drift
  sourceRef:
    kind: GitRepository
    name: fleet-config

The enforcement comes from the combination: the short reconcile interval re-applies the Git state over any manual edit, prune removes whatever Git no longer declares, and force lets the controller recreate resources whose immutable fields have drifted. That configuration is what separates GitOps-as-best-practice from GitOps-as-actual-enforcement. Most teams have the former and think they have the latter.

3. Efficiency and the FinOps evolution

Cloud waste in Kubernetes is not a mystery. Teams over-provision because the cost of under-provisioning, which is a production incident, is much more visible than the cost of over-provisioning, which is a line item on a monthly bill that nobody scrutinises closely enough.

Fixing this at scale means automating the response, not just improving the visibility. Karpenter does the heavy lifting on node right-sizing:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # renamed from WhenUnderutilized in karpenter.sh/v1
    consolidateAfter: 30s

Combine that with scheduled hibernation of non-production environments during off-hours and you are looking at real budget recovery, not marginal optimisation. The teams that have done this properly report 30 to 40 percent reductions in cloud spend on non-production workloads without touching a single production deployment.
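The hibernation arithmetic is worth making explicit. A back-of-the-envelope sketch (the spend figure and schedule are assumptions for illustration, not Qovery data; real savings land lower than the raw hours-off fraction because some non-production workloads cannot sleep):

```python
# Non-production fleet hibernated outside business hours.
HOURS_PER_WEEK = 24 * 7   # 168
BUSINESS_HOURS = 12 * 5   # awake 12h/day, Mon-Fri = 60

def hibernation_savings(weekly_nonprod_spend: float) -> float:
    """Upper bound on weekly spend recovered by sleeping outside business hours."""
    off_fraction = 1 - BUSINESS_HOURS / HOURS_PER_WEEK  # ~64% of hours are off-hours
    return weekly_nonprod_spend * off_fraction

# A hypothetical $10,000/week non-prod bill: up to ~$6,429 recoverable.
print(round(hibernation_savings(10_000)))  # → 6429
```

Discount that ceiling for always-on non-prod services and the 30 to 40 percent figure reported above is plausible rather than optimistic.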

Mastering Day-2 ops: the 4 critical pillars

Day-1 gets clusters running. Day-2 is everything that happens after, which turns out to be most of the work.

1. Zero-downtime lifecycle management

EKS, GKE, and AKS all deprecate minor Kubernetes versions on roughly 14-month cycles. That sounds manageable until you have 50 clusters and realise that in-place upgrades require draining nodes, surviving API deprecations, and hoping nothing breaks mid-upgrade.

Blue/green cluster upgrades are the pattern that actually works at scale. You provision a new cluster at the target version, validate it with a subset of workloads, shift traffic at the load balancer level, and destroy the old cluster once you are confident. The rollback path is a single load balancer change.

# Blue/green upgrade on AWS EKS
# Step 1: Provision green cluster at new version
eksctl create cluster \
  --name production-green \
  --version 1.32 \
  --region eu-west-1 \
  --nodegroup-name standard-workers \
  --node-type m5.xlarge \
  --nodes 3 \
  --managed

# Step 2: Validate core workloads
kubectl --context=green run smoke-test \
  --image=curlimages/curl --restart=Never --rm -it \
  -- curl -sf http://internal-healthcheck/ready

# Step 3: Shift ALB traffic to green target group
# Step 4: Monitor for 24-48 hours, then delete blue
eksctl delete cluster --name production-blue --region eu-west-1

This approach costs slightly more during the transition window. It costs considerably less than a failed in-place upgrade at 11pm.

2. Combating configuration drift

Configuration drift is not caused by careless engineers. It is caused by incidents. When something breaks in production, the fastest fix wins. That fix bypasses the normal PR process, gets applied directly via kubectl, and becomes permanent state because nobody has time to clean it up afterwards.

Agentic self-healing addresses this at the platform level. The control plane continuously compares live cluster state against the Git repository. Any delta, whether from a manual kubectl edit, a Helm override that did not make it back to the chart, or a misconfigured admission webhook, gets detected and reconciled automatically. The incident still gets fixed. It also gets properly recorded and reverted when the approved fix is merged.

This is the only approach that actually works at fleet scale. Manual drift remediation across 100 clusters is not a process. It is a backlog that never gets cleared.
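The reconciliation loop itself is conceptually simple. A toy sketch in plain Python (dicts standing in for Git manifests and live cluster state; real controllers like Flux do this with server-side apply against the API server):

```python
def reconcile(desired: dict, live: dict) -> list[str]:
    """Revert live state to declared state; return an audit trail of actions."""
    audit = []
    # Revert any resource that diverges from Git
    for name, spec in desired.items():
        if live.get(name) != spec:
            live[name] = spec  # overwrite the manual change
            audit.append(f"reverted {name} to declared state")
    # Prune resources that no longer exist in Git
    for name in list(live):
        if name not in desired:
            del live[name]
            audit.append(f"pruned {name} (not in Git)")
    return audit

desired = {"frontend": {"replicas": 3}}
live = {"frontend": {"replicas": 7},        # 3am manual scale-up
        "debug-pod": {"image": "busybox"}}  # leftover from an incident

print(reconcile(desired, live))
# → ['reverted frontend to declared state', 'pruned debug-pod (not in Git)']
print(live)  # → {'frontend': {'replicas': 3}}
```

The audit trail is the point: the incident fix is recorded, not silently absorbed into cluster state.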

3. Advanced observability

Standard monitoring tells you a pod crashed. What you actually need is why it crashed, which upstream service timed out, what the network was doing at the time, and whether it has happened before under similar conditions. These are different questions that require different tooling.

eBPF-based Kubernetes observability via tools like Cilium provides kernel-level visibility into network packets and system calls without adding sidecar proxies to every pod. The performance cost of a full service mesh is real, and on a fleet where you are already watching your cloud bill, adding per-pod overhead to every workload is not a trivial decision.

# Cilium network policy audit — trace dropped packets without sidecars
$ cilium monitor --type drop

# eBPF-based flow visibility — list recently dropped HTTP flows, no application instrumentation needed
$ hubble observe --protocol http --verdict DROPPED --last 100

The difference in troubleshooting speed between having this data and not having it is measured in hours of incident time.

4. Automated trust and secrets

Manual certificate rotation is one of the most preventable causes of production outages, and it keeps happening because the failure mode is invisible right up until it is not. A certificate is issued. A calendar reminder is set. The reminder lands during a sprint planning week. The certificate expires. A service fails.

Automating the full lifecycle via cert-manager removes humans from the rotation loop entirely. Pairing it with the External Secrets Operator to pull credentials from HashiCorp Vault or AWS Secrets Manager at runtime means raw passwords never touch etcd:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: production
spec:
  refreshInterval: "1h"
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: production/database
        property: password
    - secretKey: DB_USERNAME
      remoteRef:
        key: production/database
        property: username

Every secret access is logged by Secrets Manager. Rotation is verifiable from the ExternalSecret refresh timestamp. When an auditor asks for evidence of secret rotation practices, you can produce it in minutes rather than spending a week assembling spreadsheets.
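The cert-manager half of that loop is equally declarative. A minimal sketch (the ClusterIssuer name is an assumption; the durations show the pattern of renewing well before expiry):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-internal-tls
  namespace: production
spec:
  secretName: api-internal-tls   # where the signed certificate lands
  duration: 2160h                # 90-day certificate
  renewBefore: 360h              # renewed 15 days before expiry, automatically
  dnsNames:
    - api.internal.corp
  issuerRef:
    name: letsencrypt-prod       # assumed ClusterIssuer
    kind: ClusterIssuer
```

No calendar reminder, no sprint-planning collision, no expired certificate.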


The Qovery advantage: enterprise power, zero weight

Qovery unifies provisioning, security, and FinOps into a single agentic control plane. The pitch is straightforward: enterprise-grade fleet management without the operational weight of building and maintaining that control plane yourself.

The three components that matter for Day-2 operations specifically:

  • AI optimize agent identifies workloads suitable for Spot instances based on historical usage patterns and automatically right-sizes resource requests. It acts on the data rather than presenting it in a dashboard and expecting your team to find time to respond.
  • AI secure agent interprets audit logs continuously and surfaces security posture adjustments in real time. Compliance evidence is generated as a byproduct of normal operations, not assembled manually the week before an audit.
  • Zero lock-in is not a marketing claim here. Qovery manages vanilla Kubernetes. The clusters, node groups, and VPCs are yours. If you leave, nothing breaks. Compare that to the migration cost of unwinding a proprietary distribution and the value proposition is concrete.

The intent-based abstraction means platform engineers define outcomes in simple configuration files. The platform generates correct, standard Kubernetes manifests underneath. No proprietary CRDs accumulating in your clusters.

🚀 Real-world proof

Alan, the French digital health unicorn, was running 50+ Elastic Beanstalk environments with deployments exceeding an hour and a full-time engineer dedicated solely to keeping the platform operational.

The result: Deployment time dropped from 55 minutes to 8 minutes, the dedicated infrastructure FTE was freed entirely, and the team now manages 100+ services with developers deploying independently. Read the Alan case study.

Conclusion: turning infrastructure into a strategic asset

The operational weight of Kubernetes at fleet scale is not a technology problem. The technology exists. Cert-manager handles certificate rotation. Karpenter handles node right-sizing. Flux or Argo CD handle GitOps enforcement. eBPF handles observability without sidecar overhead.

The problem is that assembling and maintaining all of these components yourself, across a growing fleet, while also shipping product features, is not a realistic allocation of engineering time. Something gives. Usually it is the maintenance work, quietly, until it produces an incident.

Agentic Kubernetes management platforms handle the assembly and ongoing operation of that control plane. Your platform team defines the policies. The platform enforces them. Engineering time goes toward the work that actually differentiates the business.

That is the argument. Not that Kubernetes is too hard, but that running it well at scale is a full-time job that your product engineers should not be doing.

FAQs

What is the difference between Kubernetes orchestration and Kubernetes management?

Orchestration is what Kubernetes itself does: scheduling containers onto nodes, maintaining declared replica counts, restarting failed pods. Management is the operational layer above that. It covers version upgrades across a fleet, security patching, cost allocation, RBAC governance, certificate lifecycle, and multi-cloud visibility. Orchestration keeps your workloads running. Management keeps the entire platform healthy, auditable, and cost-efficient over time. Most teams have the former and underestimate the latter until the fleet grows past the point where manual processes can keep up.

How do AI agents improve Kubernetes Day-2 operations?

They replace reactive monitoring with proactive remediation. Traditional monitoring tells you something went wrong. An AI agent detects the conditions that precede failures, such as memory pressure building over hours, a certificate within days of expiry, or a replica count drifting from its declared state, and applies a fix before an outage occurs. The difference is that agents act on the data. At fleet scale, the volume of signals coming off hundreds of clusters is too high for humans to process in real time. Agents are not a convenience at that scale. They are the only viable operating model.
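The certificate example makes the pattern concrete. A minimal sketch of the kind of pre-failure check an agent runs continuously (the 14-day threshold is an assumed policy, not a standard):

```python
from datetime import datetime, timedelta, timezone

def expiring_soon(not_after: datetime, threshold_days: int = 14) -> bool:
    """Flag a certificate whose notAfter falls inside the renewal window."""
    return not_after - datetime.now(timezone.utc) < timedelta(days=threshold_days)

# A certificate five days from expiry triggers remediation, not an alert
cert_expiry = datetime.now(timezone.utc) + timedelta(days=5)
if expiring_soon(cert_expiry):
    print("renewing certificate before it expires")
```

The same shape — measure a leading indicator, act before the threshold — applies to memory pressure and replica drift.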

Why does vanilla Kubernetes matter more at enterprise fleet scale?

At small scale, proprietary Kubernetes distributions are a manageable tradeoff: some vendor lock-in in exchange for a polished management experience. At fleet scale, the calculus changes. Proprietary CRDs accumulate across hundreds of clusters. When the organisation eventually wants to migrate, whether to cut licensing costs, change cloud providers, or consolidate tooling, every one of those proprietary resources needs to be rewritten. The engineering cost of that migration grows linearly with fleet size. Building on standard EKS, GKE, or AKS from the start eliminates that future liability entirely.


