Kubernetes observability at scale: how to cut APM costs without losing visibility

The instinct when setting up Kubernetes observability is to instrument everything and send it all to your APM vendor. That works fine at ten nodes. At a hundred, the bill becomes a board-level conversation. The less obvious problem is the fix most teams reach for: aggressive sampling. That is how intermittent failures affecting 1% of requests disappear from your monitoring entirely.
April 27, 2026
Mélanie Dallé
Senior Marketing Manager

Key points:

  • APM pricing is structurally designed to grow with you: Per-host fees, per-GB ingestion, custom metric charges, and egress costs all compound in ways that are genuinely hard to forecast until the first oversized invoice arrives.
  • Sampling to control costs creates monitoring blind spots: Reducing log verbosity to manage APM spend is the observability equivalent of turning off smoke detectors to save money on batteries.
  • The SRE bottleneck is an architecture problem, not a headcount one: When only infrastructure specialists can read your dashboards, every production incident routes through the same two people.

The observability bill nobody budgeted for

There is a predictable arc that plays out at growing SaaS companies. The team adopts a commercial APM platform early because it is fast to set up and the dashboards look good in the demo. For the first year it is fine. Then the cluster grows. More nodes, more services, more logs.

At some point, someone pulls the cloud bill and notices that observability spend is approaching, or in some cases exceeding, core infrastructure spend. That is not an edge case. It is the consequence of how commercial APM pricing is designed.

Datadog, for example, charges $15 per host for infrastructure monitoring and $31 per host for APM, with each Kubernetes node counting as a separate host. Log management is billed per GB ingested, with additional charges for retention beyond 15 days. Custom metrics — the ones that actually capture application-specific behaviour — are priced per unique time series and frequently produce the largest line items on the bill.

Then there are egress fees. Every metric, log line, and trace span that leaves your cluster for an external SaaS platform incurs cloud provider egress charges. For a high-throughput environment generating terabytes of telemetry monthly, egress alone can add tens of thousands of dollars per year. That money is not buying you better observability. It is just the cost of moving data from your infrastructure to someone else's.
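
The arithmetic is easy to sanity-check. The volumes below are illustrative, and egress rates vary by provider, region, and committed-use discounts; the $0.09/GB figure is AWS's common internet egress rate.

# Illustrative egress cost for telemetry leaving the cluster
#    5 TB/month:  5,120 GB x $0.09/GB = ~$460/month  = ~$5,500/year
#   10 TB/month: 10,240 GB x $0.09/GB = ~$920/month  = ~$11,000/year
# Cross-region or cross-cloud transfer adds its own per-GB charges on top.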

The fleet-scale problem: why this gets worse, not better

For a single cluster, observability cost is an annoyance. At fleet scale, it becomes a structural budget problem. The multi-cloud version is worse still.

Organisations running Kubernetes across AWS, Azure, and GCP inherit three separate observability ecosystems. CloudWatch, Azure Monitor, and GCP Cloud Operations each have their own query language, alerting syntax, retention policies, and pricing model. When a service degrades and the team does not immediately know which cloud or region is affected, the diagnosis requires expertise across all three platforms simultaneously.

The gap between what each provider monitors is where the real incidents live. A network issue between AWS and GCP may not appear in either provider's native monitoring because neither has visibility into the other side of the connection. These blind spots get discovered reactively, through customer complaints, not through the monitoring stack that was supposed to catch them.
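
One way to close that gap is to probe the other side of the link yourself, from your own monitoring stack, instead of relying on either provider's view. A hedged sketch using Prometheus's blackbox_exporter; the target URL and exporter address are illustrative:

# Prometheus scrape config: probe a GCP endpoint from the AWS cluster
scrape_configs:
  - job_name: cross-cloud-probe
    metrics_path: /probe
    params:
      module: [http_2xx]               # probe module: expect an HTTP 2xx
    static_configs:
      - targets:
          - https://internal-api.gcp.example.com/healthz   # GCP side of the link
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # pass the target URL to the exporter
      - source_labels: [__param_target]
        target_label: instance         # keep the real target as the instance label
      - target_label: __address__
        replacement: blackbox-exporter.monitoring:9115     # the exporter does the probing

Alerting on probe_success from both directions catches exactly the class of cross-cloud failure that neither provider's native monitoring can see.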

On a fleet of 50 clusters spread across three clouds, an engineer who knows all three providers' observability systems well enough to correlate a cross-cloud incident is a rare resource. If your incident response process depends on finding that person at 2am, you have a reliability problem dressed up as an observability problem.

The cost trap teams do not see coming

The response most teams reach for when APM costs get out of control is to reduce the data volume: lower log verbosity, sample traces more aggressively, cut custom metric cardinality. It feels like the responsible thing to do.

It is also how you create monitoring blind spots.

An intermittent failure affecting 1% of requests, seen through a 10% trace sample, surfaces in your traces roughly once per 1,000 requests. Depending on traffic volume, that could mean hours of a degraded user experience before your monitoring surfaces anything actionable.
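
The arithmetic is worth spelling out; the request rates are illustrative:

# Expected traced failures at a 1% failure rate and 10% head sampling
#   100 req/s: 100 * 3600 * 0.01 * 0.10 = 360 traced failures/hour
#     1 req/s:   1 * 3600 * 0.01 * 0.10 = ~3.6 traced failures/hour
# At low traffic, an hour of degradation yields a handful of traces,
# not enough signal to separate a real regression from noise.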

# What aggressive sampling looks like in an OpenTelemetry collector config
processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 10   # You are now blind to 90% of your traces

# Note: probabilistic sampling is head-based. The keep/drop decision is
# hashed from the trace ID before the request completes, so errors and
# slow requests are dropped at exactly the same rate as healthy traffic.
# A rare failure gets no preferential treatment; it simply vanishes.
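
If sampling is genuinely unavoidable, tail-based sampling at least makes the decision after the outcome is known, so every error trace can be kept. A minimal sketch using the OpenTelemetry collector's tail_sampling processor; the policy names are illustrative:

processors:
  tail_sampling:
    decision_wait: 10s                 # buffer spans until the whole trace is in
    policies:
      - name: keep-all-errors          # any trace with an error status is kept
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-healthy-traffic   # everything else sampled at 10%
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

The tradeoff is collector memory: spans have to be buffered for the decision window, which is why head-based sampling is the cheap default.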

The perverse incentive built into per-GB APM pricing is that the data most valuable for diagnosing production failures, including high-cardinality logs, full trace spans, and raw event streams, is also the most expensive to retain. Teams end up optimising for the bill rather than for reliability.

In-cluster observability: fixing the economics at the source

The cost structure of external APM hinges on one architectural decision: exporting telemetry data outside the cluster. Everything that follows, from egress fees and ingestion charges to vendor retention limits, is a consequence of that single choice.

The alternative is to keep observability data inside the cluster, stored in your own cloud account, queried through open-source tooling with no per-GB pricing.

A standard in-cluster stack built on kube-prometheus-stack gives you Prometheus for metrics collection, Loki for log aggregation, and Grafana for visualisation. Storage costs are determined purely by your cloud provider's object storage rates. On AWS S3 Standard, that is approximately $0.023 per GB per month. On a high-volume environment generating 500 GB of compressed telemetry monthly, that is roughly $11.50. The equivalent data volume through an external APM ingestion pipeline costs orders of magnitude more.

# Prometheus ServiceMonitor: scraping application metrics in-cluster
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-service-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: checkout-service
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production

No data leaves the cluster. No egress fees. No per-GB ingestion charge. The ServiceMonitor tells Prometheus where to scrape, and the data stays inside your VPC.

For logs, Loki uses the same label-based model as Prometheus and stores compressed chunks directly in object storage. A LogQL query to surface all error-level events from a specific service looks like this:

{namespace="production", app="checkout-service"}
  | json
  | level="error"
  | line_format "{{.timestamp}} {{.msg}} {{.error}}"

Run that query in Grafana against your in-cluster Loki instance. No API key, no ingestion quota, no retention cliff at 15 days unless you set one yourself.
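
When you do want a retention policy, it is a few lines of configuration rather than a pricing tier. A minimal sketch, assuming compactor-based retention on a recent Loki release; the 90-day figure is illustrative:

limits_config:
  retention_period: 2160h      # 90 days, set by you rather than a vendor tier

compactor:
  retention_enabled: true      # the compactor enforces deletion past retention
  delete_request_store: s3     # required when retention is enabled on recent releases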

Day 2 Operations & Scaling Checklist

Is Kubernetes a bottleneck? Audit your Day 2 readiness and get a direct roadmap to transition to a mature, scalable Platform Engineering model.

Kubernetes Day 2 Operations & Scaling Checklist

The SRE bottleneck: an architecture problem dressed as a staffing one

Here is the observability problem that does not show up on the cloud bill but costs more than the egress fees: every time a developer cannot diagnose their own production incident, an SRE has to do it for them.

The reason developers cannot self-serve is not that they are less capable. It is that commercial APMs are built for infrastructure operators. They present raw Prometheus metrics, network topologies, and trace waterfalls in formats that require deep operational expertise to interpret. A developer whose service is throwing 500s in production stares at a PromQL query they did not write and escalates to the on-call SRE.

That SRE then spends 45 minutes correlating application logs with infrastructure metrics across disconnected dashboards, identifying the root cause, and communicating the findings back. Those 45 minutes repeat across every incident, for every service, indefinitely. The SRE is not doing capacity planning or reliability engineering. They are running a help desk.

# What an SRE actually runs when a developer reports "my pod keeps restarting"

# Step 1: check recent events on the pod
kubectl describe pod checkout-7d4f9b-xkp2m -n production | grep -A 20 Events

# Step 2: check for OOMKill events specifically
kubectl get events -n production \
  --field-selector reason=OOMKilling \
  --sort-by='.lastTimestamp' | tail -20

# Step 3: compare actual memory usage against defined limits
kubectl top pod -n production --containers | grep checkout

# Step 4: see what limits are actually set in the deployment
kubectl get deployment checkout -n production \
  -o jsonpath='{.spec.template.spec.containers[*].resources}' | jq .

# A developer with access to a platform that surfaces this automatically
# resolves this without touching the SRE at all.

Platforms that translate infrastructure signals into plain language change this dynamic. When a pod OOMKills, the developer sees what happened, which service was affected, and the memory trend leading up to the event, rather than a raw Kubernetes event they cannot interpret. They resolve it themselves, or they escalate with enough context that the SRE can fix it in five minutes rather than forty-five.

Qovery Observe: built-in, not bolted on

The reason observability correlation is painful in most toolchains is that the tools were built independently and integrated after the fact. Deployment data lives in your CI/CD system. Metrics live in the APM. Logs live somewhere else. Correlating a spike in error rate with the deployment that caused it requires manually matching timestamps across three systems.

Qovery Observe sidesteps this because observability is provisioned as part of the cluster, not added as a separate integration. Each managed cluster runs its own stack: Prometheus for metrics, Loki for logs. Data never leaves the organisation's cloud account.

  • Zero egress and ingestion fees. There are no per-GB ingestion charges because data does not leave the cluster. No egress fees because telemetry does not transit the public internet to reach a vendor's ingestion endpoint.
  • No proprietary lock-in. The stack is built on Prometheus and Loki. Teams with existing Grafana dashboards connect them to Qovery's in-cluster data stores without modification. If you have already invested in dashboards for your Kubernetes monitoring setup, they continue working.
  • Automated correlation. During an incident, deployment history, environment configuration, logs, and metrics appear in the same interface. A spike in error rate gets correlated with the deployment that preceded it automatically, because both systems share the same data model. No manual timestamp matching required.
  • Extensible by default. Teams that need OpenTelemetry instrumentation or custom exporters work against standard interfaces:
# OpenTelemetry collector sending to in-cluster endpoints
# rather than external vendor ingestion URLs
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheusremotewrite:
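    # assumes Prometheus runs with --web.enable-remote-write-receiver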
    endpoint: http://prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write
  loki:
    endpoint: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
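  # NOTE: recent collector-contrib releases deprecate this loki exporter;
  # the replacement is otlphttp pointed at Loki's native OTLP endpoint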

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [loki]

Telemetry stays inside the cluster. The collector endpoints point to in-cluster services, not external ingestion URLs. The billing impact is purely the object storage cost of what Prometheus and Loki write to your S3 or GCS bucket.

The AI Copilot: making observability data usable

The gap between having observability data and being able to act on it quickly is where most incident time gets lost. The data exists. Somebody just has to know how to query it.

Qovery's AI DevOps Copilot addresses this directly. A developer can ask in plain language: "Why is my checkout service slow?" or "Show me any services in production showing degraded performance compared to last week." The Copilot correlates metrics, logs, and deployment events to surface root causes and suggest remediation steps.

Beyond active debugging, passive monitoring runs continuously across the platform. Performance regressions and resource anomalies get flagged before they reach the threshold of a customer-facing incident. This is what actually moves the needle on Mean Time to Resolution. Not better dashboards. Fewer incidents requiring full triage in the first place.

🚀 Real-world proof

Julaya, a B2B fintech operating across West Africa, needed to cut infrastructure costs and reduce the operational overhead that was slowing down a fast-growing engineering team.

The result: A 40% increase in delivery speed, a 25% reduction in infrastructure costs, and a 35% boost in developer productivity — without adding headcount to the platform team. Read the Julaya case study.

Turning observability into a predictable cost line

The financial case for in-cluster observability is straightforward once you model it.

External APM costs compound across multiple billing dimensions simultaneously: per-host infrastructure fees, per-host APM fees, per-GB log ingestion, per-GB trace ingestion, custom metric charges per time series, and egress fees on top of everything else. These line items are metered separately, subject to overage pricing, and produce significant monthly variance that makes budgeting genuinely difficult.

In-cluster observability collapses this to a single dimension: cloud object storage. Prometheus remote write and Loki chunk storage land in your S3 or GCS bucket at standard rates. For most environments, even high-volume ones, this represents a cost reduction of 60 to 80 percent compared to an equivalent external APM deployment. And unlike APM pricing, object storage pricing does not change based on how many services you instrument or how many custom metrics you define.
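
Using only the list prices quoted earlier, the gap is stark; the node count and volume are illustrative:

# Illustrative monthly comparison for a 50-node cluster, 500 GB telemetry/month
#   External APM, per-host fees alone: 50 x ($15 + $31) = $2,300
#     (before log ingestion, custom metrics, retention, and egress)
#   In-cluster object storage:         500 GB x $0.023  = ~$11.50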

# Identify your top 10 highest-cardinality metric series
# Run against your in-cluster Prometheus to understand storage cost drivers
topk(10,
  count by (__name__, job, namespace) (
    {__name__=~".+"}
  )
)
# High-cardinality series from specific namespaces are your
# primary lever for controlling Prometheus storage growth

The architectural shift also eliminates the vendor retention cliff. Most commercial APMs default to 15-day retention with significant price increases for longer periods. In-cluster, retention is a storage configuration. You set it based on compliance requirements and operational need, not based on what a vendor tier happens to include.
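
In kube-prometheus-stack, for instance, retention is two lines of Helm values; the figures here are illustrative:

prometheus:
  prometheusSpec:
    retention: 90d           # time-based retention, set by you
    retentionSize: 500GB     # optional size cap; whichever limit hits first wins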

Conclusion

For many organisations, Kubernetes observability spend now rivals core infrastructure costs. Not because the tooling provides equivalent value, but because the pricing model of external APMs is structurally incompatible with the telemetry volumes that Kubernetes generates at scale.

Keeping Prometheus and Loki data inside the cluster fixes the economics. Egress fees disappear. Per-GB ingestion charges disappear. Retention limits become a configuration rather than a billing tier. Storage costs scale at object storage rates rather than APM ingestion rates.

The developer productivity argument is separate but equally concrete. When observability data is accessible and interpretable without deep infrastructure expertise, incidents get resolved faster and SRE time goes toward work that actually improves the platform. If you are auditing cloud spend and want the full picture of where Kubernetes operational overhead accumulates, our article on the 10 best practices for managing Kubernetes at scale is worth reading alongside this one.

The fix does not require replacing your entire monitoring stack overnight. It requires changing where the data lives. Everything else follows from that.

FAQs

Why is Kubernetes observability so expensive with traditional APMs?

Commercial APMs like Datadog and New Relic charge across multiple billing dimensions simultaneously: per-host fees for infrastructure monitoring, separate per-host fees for APM, per-GB charges for log ingestion, per-GB for trace ingestion, and custom metric fees per time series. On top of all that, every byte of telemetry that leaves your cluster for an external ingestion endpoint incurs cloud provider egress charges. At small scale these costs are manageable. At fleet scale they compound in ways that are genuinely difficult to forecast or control, particularly when custom metrics cardinality grows with application complexity.

What is the difference between external APM and in-cluster observability?

External APMs require you to export telemetry data to a third-party vendor's ingestion infrastructure, which triggers egress fees on the way out and per-GB ingestion charges on the way in. In-cluster observability keeps your Prometheus metrics and Loki logs inside your own cloud account, stored in your own object storage bucket at standard cloud provider rates. No egress, no ingestion fees, no vendor-controlled retention limits. The tradeoff is that you own the stack, but on Qovery that ownership is abstracted: the observability infrastructure is provisioned and maintained automatically when a cluster joins the platform.

How does Qovery's AI Copilot reduce the SRE bottleneck?

The SRE bottleneck exists because traditional observability tooling presents data in formats that require infrastructure expertise to interpret. Developers facing production issues escalate to SREs because they cannot read the dashboards themselves. Qovery's AI Copilot translates infrastructure signals into natural language: a developer can ask why their service is slow and get a root cause analysis correlated across logs, metrics, and deployment history, without knowing PromQL or LogQL. That shifts routine incident resolution back to the team that owns the service, and SREs reclaim time for work that actually requires their expertise.


