[Alan] From nginx to Envoy: What Actually Happens When You Swap Your Proxy in Production

Migrating from nginx Ingress to Envoy Gateway? Discover how Alan migrated 100+ services in one month, the technical hurdles they faced (like Content-Length normalization), and why staging isn't always enough.
April 29, 2026
William Occelli
Platform Engineer at Alan

At Alan, we run over a hundred services on Kubernetes managed through Qovery. For years, all our ingress traffic flowed through the nginx Ingress Controller, the de facto standard for routing HTTP traffic into Kubernetes clusters.

In late 2025, Kubernetes announced it would be retiring its nginx Ingress Controller. Qovery, our platform provider, reacted by adopting Envoy Gateway, the emerging standard built on the Kubernetes Gateway API. They announced the deprecation of nginx Ingress in favor of Envoy Gateway, giving teams a migration window before the cutoff.

We needed to migrate all services (including production) in 1 month.

Qovery designed a phased migration path:

  • Phase 0: nginx only (the status quo)
  • Phase 1: Dual stack. Envoy deployed alongside nginx, shadow testing
  • Phase 2: Envoy primary. Traffic routes through Envoy, nginx still available as fallback
  • Phase 3: Remove nginx entirely

This gave us a safety net. We could enable Envoy, observe, and roll back if things went wrong.

In Theory

On paper, the migration was almost boring. Qovery abstracted most of the complexity behind configuration flags. Enabling Envoy was literally a single Terraform setting.

Envoy-specific settings (compression, timeouts, log formats) mapped cleanly to new advanced settings keys. The migration path was clear: enable Envoy alongside nginx, verify traffic, cut over, then remove nginx.

Same traffic, different proxy - it should be painless, right?

In Practice

It was not painless.

But it’s worth noting that most of our services migrated with little to no impact. The majority of our workloads are managed directly through Qovery with standard configurations, and for those, the switch to Envoy was largely transparent. We did observe a couple of global changes, such as 204 header normalization and differences in log severity handling, but these were easy to detect and fix early in the staging phase.

If you’re facing a similar migration, here’s a non-exhaustive list of the surprises we ran into.

The 204 Empty Response Bug

nginx and Envoy disagree on what a “204 No Content” response should look like.

nginx forwarded Content-Length and Content-Type headers on 204 responses as-is. Whatever the upstream sent, the client received.

Envoy is strict about HTTP semantics in both directions: it adds Content-Length: 0 to bodiless requests (which we’ll see later) and strips content headers from bodiless responses (HTTP 204).

The problem? Our Flask backend adds a Content-Type to every response, including 204s. Under nginx, clients also received the matching Content-Length: 0, so the pair stayed consistent. Envoy strips the Content-Length, leaving clients with a Content-Type but no length, which caused parse failures in our iOS and frontend clients.

How we found it: Our continuous deployment end-to-end tests caught it first. Automated Waldo tests started failing after the Envoy cutover in staging, which pointed us straight to the response parsing issue.

Fix: We stripped Content-Type, Content-Length, and Transfer-Encoding from 204 responses in Flask. Quick to find, quick to fix. A good warm-up for what came next.
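For illustration, here’s a minimal sketch of that kind of fix as a Flask after_request hook. It isn’t necessarily the exact code we shipped, but it captures the idea:

```python
from flask import Flask

app = Flask(__name__)

@app.after_request
def strip_entity_headers_on_204(response):
    # A 204 response carries no body, so entity headers are meaningless,
    # and an inconsistent pair (Content-Type without Content-Length)
    # confuses strict clients.
    if response.status_code == 204:
        for header in ("Content-Type", "Content-Length", "Transfer-Encoding"):
            response.headers.pop(header, None)
    return response
```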

The Port Mismatch Outage

This one was a fundamental difference in how nginx and Envoy route traffic to pods.

nginx Ingress bypasses the Kubernetes Service object entirely. It discovers pod endpoints directly and connects to them using pod_IP:targetPort. The Service definition is essentially ignored for routing purposes.

Envoy Gateway routes strictly through the Kubernetes Service. The HTTPRoute targets the Service port, and if the port in your configuration doesn’t match what the Service actually exposes, you get an HTTP 500.

Our Nextcloud instance, used as a secure document exchange platform for regulated file sharing, had a mismatch: the K8s Service exposed port 8080 (with targetPort 80 on the pod), but our Qovery configuration declared internal_port=80. nginx never cared; it went straight to the pod. But Envoy tried to reach port 80 on the Service, which didn’t exist, leading to a 500 error.

The real kicker? We didn’t catch this in staging because this service isn’t monitored or actively used there. It went straight to prod.

Fix: Updated internal_port from 80 to 8080. Fixed same day, but it was a wake-up call.
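One way to catch this class of mismatch before a cutover is to compare what your platform declares with what the Kubernetes Service actually exposes. Here’s a rough sketch using the official kubernetes Python client; declared_ports is a placeholder for whatever holds your platform-side port configuration (in our case, the Qovery internal_port values):

```python
from kubernetes import client, config

# Placeholder data: load these from your real source of truth instead.
declared_ports = {
    ("documents", "nextcloud"): 80,  # (namespace, service) -> declared port
}

config.load_kube_config()
v1 = client.CoreV1Api()

for (namespace, service), declared in declared_ports.items():
    svc = v1.read_namespaced_service(service, namespace)
    exposed = {p.port for p in svc.spec.ports}
    if declared not in exposed:
        print(f"{namespace}/{service}: declared port {declared} is not "
              f"exposed by the Service (it exposes {sorted(exposed)})")
```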

The 0KB File Mystery

This one was the most insidious.

Six days after the Envoy cutover, we noticed that all files uploaded to our secure document exchange platform via WebDAV chunked upload appeared as 0KB. The upload “succeeded”, no errors, HTTP 201 on every step, but the resulting files were empty.

The data wasn’t actually lost. The objects in S3 had correct sizes. Only file metadata was wrong. But from the user’s perspective, every file uploaded since the Envoy switch was broken.

Here’s the thing about WebDAV chunked uploads: the flow is MKCOL (create upload directory) → PUT (upload chunks) → MOVE (assemble chunks into final file). That last MOVE request has no body; it just tells the server to take those chunks and combine them.
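For the curious, the flow looks roughly like this as plain HTTP calls. Host, credentials, and paths are illustrative, loosely following Nextcloud’s chunked-upload convention:

```python
import requests

base = "https://files.example.com/remote.php/dav"   # illustrative host
auth = ("alice", "app-password")                     # illustrative credentials
upload = f"{base}/uploads/alice/upload-42"

# 1. MKCOL: create the upload directory.
requests.request("MKCOL", upload, auth=auth)

# 2. PUT: upload each chunk into that directory.
for i, chunk in enumerate([b"first part", b"second part"]):
    requests.put(f"{upload}/{i:05d}", data=chunk, auth=auth)

# 3. MOVE: a bodiless request that assembles the chunks at the destination.
requests.request(
    "MOVE",
    f"{upload}/.file",
    headers={"Destination": f"{base}/files/alice/report.pdf"},
    auth=auth,
)
```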

nginx doesn’t add a Content-Length header to bodiless requests. The header is absent or empty.

Envoy always normalizes bodiless requests by adding an explicit Content-Length: 0. Ironically, this is the mirror image of the HTTP 204 issue described above: there, Envoy removed a header; here, it adds one where there was none.

But Nextcloud’s SabreDAV library reads the Content-Length from the MOVE request and uses it as the assembled file size. When it sees 0, it interprets this as "the assembled file should be 0 bytes." The actual chunk data is irrelevant, the metadata says zero, so zero it is.

How we found it: Because the incident was communicated loudly to the right people, a colleague quickly suspected Envoy. They proved it by testing the same upload through both paths on the same pod: direct to nginx worked, through Envoy failed. From there, we binary-searched the differences between the two paths: split them in half, test each group, narrow down. Two hours later, one culprit: Content-Length: 0.

Fix: We stripped Content-Length from WebDAV MOVE requests before they reach the application, so SabreDAV never sees the Envoy-injected zero. No more 0KB files!
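Nextcloud itself is PHP, so the real fix won’t look exactly like this; still, the idea is simple enough to express as a minimal WSGI wrapper:

```python
class StripMoveContentLength:
    """Drop the Content-Length header on WebDAV MOVE requests.

    Sketch of the idea only; in a real deployment the strip would more
    likely live at the proxy or ingress layer in front of the app.
    """

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") == "MOVE":
            # Envoy normalizes bodiless requests to Content-Length: 0;
            # remove it so the backend never interprets the injected zero.
            environ.pop("CONTENT_LENGTH", None)
        return self.app(environ, start_response)
```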

The Emergency Logs False Alarm

When a user closes their browser tab mid-request, the proxy logs a client disconnect. But nginx and Envoy represent this event very differently, which led to some unexpected consequences.

nginx logs client disconnects as HTTP 499, a non-standard status code that nginx invented for exactly this purpose. In our log pipeline, Datadog maps 4xx statuses to severity warning. Nobody gets pinged.

Envoy logs client disconnects as HTTP 0 with a status_details field like downstream_remote_disconnect. Our Datadog log integration pipeline maps status 0 to severity... emergency.

Suddenly our oncall was getting pinged for what used to be routine background noise. Same events, different proxy, different severity, and a lot of unnecessary adrenaline.

Fix: We added a Datadog Category Processor to remap http.status_code:0 with specific status_details patterns (remote reset, client disconnect) back to warning.
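The processor itself lives in the Datadog pipeline configuration, but the rule it encodes is simple. Expressed as illustrative Python (field names and hint strings are placeholders for the actual log attributes), it boils down to:

```python
def remap_disconnect_severity(status_code: int, status_details: str,
                              current_severity: str) -> str:
    # Envoy reports client disconnects as status 0 with a status_details
    # hint; treat those as routine warnings instead of emergencies.
    disconnect_hints = ("downstream_remote_disconnect", "remote_reset",
                        "client_disconnect")
    if status_code == 0 and any(h in status_details for h in disconnect_hints):
        return "warning"
    return current_severity
```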

Learnings

Don’t rely on staging alone.

We caught several issues in staging first. But the Nextcloud outage went straight to prod because the service isn’t used in staging. And the Datadog severity mapping only became a problem at production log volume.

Staging gives you confidence. It doesn’t give you certainty. And this gap is only going to widen. As AI-assisted development accelerates the pace of change, the idea that staging can catch everything becomes increasingly unrealistic. You need production observability, fast rollback mechanisms, and a culture that handles production incidents effectively.

You will break things. Whether that’s terrible or great depends on your culture.

You won’t get through a migration like this without breaking something. A file upload protocol nobody would think to test synthetically. A port configuration that didn’t matter until the proxy changed how it routes. A log severity mapping buried three layers deep in a Datadog pipeline.

The question isn’t “how do we avoid breakage?”, it’s “how do we make breakage small, visible, and fast to fix?” That’s a culture problem, not a tech problem. Three things at Alan made the difference:

  • Distributed responsibility, not blame. Nobody got blamed. Our teams have a bias for action: ship fast, catch issues in real conditions, fix forward. Responsibility is shared, both for success and for failure. We’re all in the same boat. And this isn’t just empty talk; the tone of the Slack threads during the migration reflected it.
  • Radical transparency. Every step of the migration had a thread telling the real story, written down and accessible to the right people. Having a clear, written communication trail allowed us to identify and redirect issues faster. In addition, sharing about errors and outages helps us continuously improve and avoid making the same mistakes over and over again.
  • Strong ownership, open collaboration. One person drives the project. Clear ownership means issues are acknowledged faster, and you’re accountable (in the best sense) for the migration to succeed. A single point of contact doesn’t mean working in isolation; it means the team always knows where to go. Coupled with clear communication, you can move fast and identify blockers quickly.

Maintain a close relationship with your key partners

Qovery’s phased migration approach was critical, and the dual-stack phase meant we could run Envoy alongside nginx and compare behavior before committing. Their support during our incidents was responsive and hands-on. And when we flagged that the original deadline was too short to test properly, they extended it without friction.

The time you invest in your vendor relationship pays off exactly when you need it most. This means not just filing bug reports or feature requests, but sharing context when things are smooth, celebrating wins together, and being a real partner rather than just a customer. When something breaks and you can reach someone who already knows your setup, your constraints, and your history, the conversation can move directly to resolution. That trust isn’t built during incidents. It’s built in the quiet moments before them.

-----

The Alan team’s journey highlights a fundamental truth of modern engineering: infrastructure will always evolve, and breaking changes are inevitable. The difference between a month-long migration and a multi-year legacy burden lies in the tools and partners you choose.

By leveraging Qovery, Alan was able to transition from nginx to Envoy Gateway across more than a hundred services using a controlled, phased approach, turning what could have been a manual infrastructure nightmare into a series of manageable configuration shifts.

Don’t let infrastructure migrations stall your roadmap. Experience the power of Qovery for free or book a demo with our engineering team today.
