How We Built an Agentic DevOps Copilot to Automate Infrastructure Tasks and Beyond



If you are curious to see our Agentic DevOps Copilot in action, here are a few short demo videos:
- DevOps Copilot computes the deployment success rate
- DevOps Copilot generates a CISO report for the last week
- DevOps Copilot prevents modifying resources (read-only mode)
Phase 1: The Basic Agent
We started with a simple agent architecture: detect a user’s intent, and map that intent to a tool or action.

For example:
“Stop all dev environments after 6pm” → matches the stop-env tool
Each intent had its own mapping logic, and tools were invoked accordingly.
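To make this concrete, here is a minimal sketch of what hardcoded intent-to-tool mapping can look like. It is an illustration, not our actual implementation: the regex-based matching, the `stop_env` function body, and the `handle` helper are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical tool: a real version would call the infrastructure API.
def stop_env(scope: str) -> str:
    return f"stopping all {scope} environments"

# Each intent is a hand-written pattern paired with the tool it triggers.
# Supporting a new intent means adding another entry here.
INTENT_RULES = [
    (
        re.compile(r"stop all (\w+) environments", re.IGNORECASE),
        lambda match: stop_env(match.group(1)),
    ),
]

def handle(request: str) -> Optional[str]:
    """Return the tool result, or None when no hardcoded intent matches."""
    for pattern, action in INTENT_RULES:
        match = pattern.search(request)
        if match:
            return action(match)
    return None

print(handle("Stop all dev environments after 6pm"))
print(handle("How can I optimize my Dockerfile?"))
```

The second request returns `None` because no rule covers it, which is exactly the limitation described in the cons below.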
✅ Pros:
- Easy to implement
- Predictable behavior
- Clear control over each action
❌ Cons:
- Every new intent had to be hardcoded
- Complex workflows required chaining tools manually
- No flexibility for unexpected or unplanned user requests
This version served well for a few internal use cases. But as soon as real users started asking real questions, such as "How can I optimize my Dockerfile?" or "Why is my deployment time high this week?", we hit a wall. We needed something more flexible.
Phase 2: Going Agentic
The second phase was a real leap: we designed an Agentic system.

Instead of hardcoded intent-to-tool mapping, the Agentic DevOps Copilot receives user input, analyzes it, and dynamically plans a sequence of tool invocations to fulfill the request. Think of it as having a toolbox and letting the AI figure out which tools to use and in what order.
Each tool:
- Has a clear interface (input/output)
- Is versioned and stateless
- Can be independently tested and improved
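The sketch below shows one way such a toolbox could be modeled: each tool is a versioned, stateless function from a dict to a dict, and a plan is an ordered list of tool names. The `Tool` class, the two example tools, their sample data, and `execute_plan` are hypothetical names chosen for illustration. In the real system the model produces the plan; here it is hardcoded.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Tool:
    name: str
    version: str
    run: Callable[[dict[str, Any]], dict[str, Any]]  # stateless: dict in, dict out

def list_deployments(args: dict[str, Any]) -> dict[str, Any]:
    # Made-up sample data; a real tool would query the deployment history.
    return {"deployments": [{"ok": True}, {"ok": True}, {"ok": False}, {"ok": True}]}

def success_rate(args: dict[str, Any]) -> dict[str, Any]:
    runs = args["deployments"]
    return {"success_rate": sum(d["ok"] for d in runs) / len(runs)}

REGISTRY = {
    tool.name: tool
    for tool in (
        Tool("list_deployments", "1.0.0", list_deployments),
        Tool("success_rate", "1.0.0", success_rate),
    )
}

def execute_plan(plan: list[str], initial: dict[str, Any]) -> dict[str, Any]:
    """Run the tools in order, feeding each tool's output to the next one."""
    data = initial
    for step in plan:
        data = REGISTRY[step].run(data)
    return data

# Stand-in for the plan the model would generate from the user's request.
plan = ["list_deployments", "success_rate"]
result = execute_plan(plan, {})
print(result)
```

Because each tool only depends on its input dict, it can be unit-tested on its own, and swapping in a new version does not affect the others as long as the input and output shapes stay the same.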
Benefits:
- Far more scalable and flexible
- Can solve unanticipated user needs
- Encourages clean tool abstraction

Challenges:
- Tool chaining is fragile - outputs must match expected inputs
- If one tool fails or behaves unexpectedly, the whole plan breaks
That’s when we realized: a dynamic system needs to be able to fail gracefully and recover. That took us to phase three.
Phase 3: Resilience and Recovery
We added resiliency layers and robust retry logic into the agentic execution flow.
Now, if the agent misuses a tool or the tool returns an unexpected output, the system:
- Analyzes the failure
- Updates its plan or fixes the step
- Retries with a corrected approach
This required tracking intermediate state, running validation between tool steps, and allowing re-planning if an execution fails.
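A simplified sketch of that loop follows. It assumes a plan is a list of `(tool name, arguments)` pairs and that a `replan` callback (the model, in the real system) returns a corrected remainder of the plan. All names here, including `run_with_recovery`, `StepError`, and the example tool, are hypothetical.

```python
from typing import Any, Callable

Step = tuple[str, dict[str, Any]]  # (tool name, arguments)

class StepError(Exception):
    """Raised when a tool's output does not pass validation."""

def run_with_recovery(
    plan: list[Step],
    tools: dict[str, Callable[[dict[str, Any], dict[str, Any]], Any]],
    validators: dict[str, Callable[[Any], bool]],
    replan: Callable[[list[Step], Exception, dict[str, Any]], list[Step]],
    max_attempts: int = 3,
) -> dict[str, Any]:
    state: dict[str, Any] = {}  # intermediate results, keyed by tool name
    index, attempts = 0, 0
    while index < len(plan):
        name, args = plan[index]
        try:
            output = tools[name](args, state)
            if not validators[name](output):  # validation between steps
                raise StepError(f"{name} returned unexpected output: {output!r}")
        except Exception as error:
            attempts += 1
            if attempts >= max_attempts:
                raise  # give up after repeated failures on the same step
            # Keep the completed steps, ask for a corrected remainder.
            plan = plan[:index] + replan(plan[index:], error, state)
            continue
        state[name] = output
        index, attempts = index + 1, 0
    return state

# Hypothetical tool that rejects a badly typed argument.
def get_deploy_times(args: dict[str, Any], state: dict[str, Any]) -> dict[str, Any]:
    if not isinstance(args["days"], int):
        raise TypeError("days must be an integer")
    return {"avg_seconds": 120.0, "days": args["days"]}  # made-up value

# Stand-in for the model: repair the failing step's arguments.
def fix_days(remaining: list[Step], error: Exception, state: dict[str, Any]) -> list[Step]:
    name, args = remaining[0]
    repaired = {**args, "days": int(str(args["days"]).rstrip("d"))}
    return [(name, repaired)] + remaining[1:]

state = run_with_recovery(
    plan=[("get_deploy_times", {"days": "7d"})],
    tools={"get_deploy_times": get_deploy_times},
    validators={"get_deploy_times": lambda out: "avg_seconds" in out},
    replan=fix_days,
)
print(state)
```

The first attempt fails because `"7d"` is not an integer, the replan step repairs the argument, and the retry succeeds. Capping the attempts per step keeps a plan that cannot be repaired from looping forever.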
Without this, reliability drops fast. With it, we started seeing successful completions of multi-step workflows that weren’t even anticipated during development.
Phase 4: Agentic + Memory
At this point, the Agentic Copilot could dynamically respond and recover from errors, but each request was treated in isolation.


That’s not how humans work. If I ask a follow-up question like:
“What about the staging cluster?”
… it should relate to my previous question about the production cluster. But it didn’t.
So we introduced conversation memory. It allows the Agentic DevOps Copilot to:
- Reuse previous answers
- Understand references and context
- Maintain continuity across a session
This drastically improved user experience - and opened the door to deeper, multi-step optimization and monitoring tasks.
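As an illustration of the idea, the sketch below keeps the last few turns of a session in a simple in-memory buffer and prepends them to each new question, so that a follow-up can be interpreted in context. This is an assumption about one possible design, not a description of how our memory is stored; `SessionMemory`, `Turn`, and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    question: str
    answer: str

@dataclass
class SessionMemory:
    max_turns: int = 5  # keep the prompt bounded
    turns: list[Turn] = field(default_factory=list)

    def remember(self, question: str, answer: str) -> None:
        self.turns.append(Turn(question, answer))
        self.turns = self.turns[-self.max_turns:]  # drop the oldest turns

    def build_prompt(self, question: str) -> str:
        """Prefix the new question with the recent conversation history."""
        lines = []
        for turn in self.turns:
            lines.append(f"User: {turn.question}")
            lines.append(f"Copilot: {turn.answer}")
        lines.append(f"User: {question}")
        return "\n".join(lines)

memory = SessionMemory()
memory.remember(
    "What is the deployment success rate on the production cluster?",
    "(previous answer)",
)
prompt = memory.build_prompt("What about the staging cluster?")
print(prompt)
```

With the earlier question included in the prompt, the planner has what it needs to read "What about the staging cluster?" as a request for the same metric on a different cluster.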
What's Next
The Agentic DevOps Copilot is just getting started. We’re exploring:
- Improving the speed of planning: planning a complex task can take up to 10 seconds, which is acceptable in a testing phase but too slow for production.
- Self-hosted models: we currently use Claude 3.7 Sonnet. Even though we don't send sensitive information, we are a business solution and want to let our users choose models that fit their compliance standards.
- Long-term memory across sessions: to tailor the experience to each user and learn from previous interactions.
We’re not building another chatbot. We’re building DevOps automation with a brain - so your team can focus on building products, not managing infrastructure.
Want to try it?
Our Copilot is in Alpha. You can ask it:
- “Generate usage stats over the last 30 days for team X.”
- “Optimize this Dockerfile.”
- “Stop all environments inactive for 6h and notify the team.”
If you’re a Qovery user and want early access, check the Slack message in your workspace or contact us directly.
--
Learn more about how we use Qdrant, a powerful open-source vector database
