How I Think About AI Agents in Software Delivery Without Letting Them Wreck Production

AI agents are showing up everywhere in software delivery toolchains. Some of that is genuine progress. A lot of it is risk dressed up as capability.

I’ve spent the past couple of years running experiments with real engineering teams, on real codebases, trying to figure out where agents actually help and where they create new categories of failure. Here’s how I think about it now.

The core question isn’t “can it do this?” — it’s “what happens when it’s wrong?”

This is the frame I keep coming back to. An AI agent that generates test cases is interesting. An AI agent that auto-merges PRs based on test results is a different kind of problem entirely.

The failure mode matters more than the capability. Before deploying any agent in a delivery pipeline, I want to know:

How does this fail? Silent corruption? Noisy but recoverable errors? Cascading failures?
How quickly do we detect the failure? Seconds, hours, or after it’s in production?
What’s the blast radius? Does a wrong decision affect one PR, one service, or the whole platform?

Agents with narrow scope, fast feedback loops, and low blast radius are where I start.
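
One way to make those three questions concrete is to write them down as a small risk profile per agent. The sketch below is illustrative only: the `AgentRiskProfile` type, the category names, the example agents, and the one-hour detection threshold are all assumptions of mine, not a standard, and you would tune them to your own environment.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    NOISY_RECOVERABLE = "noisy_recoverable"
    SILENT_CORRUPTION = "silent_corruption"
    CASCADING = "cascading"


class BlastRadius(Enum):
    SINGLE_PR = 1
    SINGLE_SERVICE = 2
    WHOLE_PLATFORM = 3


@dataclass(frozen=True)
class AgentRiskProfile:
    name: str
    failure_mode: FailureMode
    detection_seconds: float  # typical time until a wrong output is noticed
    blast_radius: BlastRadius


def is_safe_starting_point(profile: AgentRiskProfile,
                           max_detection_seconds: float = 3600.0) -> bool:
    """True only for noisy failures, fast detection, and single-PR scope."""
    return (
        profile.failure_mode is FailureMode.NOISY_RECOVERABLE
        and profile.detection_seconds <= max_detection_seconds
        and profile.blast_radius is BlastRadius.SINGLE_PR
    )


# Hypothetical profiles, filled in by hand after observing each agent.
test_generator = AgentRiskProfile(
    name="test-generator",
    failure_mode=FailureMode.NOISY_RECOVERABLE,
    detection_seconds=300,
    blast_radius=BlastRadius.SINGLE_PR,
)
auto_merger = AgentRiskProfile(
    name="auto-merger",
    failure_mode=FailureMode.SILENT_CORRUPTION,
    detection_seconds=7 * 24 * 3600,
    blast_radius=BlastRadius.SINGLE_SERVICE,
)

print(is_safe_starting_point(test_generator))  # True
print(is_safe_starting_point(auto_merger))     # False
```

The value is less in the function than in being forced to fill in the three fields before an agent goes anywhere near the pipeline.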

Where agents tend to work well

Code review assistance. Not as the final gate — as a first pass. Agents are good at catching patterns: missing null checks, inconsistent error handling, obvious performance issues. They free up human reviewers to focus on architecture and intent.

Test generation as a starting point. AI-generated tests aren’t production-ready tests, but they’re a useful scaffold. They surface edge cases a developer might not have considered, and they’re cheap to throw away. The key is treating them as drafts, not deliverables.

Incident triage. Parsing log data, correlating signals, and suggesting probable causes. Agents don’t replace the engineer who actually understands the system, but they can compress the time to first hypothesis from thirty minutes to two.

Documentation. Generating first drafts from code, keeping runbooks synchronized with system changes. This is low-stakes, high-volume work, exactly the kind where agents add value without adding much risk.

Where I’ve seen agents create problems

Anywhere in the critical path without a human checkpoint. Auto-deploy agents, auto-merge agents, agents that make resource allocation decisions — all of these need a human in the loop until you have a very clear picture of their failure modes in your specific environment.
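
A human checkpoint can be as simple as a gate that refuses to execute critical-path actions without a named approver. This is a minimal sketch, not a real pipeline integration: the `ProposedAction` type, the action kinds, and the target names are hypothetical, and in practice the approval would come from your review or deployment tooling.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of action kinds that sit on the critical path.
CRITICAL_PATH_ACTIONS = {"merge", "deploy", "allocate_resources"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str                          # what the agent wants to do
    target: str                        # hypothetical identifier of the PR or service
    approved_by: Optional[str] = None  # set only after a human signs off


def may_execute(action: ProposedAction) -> bool:
    """Critical-path actions need a human approver; everything else passes."""
    if action.kind in CRITICAL_PATH_ACTIONS:
        return action.approved_by is not None
    return True


print(may_execute(ProposedAction("comment", "example-pr")))                      # True
print(may_execute(ProposedAction("deploy", "example-service")))                  # False
print(may_execute(ProposedAction("deploy", "example-service", "on-call-lead")))  # True
```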

When the agent’s training distribution doesn’t match your codebase. Generic agents trained on public code perform differently on internal codebases with domain-specific patterns. Measure accuracy on a representative sample of your own code before you trust the agent’s outcomes.
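
One way to measure this, assuming you can collect a sample where both the agent and a human reviewer judged the same change, is to compare the two sets of verdicts directly. The `ReviewSample` record and the idea of a binary "flagged" verdict are simplifications I am assuming for illustration; real review output is usually messier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewSample:
    # Hypothetical record: one change judged by both the agent and a human.
    agent_flagged: bool
    human_flagged: bool


def agreement_metrics(samples: list[ReviewSample]) -> dict[str, float]:
    """Score the agent against human verdicts, treating the human as ground truth."""
    if not samples:
        raise ValueError("need at least one sample")
    tp = sum(s.agent_flagged and s.human_flagged for s in samples)
    fp = sum(s.agent_flagged and not s.human_flagged for s in samples)
    fn = sum(not s.agent_flagged and s.human_flagged for s in samples)
    tn = len(samples) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(samples),
        # Precision: of the changes the agent flagged, how many did the human flag too?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall: of the changes the human flagged, how many did the agent catch?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up sample purely to show the call; use verdicts from your own codebase.
samples = [
    ReviewSample(agent_flagged=True, human_flagged=True),
    ReviewSample(agent_flagged=True, human_flagged=False),
    ReviewSample(agent_flagged=False, human_flagged=True),
    ReviewSample(agent_flagged=False, human_flagged=False),
    ReviewSample(agent_flagged=True, human_flagged=True),
]
metrics = agreement_metrics(samples)
print(metrics["accuracy"])  # 0.6
```

Accuracy alone can hide the silent misses, so recall against human-flagged problems is usually the number I would look at first.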

When failure is silent. The worst agent failures aren’t crashes — they’re subtly wrong outputs that pass tests, get merged, and sit in production for weeks before someone notices the behavior is slightly off.

A practical deployment sequence

When I’m introducing agents into a delivery pipeline, I do it in stages:

1. Observe only. Run the agent alongside the existing process and compare outputs. Don’t act on them.
2. Recommend with human approval. Surface agent suggestions in the workflow. Measure how often engineers accept, modify, or reject them.
3. Automate low-stakes decisions. Start with decisions that are easy to reverse and have short feedback loops.
4. Expand scope carefully. Each expansion requires its own observation period.
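
The four stages can be sketched as a promotion gate that only moves an agent forward when the evidence supports it. Everything named here is an assumption for illustration: the `Stage` enum, the outcome labels, and especially the thresholds (50 samples, 90% acceptance), which are placeholders rather than recommendations. In the observe-only stage, the outcomes would come from comparing the agent’s output with what the humans actually did.

```python
from collections import Counter
from enum import IntEnum


class Stage(IntEnum):
    OBSERVE_ONLY = 1
    RECOMMEND = 2
    AUTOMATE_LOW_STAKES = 3
    EXPANDED_SCOPE = 4


def acceptance_rate(outcomes: list[str]) -> float:
    """Share of suggestions accepted unchanged.

    Outcomes are expected to be 'accept', 'modify', or 'reject';
    any other label is ignored.
    """
    counts = Counter(outcomes)
    total = counts["accept"] + counts["modify"] + counts["reject"]
    return counts["accept"] / total if total else 0.0


def next_stage(current: Stage, outcomes: list[str],
               min_samples: int = 50, min_acceptance: float = 0.9) -> Stage:
    """Promote by at most one stage, and only with enough evidence."""
    if current is Stage.EXPANDED_SCOPE:
        return current  # already at the last stage
    if len(outcomes) < min_samples:
        return current  # not enough observations yet
    if acceptance_rate(outcomes) < min_acceptance:
        return current  # engineers are still correcting the agent too often
    return Stage(current + 1)


# Made-up observation windows to show the three possible results.
enough_evidence = ["accept"] * 47 + ["modify"] * 2 + ["reject"] * 1
too_few = ["accept"] * 10
too_many_rejections = ["accept"] * 30 + ["reject"] * 20

print(next_stage(Stage.RECOMMEND, enough_evidence).name)      # AUTOMATE_LOW_STAKES
print(next_stage(Stage.RECOMMEND, too_few).name)              # RECOMMEND
print(next_stage(Stage.RECOMMEND, too_many_rejections).name)  # RECOMMEND
```

To respect the rule that each expansion gets its own observation period, the outcome list would be reset after every promotion rather than carried over.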

This isn’t slow — it’s how you avoid the agent incident that ends the whole program.

The meta-point

AI agents are tools. The question isn’t whether to use them; it’s whether you understand the tool well enough to deploy it safely in your environment. Most teams that have had bad experiences didn’t have bad agents. They lacked a clear picture of where the agent was reliable and where it wasn’t.

That understanding doesn’t come from reading benchmarks. It comes from running controlled experiments in your actual context.