The Agent Quality Crisis Nobody's Measuring
We tracked code quality metrics across 14 engineering teams using AI coding agents and found that while velocity increased by 40%, architectural coherence declined by 28% and cross-module defects tripled. Teams are measuring the wrong thing.
Green dashboards, rotting codebase
Something is happening inside engineering organisations that adopt AI coding agents, and most of them can’t see it yet. The dashboards are green. Velocity is up. Sprint commitments are being met. PRs are merging faster. By every metric that engineering leaders report to their boards, AI-assisted development is working.
We spent four months studying 14 engineering teams across eight organisations, all using AI coding agents as part of their daily workflow. We tracked the metrics they were already reporting: story points completed, PRs merged per sprint, cycle time, deployment frequency. All up. Some dramatically.
Then we tracked the metrics they weren’t reporting.
Architectural coherence, measured by consistency of patterns across modules, declined by 28% on average. Cross-module defect rates tripled. Code duplication in non-trivial logic (not boilerplate, actual business logic) increased by 34%. And the time spent on “mystery bugs,” defects that took more than two days to diagnose, went up by 67%.
The teams that looked healthiest by velocity metrics were often the ones deteriorating fastest by structural metrics. That divergence is the crisis.
What we measured and how
We need to be specific about methodology, because “code quality” means different things to different people.
For each of the 14 teams, we established baselines from the six months before agent adoption and compared against the six months after. We tracked five categories:
Velocity metrics (the ones teams already measured): story points, PRs merged, cycle time, deployment frequency. These are table stakes. Every team had them.
Architectural coherence: we measured pattern consistency across modules. When a team has established conventions for error handling, data access, state management, and API design, do new modules follow those conventions? We used a combination of static analysis and manual review by senior engineers from outside each team. We scored coherence on a 1-10 scale across six architectural dimensions.
Cross-module defect density: bugs that manifest in module A but originate from changes in module B. These are the expensive ones. They require understanding of system-level interactions, not just local logic.
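Computing this ratio is mechanical once each defect is tagged with the module where it surfaced and the module whose change introduced it. A minimal sketch in Python; the record shape and field names are illustrative, not taken from the study's tooling:

```python
from dataclasses import dataclass

@dataclass
class Defect:
    manifest_module: str   # module where the bug surfaced
    origin_module: str     # module whose change introduced it

def cross_module_ratio(defects: list[Defect]) -> float:
    """Fraction of defects that manifest in a different module
    than the one whose change introduced them."""
    if not defects:
        return 0.0
    cross = sum(1 for d in defects if d.manifest_module != d.origin_module)
    return cross / len(defects)

defects = [
    Defect("payments", "payments"),   # local defect
    Defect("payments", "billing"),    # cross-module defect
    Defect("auth", "auth"),
    Defect("checkout", "payments"),   # cross-module defect
]
print(cross_module_ratio(defects))  # 0.5
```

The hard part is not the arithmetic but the tagging discipline: root-cause analysis has to record the originating module, not just where the ticket was filed.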
Knowledge fragmentation: we looked for indicators that institutional knowledge was not being preserved. The same utility reimplemented in different modules. Contradictory approaches to the same problem in code written weeks apart. Configuration values hardcoded in one place and pulled from environment variables in another.
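The first of those indicators, the same utility reimplemented in different modules, can be approximated mechanically by fingerprinting functions with their identifiers normalised away, so that renamed copies still collide. A rough sketch for Python codebases (an illustration of the idea, not the study's instrumentation; it is deliberately crude and will produce some false positives):

```python
import ast
import copy
import hashlib

class _Normalize(ast.NodeTransformer):
    """Rename variables and arguments to positional placeholders so
    structurally identical functions hash alike even when the author
    (or an agent session) chose different names."""
    def __init__(self) -> None:
        self.names: dict[str, str] = {}

    def _canon(self, name: str) -> str:
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._canon(node.arg)
        return node

def _fingerprint(func: ast.FunctionDef) -> str:
    func = copy.deepcopy(func)
    func.name = "f"                      # ignore the function's name
    _Normalize().visit(func)
    dump = ast.dump(func, annotate_fields=False)
    return hashlib.sha256(dump.encode()).hexdigest()[:12]

def duplicate_functions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Group functions across files by structural fingerprint;
    return only groups with more than one member."""
    groups: dict[str, list[str]] = {}
    for path, src in sources.items():
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.FunctionDef):
                groups.setdefault(_fingerprint(node), []).append(
                    f"{path}:{node.name}")
    return {k: v for k, v in groups.items() if len(v) > 1}

dupes = duplicate_functions({
    "payments/util.py": "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
    "billing/helpers.py": "def bound(value, low, high):\n    return max(low, min(value, high))\n",
})
print(dupes)  # one group containing both reimplementations
```

Even a detector this blunt is enough to trend a knowledge duplication index from sprint to sprint.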
Review effectiveness: how often did code review catch structural issues versus only catching surface-level problems (naming, formatting, obvious logic errors)?
The velocity trap
Every single team saw velocity improvements. The average was 40% more story points completed per sprint. Two teams saw gains above 60%. Engineering leaders were, understandably, pleased.
But velocity in software engineering has always been a lagging indicator disguised as a leading one. Shipping faster feels like progress. Sometimes it is. Sometimes you’re just accumulating structural problems at a higher rate.
Here is what 40% more velocity looked like in practice at one of our studied organisations, a mid-sized fintech with 120 engineers across eight teams.
Their payments processing team adopted an AI coding agent in May 2025. By August, the team was closing tickets 45% faster. Their sprint velocity chart looked like a hockey stick. The engineering director presented it at an all-hands as proof that the AI investment was paying off.
What the velocity chart didn’t show: in that same period, the payments module had grown three distinct patterns for handling transaction retries. The original pattern, documented in the team’s architecture decision records, used an exponential backoff with a circuit breaker. An agent session in June introduced a simpler linear retry, probably because the prompt didn’t include context about the backoff strategy. A third session in July implemented yet another approach, this time with a retry queue.
All three patterns worked. All three passed tests. All three merged through code review. None of them were wrong in isolation. Together, they created a system where retry behaviour was unpredictable depending on which code path a transaction followed. The bug that eventually surfaced, transactions being retried differently based on their entry point, took nine days to diagnose.
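The shape of that divergence is easy to see in miniature. The snippets below are hypothetical simplifications, not the organisation's code (the circuit breaker and queue are omitted): both policies "work" and both would pass a test that retries happen, but a transaction's total wait differs sharply depending on which code path it takes.

```python
def backoff_delays(attempts: int, base: float = 0.5) -> list[float]:
    """Original documented pattern: exponential backoff
    (circuit breaker omitted for brevity)."""
    return [base * (2 ** i) for i in range(attempts)]

def linear_delays(attempts: int, step: float = 0.5) -> list[float]:
    """Pattern introduced by a later agent session: linear retry."""
    return [step * (i + 1) for i in range(attempts)]

# Both are locally correct. But five retries wait very different
# total amounts, so system behaviour depends on the entry point:
print(sum(backoff_delays(5)))  # 15.5 seconds total
print(sum(linear_delays(5)))   # 7.5 seconds total
```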
Nine days. For a bug that existed because three separate agent sessions each made a locally correct decision without awareness of the others.
Locally correct, globally incoherent
This is the pattern we saw repeated across almost every team we studied. AI coding agents are remarkably good at generating code that is correct within its immediate context. Given a clear prompt and a well-scoped task, the output is often clean, well-structured, and functionally correct.
The problem is scope. An agent session operates within a context window. It sees the files it’s given, the prompt it receives, and whatever context it retrieves. It does not see the architectural intent behind the codebase. It does not know that the team chose the Repository pattern over Active Record for specific reasons six months ago. It does not know that the data validation approach in the user module was deliberately different from the one in the admin module because of a regulatory requirement.
We found this pattern in 11 of 14 teams. The agents produced code that was locally correct and globally incoherent. Each individual PR looked fine. The codebase, viewed as a whole, was slowly losing its internal consistency.
One team lead described it well during our interviews: “Every PR looks reasonable. I can’t point to a single one and say this is wrong. But when I zoom out and look at what we’ve built over the last three months, I don’t recognise the architecture anymore.”
The review problem
Code review is supposed to catch this. In practice, it doesn’t.
We analysed 2,400 PR reviews across our 14 teams. Before agent adoption, reviews caught structural issues (pattern violations, architectural inconsistencies, duplication of existing functionality) in about 18% of cases where such issues existed. After agent adoption, that rate dropped to 6%.
Two factors drive this decline.
First, volume. When agents generate code faster, the review queue grows. Reviewers face pressure to keep pace. They spend less time per review and naturally focus on what’s most visible: does the code work, is it readable, are there obvious bugs? Structural coherence requires holding the whole system in your head, or at least knowing what to look for. Under time pressure, that’s the first thing to go.
Second, confidence anchoring. AI-generated code tends to be well-formatted, consistently styled, and fluent. It looks professional. Reviewers told us they felt less inclined to challenge code that “looked like a senior engineer wrote it.” One reviewer said, “I catch myself assuming the agent knows what it’s doing. I have to actively remind myself to look for the things it can’t know.”
The result is a quality gate that was already imperfect becoming functionally decorative. Reviews still happen. They catch typos and logic errors. They rarely catch the slow architectural drift that compounds over months.
Five patterns of agent-induced degradation
Across our studied teams, we identified five recurring patterns of quality degradation. Not every team exhibited all five, but every team exhibited at least three.
Pattern drift. The codebase gradually develops multiple ways of doing the same thing. Error handling, data access, state management, logging. Instead of one consistent approach, you get two, then four, then six. Each individually defensible. Collectively, a maintenance burden that grows with every sprint.
Boundary erosion. Module boundaries exist for reasons. Sometimes performance. Sometimes separation of concerns. Sometimes regulatory. Agents don’t know why a boundary exists. They just see that the data they need is over there, and the fastest path is a direct import. We found boundary violations (modules reaching into the internals of other modules) increased by 41% after agent adoption.
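Violations of this kind are mechanically detectable. A sketch of a minimal import checker, assuming an illustrative convention (not from the study) that submodules prefixed with an underscore are private to their own top-level package:

```python
import ast

def boundary_violations(source: str, own_module: str) -> list[str]:
    """Flag imports that reach into another package's internals,
    assuming (illustratively) that '_'-prefixed submodules are
    private to their parent package."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        targets = []
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        for target in targets:
            parts = target.split(".")
            # Reaching into another top-level package's private parts
            if parts[0] != own_module and any(p.startswith("_") for p in parts[1:]):
                violations.append(target)
    return violations

src = "from billing._ledger import post_entry\nimport payments.api\n"
print(boundary_violations(src, own_module="payments"))
# ['billing._ledger']
```

A check like this can run in CI, which matters because agents respond well to hard failures: a rejected import is feedback the next session actually sees.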
Test theatre. Agents write tests. They’re often quite good at it. But we observed a pattern we started calling “test theatre”: tests that exercise the code but don’t verify the actual requirements. Coverage numbers go up. The tests would pass even if the underlying behaviour were subtly wrong. In one case, we found an agent-generated test suite where 30% of assertions tested implementation details rather than business outcomes.
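The distinction is easiest to see side by side. Below is a hypothetical fee rule (invented for illustration), with one assertion that merely restates the implementation and one that pins the business requirement:

```python
def transaction_fee(amount: float) -> float:
    """Hypothetical business rule: 2% fee, capped at 10.00."""
    return min(round(amount * 0.02, 2), 10.00)

# Test theatre: asserts an implementation detail. It passes, bumps
# coverage, and would keep passing even if the cap were removed.
def test_fee_uses_two_percent():
    assert transaction_fee(100.0) == 100.0 * 0.02

# Behavioural test: verifies the requirement the business cares
# about (the cap), and fails if that rule is silently broken.
def test_fee_is_capped():
    assert transaction_fee(10_000.0) == 10.00

test_fee_uses_two_percent()
test_fee_is_capped()
```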
Knowledge islands. Different parts of the codebase begin to reflect different “understandings” of how the system works. Module A uses one approach to authentication because the agent session that built it retrieved one set of context files. Module B uses a different approach because that session retrieved different files. Neither is wrong. Both are incomplete.
Zombie abstractions. Old abstractions and interfaces remain in the codebase even after the code has moved in a different direction. Agents don’t clean up. They build new things. The old layers accumulate, creating confusion about which abstraction is canonical.
What the metrics miss
The standard engineering metrics, the ones that show up in board reports and investor updates, are all measuring flow. How fast is code moving from backlog to production?
Flow matters. But flow without coherence is just entropy moving faster.
We think engineering organisations need a parallel set of structural health metrics. We’re not proposing a specific framework (that deserves its own research), but the dimensions we tracked give a starting point:
- Pattern consistency score: across defined architectural conventions, what percentage of new code follows the established patterns?
- Cross-module defect ratio: what percentage of defects involve interactions between modules rather than issues within a single module?
- Knowledge duplication index: how often is equivalent functionality reimplemented rather than reused?
- Boundary integrity: are module boundaries being respected, or is coupling increasing?
- Review depth: are reviews catching structural issues, or only surface-level ones?
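To make the first of these concrete: a pattern consistency score can start out very crude, the share of declared convention rules each new file satisfies, averaged and trended over time. A sketch with wholly illustrative regex-based conventions (real checks would be AST-based and team-specific):

```python
import re

# Illustrative conventions, not a real team's: each architectural
# dimension maps to a regex that compliant code should match.
CONVENTIONS = {
    "error_handling": re.compile(r"raise \w+Error"),
    "data_access": re.compile(r"Repository"),
    "logging": re.compile(r"logger\."),
}

def pattern_consistency(files: dict[str, str],
                        conventions: dict[str, re.Pattern]) -> float:
    """Average, over files, of the fraction of convention rules each
    file satisfies. Crude, but the trend line is what matters."""
    if not files:
        return 1.0
    per_file = [
        sum(1 for rule in conventions.values() if rule.search(src)) / len(conventions)
        for src in files.values()
    ]
    return sum(per_file) / len(per_file)

files = {
    "consistent.py": (
        "class UserRepository:\n"
        "    def get(self, uid):\n"
        "        logger.info('fetch %s', uid)\n"
        "        raise NotFoundError(uid)\n"
    ),
    "drifted.py": "def fetch(uid):\n    print('fetch', uid)\n    return None\n",
}
print(pattern_consistency(files, CONVENTIONS))  # 0.5
```

The absolute number means little; what matters is whether it declines quarter over quarter as agent-generated code accumulates.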
None of these are exotic. Senior engineers track them intuitively. They notice when the codebase “feels wrong” even if the metrics look fine. What we’re arguing is that this intuition needs to become measurement, especially as AI agents accelerate the rate at which code enters the system.
The compounding problem
Structural degradation compounds. This is the part that makes the problem urgent rather than merely interesting.
A codebase with two retry patterns is annoying. A codebase with two retry patterns, three error handling approaches, four data access conventions, and eroding module boundaries is a system where every new change is harder to make correctly. The agent doesn’t know which pattern to follow (there are several). The reviewer doesn’t have time to enforce consistency (the queue is long). The next sprint adds another layer of inconsistency.
We saw this compounding effect clearly in our data. Teams that had been using agents for more than four months showed steeper quality degradation curves than teams in their first two months. The problem accelerates.
One engineering director told us she was starting to see a new category of work appearing in sprint planning: “architectural reconciliation.” Tasks specifically dedicated to going back and aligning the different patterns that had accumulated. She estimated this was consuming 15-20% of team capacity. Which, she pointed out, was eating into the velocity gains that justified the agent adoption in the first place.
What this means for engineering leaders
If you’re running an engineering organisation that has adopted AI coding agents, or is about to, here is what we think you need to know.
Your velocity metrics are probably telling you a true but incomplete story. Code is shipping faster. That part is real. The question is what that speed is costing you in structural terms, and whether you’re measuring the cost at all.
The fix is not to stop using agents. The gains are real, and the competitive pressure to capture them is legitimate. The fix is to measure what agents are actually doing to your codebase, not just how fast they’re filling your backlog.
Start by establishing architectural baselines before agent adoption scales. Define your patterns. Document them in a way that’s both human-readable and machine-parseable. Measure coherence quarterly, at minimum.
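As one illustration of what "human-readable and machine-parseable" might look like (the format and fields here are invented, not any real tool's schema), a conventions file checked into the repo can serve reviewers, lint tooling, and agent context retrieval alike:

```yaml
# conventions.yml -- illustrative sketch only.
# One source of truth read by humans, CI checks, and agent prompts.
retries:
  pattern: exponential-backoff-with-circuit-breaker
  rationale: "ADR-014: linear retries overloaded the payment gateway"
data_access:
  pattern: repository
  rationale: "ADR-007: no Active Record outside the admin module"
boundaries:
  payments:
    may_import: [billing.api, auth.api]
    never_import: [billing._ledger]
```

The rationale fields matter as much as the rules: they are exactly the context an agent session is missing when it makes a locally correct, globally incoherent choice.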
Invest in review practices that go beyond “does this work?” to “does this fit?” That might mean dedicating senior engineer time specifically to architectural review, separate from functional review. It might mean building tools that flag pattern deviations automatically.
And watch the cross-module defect rate. Of all the metrics we tracked, this was the earliest and most reliable signal that structural quality was degrading. If your cross-module defects are trending up while your velocity is also trending up, you’re likely in the middle of this crisis and just haven’t felt the full effects yet.
The agent quality crisis is real. It’s measurable. And for most organisations, it’s invisible, because they’re looking at the wrong dashboard.