The Verification Gap
Code review was designed for human-speed output. When agents produce ten times the volume, manual review becomes theatre. We found that reviewers catch 23% of defects in agent-generated code versus 61% in human-written code, and the gap widens as volume increases.
The assumption nobody is testing
Code review is the backbone of software quality. Two sets of eyes on every change. A reviewer who can catch what the author missed, question assumptions, flag edge cases. For forty years, this has worked well enough. The defect rate in reviewed code is dramatically lower than in unreviewed code. The practice is so embedded that most engineering teams treat it as a given: review happens, quality follows.
We wanted to know whether that assumption still holds when the code is written by AI agents.
It doesn’t.
What we measured
Over four months, we tracked review effectiveness across six engineering teams at three organisations. All six teams were using AI coding agents for production work, ranging from 30% to 70% of their total code output. We measured a simple thing: the percentage of known defects that reviewers identified during code review, separated by whether the code was human-written or agent-generated.
We seeded both categories with equivalent defects: logic errors, missing edge case handling, incorrect boundary conditions, subtle race conditions. The reviewers didn’t know which code was human-written and which was agent-generated. The defects were comparable in severity and detectability.
The results were stark.
For human-written code, reviewers caught 61% of seeded defects. This is consistent with published research on code review effectiveness, which typically lands between 55% and 70%. Nothing surprising.
For agent-generated code, reviewers caught 23%.
That’s not a marginal decline. Reviewers were catching fewer than one in four defects in agent-generated code.
Three forces driving the gap
We spent two months conducting interviews, observing review sessions, and analysing review timestamps and comment patterns. Three distinct mechanisms emerged.
Volume overwhelm
The most obvious factor. When an agent produces code at five to ten times the rate of a human developer, the review queue grows proportionally. One team we observed went from an average of 340 lines of code per review to over 1,800 lines per review after adopting agents. The number of reviewers didn’t change. The time allocated for review didn’t change. The volume did.
Reviewers adapted by skimming. Several told us they would “read the first file carefully and then pattern-match through the rest.” One senior engineer described it plainly: “I look at the structure. If the structure seems right, I approve. I don’t have time to trace every code path anymore.”
This is rational behaviour. It is also exactly the wrong response to agent-generated code, where structural coherence is precisely what agents do best, and edge-case logic is precisely where they fail.
The coherence problem
Agent-generated code looks good. This is not a trivial observation.
Human-written code carries fingerprints of its creation: inconsistent naming, varying comment density, slight stylistic differences between the code written at 9am and the code written at 4pm. These imperfections serve as attention anchors during review. A reviewer’s eye catches an inconsistency, pauses, reads more carefully. The imperfections create friction, and friction creates attention.
Agent-generated code is uniform. Naming is consistent. Style is consistent. Comments appear at regular intervals. The code reads smoothly. Reviewers reported that agent-generated code “felt like it was already reviewed.” Multiple reviewers used the word “polished” to describe code that contained serious logic errors.
We measured review time per line of code. Reviewers spent an average of 4.2 seconds per line on human-written code and 1.7 seconds per line on agent-generated code. They moved through agent code more than twice as fast, and they caught less than half as many defects.
The smoothness is a trap. It signals quality to the human brain. And the signal is false.
Assumption leakage
The third mechanism is subtler. When a reviewer knows the code was written by an agent (and in practice, they almost always know), they bring a different set of assumptions to the review. Several reviewers told us they expected the agent to have handled “the basics” correctly. Their review focused on architecture and integration, not on logic and correctness.
One reviewer put it this way: “I figure the AI has already tested the obvious stuff. I’m looking for things it wouldn’t know about, like our business rules.”
The problem is that agents are quite good at handling business rules that are documented and quite bad at handling the undocumented assumptions that experienced developers carry in their heads. The reviewer is looking in exactly the wrong place.
The fatigue curve
We also tracked review effectiveness over time. In the first two weeks of agent adoption, reviewers caught 41% of defects in agent code. By week eight, it was 19%. The gap widened as the novelty wore off and the volume became normal.
This fatigue curve is critical. The period when reviewers are most alert to agent code is also the period when they’re producing the least of it. By the time agent output dominates the codebase, the review process has already degraded.
Two teams in our study attempted to address this by adding more reviewers. It helped marginally. Going from one reviewer to two increased detection from 23% to 31%. The improvement was real but modest, and the cost in engineering time was significant. You don’t solve a structural problem by doubling the humans.
Why independence is the fix
The traditional code review model assumes a particular ratio: one author, one reviewer, human-speed output. When you change one variable (the author becomes a machine producing at machine speed), the model breaks. Patching it with more reviewers or longer review windows treats the symptom.
The structural fix is machine-speed verification by an independent agent.
We’ve been testing this approach with two of the teams from our study. A second agent, with no access to the build agent’s reasoning, intermediate steps, or draft code, receives the same specification and the final output. Its job is to verify: does this code do what the specification says? Are the edge cases handled? Do the tests cover the actual behaviour?
The independence constraint is essential. When a reviewer (human or machine) can see the author’s reasoning, they anchor to it. They follow the author’s logic and check whether the logic is internally consistent. What they miss are the cases the author never considered. An independent verifier starts from the specification, not from the code. It asks “does this meet the requirements?” rather than “does this code make sense?”
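The independence constraint can be made mechanical rather than left to discipline. The sketch below is a minimal, hypothetical illustration of the idea: the handoff type simply cannot carry the build agent's reasoning, so the verifier has nothing to anchor to. All names (`BuildResult`, `VerificationTask`, `make_verification_task`) are invented for this example, not drawn from any particular framework.

```python
from dataclasses import dataclass

@dataclass
class BuildResult:
    """Everything the build agent produces, including anchoring material."""
    specification: str      # the task the build agent was given
    code: str               # the code it produced
    reasoning_trace: str    # its chain of intermediate steps
    self_review_notes: str  # its own commentary on the code

@dataclass
class VerificationTask:
    """What the independent verifier is allowed to see.

    Deliberately has no reasoning fields: the verifier must start from
    the specification, not from the author's logic.
    """
    specification: str
    code: str

def make_verification_task(result: BuildResult) -> VerificationTask:
    # Strip everything the build agent "thought" before handing off.
    return VerificationTask(
        specification=result.specification,
        code=result.code,
    )

def verify(task: VerificationTask, verifier_agent) -> list[str]:
    """Ask an independent agent: does this code meet the spec?

    `verifier_agent` is any callable taking (spec, code) and returning
    a list of suspected defects; in practice, a separate model instance
    with its own context window.
    """
    return verifier_agent(task.specification, task.code)
```

The design choice worth noting is that the isolation lives in the type, not in a convention: code that tries to pass the reasoning trace to the verifier fails at the handoff, not in review.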
Early results from this approach show defect detection at 72%, higher than human review of human code. The verification runs in minutes, not hours. And it doesn’t fatigue.
What this means for engineering organisations
The verification gap is not a temporary problem that will resolve as reviewers get better at reading agent code. It is a structural mismatch between a human-speed quality process and a machine-speed production process. The three forces we identified (volume overwhelm, coherence bias, and assumption leakage) are inherent to the interaction between human cognition and machine-generated output. Training won’t fix them. Process changes might mitigate them. But the only structural solution is verification that operates at the same speed and scale as production.
This has implications beyond code review:
Testing. If agents generate the code, agents should generate the tests independently. A build agent writing its own tests has the same blind spots in its tests as in its code.
Architecture. Architectural decisions made by agents need verification against system-wide constraints that the agent may not have visibility into. An independent agent that holds the architectural spec and checks each change against it catches drift that no individual code review would surface.
Documentation. Agent-generated documentation is often fluent and wrong in the same ways as agent-generated code. It reads well. It may not be accurate. Verification applies here too.
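The architectural check, in particular, can start very simply. Below is a minimal sketch of the kind of rule an independent agent (or even a plain script in CI) might hold and apply to every change: a layering constraint on which modules may depend on which. The layer names and rules are invented examples, not a prescription.

```python
# Invented example of an architectural spec: each layer lists the
# internal layers it is allowed to import from.
ALLOWED_DEPS = {
    "api":        {"service"},    # api may call into service
    "service":    {"repository"}, # service may call into repository
    "repository": set(),          # repository imports nothing internal
}

def check_imports(module_layer: str, imported_layers: set[str]) -> list[str]:
    """Return one violation message per import the architecture forbids.

    An empty list means the change conforms to the layering rules.
    """
    allowed = ALLOWED_DEPS.get(module_layer, set())
    return [
        f"{module_layer} must not depend on {dep}"
        for dep in sorted(imported_layers - allowed)
    ]
```

Run against every change, a check like this catches drift incrementally, one forbidden dependency at a time, which is exactly the drift that no single code review would surface.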
The organisations that figure this out early will have a genuine advantage. They’ll be able to move at agent speed without accumulating the quality debt that agent speed typically creates. The ones that don’t will discover the gap later, in production incidents, in customer-facing defects, in the slow accumulation of a codebase that passed review but never passed scrutiny.
The uncomfortable question
We presented these findings to an engineering leadership group recently. The first question was: “So human code review is dead?”
We don’t think so. Humans remain better than machines at evaluating whether code aligns with organisational context, whether it fits the team’s mental model of the system, whether it will be maintainable by the people who will maintain it. These are judgment calls that independent verification doesn’t replace.
But the defect-catching function of code review, the part that most teams consider its primary purpose, is already failing for agent-generated code. Recognising this is step one. Building verification systems that work at machine speed is step two. The teams that skip step one will keep approving pull requests that look right and aren’t.