Design Before Diff: Why Agent Orchestration Needs a Discipline Layer
The dominant workflow for AI coding agents is prompt-and-pray: describe what you want, hope the output is correct. We studied 11 engineering teams and found that teams using structured design canvases before agent execution produced 62% fewer cross-module defects and 3x faster review cycles.
The prompt-and-pray workflow
Here is how most engineering teams use AI coding agents today. A developer opens the agent, describes what they want in natural language, and the agent generates code. The developer reviews the output, makes corrections, maybe runs a few more prompts, and eventually opens a pull request. The reviewer looks at the diff, checks for obvious issues, and merges.
We’ve started calling this the “prompt-and-pray” workflow. Not to be dismissive. To be accurate. The workflow is structurally a gamble: describe your intent in prose, hope the agent interpreted it correctly, and catch any misinterpretation during review.
For small, self-contained tasks, this works well enough. Fix this bug. Add this field. Write a test for this function. The intent is narrow, the scope is clear, and the output is easy to verify.
For anything structural (a new module, a feature that crosses service boundaries, a refactor that touches multiple systems) prompt-and-pray breaks down. The intent is too complex to capture in a prompt. The agent lacks context about the system’s constraints and conventions. And the reviewer, looking at a large diff with no specification to compare against, is left asking a fundamentally weak question: “Does this code look okay?”
We studied 11 engineering teams to understand what happens when you insert a structured design step between intent and execution. The results were significant enough to change how we think about agent orchestration entirely.
The study
Between September 2025 and February 2026, we worked with 11 engineering teams across seven organisations. All were using AI coding agents for production development. Team sizes ranged from 5 to 18 engineers.
We divided the teams into two groups based on their existing practices. Six teams used the standard prompt-and-pray workflow. Five teams had adopted what we’re calling “structured design canvases,” an explicit design step before agent execution.
The canvas implementations varied across teams, but all shared common elements: a written description of the change’s intent, explicit constraints the implementation must respect, acceptance criteria that could be verified, and references to relevant architectural decisions or existing patterns.
We tracked four outcomes across both groups over five months:
- Cross-module defect rate
- Code review cycle time (from PR open to merge)
- Rework rate (how often merged code required follow-up changes within two weeks)
- Reviewer confidence (self-reported by reviewers on a 1-5 scale)
What the numbers showed
The structured teams outperformed the prompt-and-pray teams on every metric we tracked. Not by small margins.
Cross-module defects. The structured group averaged 3.1 cross-module defects per 1,000 lines of changed code. The prompt-and-pray group averaged 8.2. A 62% reduction.
Review cycle time. The structured group averaged 2.4 hours from PR open to merge. The prompt-and-pray group averaged 7.8 hours. Just over 3x faster.
Rework rate. 8% of merged PRs in the structured group required follow-up changes within two weeks. In the prompt-and-pray group, 31%.
Reviewer confidence. Structured group reviewers reported an average confidence of 4.1 out of 5. Prompt-and-pray reviewers reported 2.6.
These are large differences. We expected to see improvement. We did not expect the margins to be this wide, particularly on review cycle time.
Why design before execution works
The numbers are clear. The more interesting question is the mechanism. Why does a structured design step produce such dramatically better outcomes when working with AI agents?
We identified four mechanisms through our interviews and analysis.
Mechanism 1: Intent becomes explicit and verifiable
In the prompt-and-pray workflow, intent lives in the developer’s head. Some of it makes it into the prompt. Some doesn’t. The agent infers the rest from patterns in its training data and whatever context it can access.
A design canvas forces intent out of the developer’s head and into a structured artefact. “Build a payment retry system” becomes:
- Retry up to 3 times using exponential backoff starting at 200ms
- Use the existing circuit breaker pattern from lib/resilience
- Log each retry attempt with the transaction ID and attempt number
- After max retries, enqueue to the dead letter queue (do not throw)
- Must not introduce a direct dependency on the payment provider SDK
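To make those constraints concrete, here is a minimal sketch of what a conforming implementation might look like. This is illustrative, not code from the study: the `Breaker` interface stands in for the team's lib/resilience circuit breaker, and the logger and dead-letter-queue client are injected stubs.

```typescript
// Sketch: payment retry satisfying the five canvas requirements above.
interface Breaker {
  exec<T>(fn: () => Promise<T>): Promise<T>; // assumed lib/resilience-style API
}

interface RetryDeps {
  breaker: Breaker;
  log: (msg: string) => void;
  enqueueDeadLetter: (txId: string) => Promise<void>;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

export async function processWithRetry<T>(
  txId: string,
  // The payment call is injected, so this module never imports the
  // provider SDK directly (requirement 5).
  attempt: () => Promise<T>,
  deps: RetryDeps,
): Promise<T | undefined> {
  const sleep =
    deps.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const maxRetries = 3; // requirement 1: up to 3 retries
  for (let n = 0; n <= maxRetries; n++) {
    try {
      return await deps.breaker.exec(attempt); // requirement 2: circuit breaker
    } catch {
      if (n === maxRetries) {
        await deps.enqueueDeadLetter(txId); // requirement 4: DLQ, do not throw
        return undefined;
      }
      deps.log(`retry tx=${txId} attempt=${n + 1}`); // requirement 3
      await sleep(200 * 2 ** n); // requirement 1: 200ms, 400ms, 800ms backoff
    }
  }
  return undefined;
}
```

Each line of the sketch maps to one line of the canvas, which is exactly what makes the reviewer's job mechanical.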
The agent now has constraints to work within. The reviewer has a specification to verify against. The difference between “does this code look okay?” and “does this code satisfy these five requirements?” is the difference between opinion and verification.
Mechanism 2: Constraints prevent architectural drift
The most expensive agent-generated defects we’ve observed are not logic bugs. They’re architectural violations. The agent introduces a new pattern where an established one exists. It creates a direct dependency where the architecture requires an abstraction layer. It puts business logic in the API handler because that’s the simplest local solution, even though the team’s convention is to keep handlers thin.
A design canvas is where constraints live. “Use the existing repository pattern.” “Do not import directly from the payments module; use the public API.” “Follow the error handling conventions in CONVENTIONS.md.” These constraints act as guardrails for the agent. Without them, the agent will choose whatever approach produces working code most efficiently, regardless of whether it fits the system’s architecture.
In our structured teams, architectural violations in agent-generated code dropped by 74% compared to the prompt-and-pray teams. The canvas didn’t just make the code better. It kept the codebase coherent.
Mechanism 3: Review becomes verification instead of exploration
Code review in the prompt-and-pray workflow is exploratory. The reviewer reads the diff, tries to understand what the code is doing, forms a mental model of whether it’s correct, and looks for things that seem off. This is cognitively expensive, slow, and inconsistent. What one reviewer catches, another misses.
When a design canvas exists, review becomes verification. The reviewer has a checklist: does the code implement exponential backoff starting at 200ms? Does it use the existing circuit breaker? Does it log with transaction ID and attempt number? Each requirement is either met or not. The review is faster, more thorough, and more consistent across reviewers.
This explains the 3x improvement in review cycle time. Reviewers in the structured group told us they spent less time per review but caught more issues. They weren’t guessing at intent. They were checking against a spec.
One reviewer put it directly: “Before the canvas, reviewing agent code felt like archaeology. I was trying to figure out what the developer wanted, then whether the agent delivered it. Now I just read the canvas and check the boxes. I’m faster and I catch more.”
Mechanism 4: The canvas is a knowledge transfer artefact
Design canvases persist after the agent session ends. They become part of the project record. When a new developer, or a future agent session, needs to understand why a module was built the way it was, the canvas provides the answer.
This turned out to be a secondary benefit we hadn’t anticipated when we started the study. Three of our structured teams reported using old canvases as context for new agent sessions. “Here’s the design canvas from when we built the original module. The new feature should be consistent with these decisions.”
The canvas becomes a form of institutional memory. Not a comprehensive one, but a targeted one: it captures the intent, constraints, and requirements for specific pieces of work. Over time, a library of canvases builds a queryable record of why the codebase looks the way it does.
What a design canvas looks like
There is no single correct format. Our five structured teams each used slightly different approaches. But the effective ones all contained these elements:
Intent statement. One or two sentences describing what the change should accomplish and why. Not “add retry logic” but “add retry logic to the payment processing pipeline so that transient failures from the provider don’t result in lost transactions.”
Constraints. Explicit boundaries the implementation must respect. Architectural patterns to follow. Modules that should not be directly imported. Performance budgets. Naming conventions.
Acceptance criteria. Specific, verifiable conditions that the implementation must satisfy. These should be testable, either manually or through automated checks. “Retries use exponential backoff” is verifiable. “Code is well-structured” is not.
Context references. Pointers to existing code, architecture decision records, or patterns that the implementation should be consistent with. “See the circuit breaker implementation in lib/resilience/breaker.ts for the pattern to follow.”
Non-goals. What this change explicitly does not do. This is surprisingly useful for agents, which tend to be helpful to the point of scope creep. “This change does not modify the existing webhook delivery system. Do not refactor existing retry logic.”
The teams that got the most value from canvases were the ones that kept them concise. A canvas is not a design document. It’s a structured brief. The effective ones we saw were between 15 and 40 lines. Enough to constrain the agent. Not so much that writing the canvas becomes a barrier to getting work done.
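Pulling those five elements together, a canvas can be as simple as a short markdown file. The section headings below are illustrative (teams in the study varied), and the content reuses the payment retry example from earlier:

```markdown
# Canvas: payment retry

## Intent
Add retry logic to the payment processing pipeline so that transient
provider failures don't result in lost transactions.

## Constraints
- Use the circuit breaker pattern from lib/resilience
- Must not introduce a direct dependency on the payment provider SDK
- Follow the error handling conventions in CONVENTIONS.md

## Acceptance criteria
- Retries use exponential backoff starting at 200ms, up to 3 retries
- Each retry attempt is logged with transaction ID and attempt number
- After max retries, the transaction is enqueued to the dead letter queue

## Context
- See lib/resilience/breaker.ts for the circuit breaker pattern to follow

## Non-goals
- Does not modify the existing webhook delivery system
- Does not refactor existing retry logic
```

Twenty-odd lines, comfortably inside the 15-40 line range the effective teams converged on.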
The overhead question
The immediate objection to structured design canvases is overhead. “If I have to write a spec before I can use the agent, doesn’t that defeat the purpose of using an agent?”
Our data says no. Clearly.
We measured time from task start to merged PR across both groups. The structured group's end-to-end time was, on average, 18% shorter than the prompt-and-pray group's, despite the structured teams spending 10-15 minutes writing a canvas before starting.
The time savings came from three places: less rework (the code was right more often on the first pass), faster reviews (verification is quicker than exploration), and fewer follow-up fixes (merged code stayed merged).
The overhead is real. But it’s more than offset by the reduction in downstream waste. This shouldn’t surprise anyone who has worked in engineering for a while. The “slow down to go fast” principle is well established. What’s new is that it applies even more strongly when the executor is an AI agent, because agents amplify both good inputs and bad ones.
The prompt-and-pray tax
We’ve started thinking about the difference between structured and unstructured agent workflows as a tax. Every team using prompt-and-pray is paying it. The tax shows up as:
- Rework cycles that eat 20-30% of agent-generated code
- Review cycles that are 3x longer than they need to be
- Architectural drift that compounds into expensive reconciliation work
- Knowledge loss, because the prompt disappears after the session but the code remains without context
Most teams don’t see the tax because they’re comparing agent-assisted velocity to their pre-agent velocity. And agents are faster, even with the tax. But the comparison misses the point. The question isn’t “are agents faster than no agents?” The question is “how much of the agent’s value are you actually capturing?”
Based on our data, prompt-and-pray teams are capturing roughly 40-50% of the potential value. The rest is lost to rework, slow reviews, and accumulated structural problems.
From workflow to discipline
We chose the word “discipline” deliberately. A design canvas is not a tool. It’s a practice. A habit that teams build into how they work.
The teams in our study that saw the largest improvements were the ones that treated the canvas as non-negotiable for any change above a certain size threshold. Bug fixes and small changes went straight to the agent. New features, refactors, and anything crossing module boundaries required a canvas first.
The threshold varied by team. Some drew the line at any task estimated above two hours. Others used a simpler rule: if the change touches more than one module, write a canvas. The specific rule mattered less than its consistent application.
What we’re arguing is that AI coding agents need a discipline layer. The agent’s capability is immense. Its judgment about how to apply that capability within a specific codebase and organisational context is limited. The discipline layer (the design canvas, the explicit constraints, the verifiable acceptance criteria) is how human judgment directs agent capability.
Without it, you get prompt-and-pray. Fast, sometimes correct, structurally corrosive over time.
With it, you get something closer to what software engineering is supposed to be: intent, design, execution, verification. In that order.
Practical starting points
For teams that want to adopt structured design canvases, we observed a few patterns that separated the teams that made it stick from the ones that tried and abandoned it.
Start with templates, not mandates. Give teams a canvas template and encourage its use. Don’t make it a gate in the CI pipeline on day one. Let people discover the value before you enforce the process.
Keep canvases in the repository. Store them alongside the code they describe. This makes them available to future agent sessions and keeps them version-controlled. We saw teams use a design/ or canvases/ directory at the project root.
Timebox canvas writing. If a canvas takes more than 15 minutes, it’s too detailed or the task isn’t well enough understood to start yet. The canvas should capture intent and constraints, not implementation specifics.
Review the canvas, not just the code. Some teams started doing canvas reviews before agent execution. A 5-minute review of the canvas by a second engineer caught constraint gaps and missing context before any code was written. This was, per hour of engineer time invested, the highest-leverage quality practice we observed in the entire study.
Use canvases as agent context. Feed the canvas to the agent as part of the prompt. The agent should know not just what to build but what constraints to respect and what patterns to follow. This is the entire point.
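Mechanically, this can be as little as prepending the canvas file to the task prompt. The sketch below assumes a hypothetical `buildAgentPrompt` helper; it is not an API from any particular agent product:

```typescript
// Sketch: prepend a stored design canvas to the task prompt so the
// agent sees constraints and non-goals before it sees the task.
export function buildAgentPrompt(canvasText: string, task: string): string {
  return [
    "You must respect the following design canvas.",
    "Treat its constraints as hard requirements and its non-goals as out of scope.",
    "",
    canvasText.trim(),
    "",
    `Task: ${task}`,
  ].join("\n");
}

// Usage, assuming canvases live in a design/ directory in the repo:
// import { readFileSync } from "node:fs";
// const canvas = readFileSync("design/payment-retry.md", "utf8");
// const prompt = buildAgentPrompt(canvas, "Add retry logic to the payment pipeline");
```

Because the canvas is version-controlled alongside the code, the same file that guided the original implementation can seed future agent sessions.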
The dominant AI coding workflow is backward. It starts with execution and hopes that intent was captured in the prompt. Reversing that order, making design explicit before the first line of code is generated, produced the largest quality improvements we’ve measured in six months of research on agent-assisted engineering.
Design before diff. It sounds obvious. In practice, almost nobody does it.