Agent Software Engineering #2 | Rethinking Code Review

This article is part of the "Agent Software Engineering" series. Previous installment: Agent Software Engineering | Stop Measuring AI's Work with Human Rulers: Agent-Native Task Estimation.

"Manual ancient-style programming died in 2025. Manual eyeball code review will die in 2026." — Ankit Jain, Latent Space, March 2, 2026

A War Doomed to Fail

In March 2026, Latent Space published an extremely provocative article titled "How to Kill the Code Review"[1]. Citing data analysis by Faros AI covering over 10,000 developers and 1,255 teams, the article presented a disturbing chart:

Chart showing AI adoption impact on PR metrics
  • Task throughput for teams with high AI adoption increased by 21%
  • PR merge rates surged by 97.8%
  • However, the median review time also skyrocketed by 91.1%

This data describes a classic production-consumption imbalance: AI has accelerated code production onto an exponential curve, while human review capacity remains stuck on a linear or even flat trajectory. These two curves are bound to intersect, and that intersection point is the system's breaking point.

GitHub's Octoverse report[2] further confirms this trend: by the end of 2025, monthly code pushes exceeded 82 million, with 43 million PRs merged, and approximately 41% of new code was AI-assisted. Meanwhile, Addy Osmani's research shows that AI-generated PRs increased in size by about 18%, with accident rates per PR rising by roughly 24% and change failure rates climbing by about 30%.

The severity of the problem lies not only in the quantity imbalance but also in the deterioration of quality. A CodeRabbit[3] analysis of 470 PRs found that AI-written code averaged 10.83 issues per PR, versus only 6.45 for human-written code. AI-generated code exhibits 1.4 to 1.7 times as many logical errors as human-written code, with an even wider gap in readability issues. And while AI code looks neat and consistent, it frequently violates the local codebase's naming conventions, architectural patterns, and idioms.

In other words: AI is generating more, larger, and harder-to-review code changes, while human review capabilities have not grown correspondingly. This is a war doomed to fail.

However, most people's reactions are misguided. They either attempt to "use AI to assist humans in reviewing" (which still fundamentally relies on humans reading code) or "use AI for initial screening with humans making final approvals" (which merely shifts the bottleneck slightly downstream). Neither approach addresses the root cause.

To truly solve this problem, we must return to first principles.

First Principles Breakdown: What Problem Is Code Review Actually Solving?

Code Review Is a Means, Not an End

Code Review was never an "end goal" but rather a means to an end. The single problem it truly aims to solve is:

"How do we ensure changes entering production won't break the system?"

However, over decades of practice, this simple objective has accumulated more and more responsibilities. Today's Code Review actually carries five entirely different functions simultaneously:

| Function | Core Question | Typical Manifestation |
| --- | --- | --- |
| Correctness Verification | Is the code doing what it should? | Reviewing logic, boundary conditions, error handling |
| Security Checks | Are there vulnerabilities? | Reviewing authentication, authorization, input validation |
| Architectural Consistency | Does it fit the system design? | Reviewing module boundaries, dependencies, design patterns |
| Knowledge Transfer | Do team members understand the codebase? | Learning others' code through review |
| Compliance & Standards | Style, naming, conventions | Code formatting, naming standards, documentation requirements |

The core insight from first principles is:

These five functions do not need to be bound to the same process, nor do they need to be tied to the specific action of "humans reading diff line-by-line."

This is similar to early mobile phones, which bundled a phone, camera, music player, and GPS navigator into one device for convenience, even though the optimal form of each function later evolved independently. Bundling these five functions into Code Review was likewise merely the lowest-cost compromise of the manual coding era.

The Collapse of Three Implicit Assumptions

Traditional Code Review rests on three implicit assumptions, all of which are simultaneously collapsing in the age of AI coding:

Assumption 1: "Reading code is the best way to verify correctness"

This assumption was never particularly robust. Human eyeball review detects defects at a rate of only about 60-70%, and review quality deteriorates sharply as diff size increases. When diffs exceed 400 lines, review effectiveness approaches zero. We simply accepted this reality because we lacked better alternatives.

But now we have superior alternatives: formal verification, property testing, fuzz testing, and contract checking. These deterministic methods far surpass human visual scanning in verification capability.
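Property testing, for instance, replaces line-by-line reading with machine-checked invariants. Below is a minimal stdlib-only sketch of the idea (a real project would use a framework such as Hypothesis); `merge_intervals` is a hypothetical stand-in for any AI-generated function under verification.

```python
import random

def merge_intervals(intervals):
    """Function under verification; imagine it was AI-generated."""
    if not intervals:
        return []
    intervals = sorted(intervals)
    merged = [intervals[0]]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def check_properties(trials: int = 500) -> bool:
    """Property testing by hand: assert invariants that must hold for
    ANY randomly generated input, instead of eyeballing the diff."""
    rng = random.Random(42)  # fixed seed keeps verification deterministic
    for _ in range(trials):
        raw = [(a, a + rng.randint(0, 5))
               for a in [rng.randint(0, 30) for _ in range(rng.randint(0, 8))]]
        out = merge_intervals(raw)
        # Property 1: output intervals are sorted and pairwise disjoint
        assert all(out[i][1] < out[i + 1][0] for i in range(len(out) - 1))
        # Property 2: every input interval is covered by an output interval
        assert all(any(s <= a and b <= e for s, e in out) for a, b in raw)
    return True
```

The reviewer never reads the implementation; they read (and challenge) the two properties.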

Assumption 2: "Code is the core asset requiring long-term maintenance"

When a module requires three months to build and six months to maintain, every line of code is an investment demanding careful scrutiny. However, when AI can rewrite an entire module in 30 minutes, the nature of code fundamentally changes. It transforms from a "long-term asset requiring meticulous maintenance" into a "disposable artifact that can be regenerated at any time."

The true long-term asset is no longer the code itself, but the Specification (Spec). This is precisely the core philosophy of the StrongDM Attractor[4] framework:

"Code must not be written by humans. Code must not be reviewed by humans."

Code becomes a compilation product of the Spec. You wouldn't line-by-line review the assembly code output by a compiler; the same logic applies here.

Just as the Obsidian team pledged in 2023: Develop with the philosophy that software is ephemeral and files are more important than applications.

"Obsidian pledges to never grow beyond 12 people, never accept venture capital, and never collect personal or analytics data. We will continue developing with the philosophy that software is ephemeral and files are more important than applications, using open and durable formats."

Obsidian team pledge screenshot

This "File over App" philosophy aligns perfectly with our concepts in the AI era. While the pledge itself is commendable, its true power lies not in "guaranteeing they will always do this," but in "even if they stop doing so in the future, you (the user) can retreat with dignity."

This is the true advantage of putting files first, and, in our context, Specs first.

Assumption 3: "Quality checks must occur after code completion"

This is a typical manifestation of traditional after-the-fact inspection thinking. First principles tell us a truth manufacturing proved long ago: quality should be built in, not inspected in.

W. Edwards Deming, the father of modern quality management, said it half a century ago: "Cease dependence on inspection to achieve quality." The software industry has embraced this mindset in CI/CD, yet Code Review remains stuck in after-the-fact approval mode.

Verification Asymmetry: The Overlooked Fundamental Nature

In my previous article on agent-native task estimation, I introduced a core concept: Human Time Anchoring Bias. Experienced developers (and LLMs trained on human content) unconsciously anchor estimates of AI tasks to human timelines.

In the realm of Code Review, there exists a completely analogous bias I call "Review Time Anchoring":

"We assume the time required to review code should be proportional to the time taken to write it. When humans spend a day writing code, they might spend two hours reviewing it, a ratio of roughly 1:4 to 1:8. But when AI generates equivalent code in 5 minutes, our intuition still tells us we should spend two hours reviewing it. This isn't an efficiency issue; it's a flawed measurement framework."

But the deeper insight is Verification Asymmetry:

The cost of verifying whether a result is correct is inherently far lower than the cost of generating that result.

This property manifests in mathematics as the P ≠ NP conjecture, in cryptography as hash verification versus collision attacks, and in software engineering as: running a set of tests (milliseconds) is far faster than writing the code being tested (hours).
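The cryptographic case can be made concrete in a few lines. This toy sketch (the "secret" preimage is of course arbitrary) shows that checking a candidate against a hash is one cheap operation, while producing a matching candidate requires searching the whole space:

```python
import hashlib

# A published commitment: cheap for anyone to verify, expensive to forge
TARGET = hashlib.sha256(b"secret-42").hexdigest()

def verify(candidate: bytes) -> bool:
    """Verification: a single hash computation per candidate, O(1)."""
    return hashlib.sha256(candidate).hexdigest() == TARGET

def search(limit: int):
    """Generation: brute-force search over the candidate space.
    Cost grows with the size of the space, not with one check."""
    for n in range(limit):
        candidate = f"secret-{n}".encode()
        if verify(candidate):
            return candidate
    return None
```

`verify(b"secret-42")` succeeds in microseconds; `search` must grind through candidates one by one. Tests sit on the cheap side of exactly this divide.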

Understanding this asymmetry clarifies the future direction of Code Review: We should not have humans (or AI) "understand" code to judge its correctness. Instead, we should build deterministic verification systems that make correctness a property machines can rapidly verify.

From Reviewing Code to Reviewing Intent: A Paradigm Shift

Shifting the Review Point Upstream: From Code Layer to Intent Layer

Traditional Code Review focuses on the code layer:

Human writes code → Human reviews code → Merge → Deploy

In the AI Coding era, the correct review point should shift upstream to the intent layer:

Human writes Spec → AI generates code → Machine verifies code meets Spec → Human accepts final result

This shift is not about "letting AI help humans review," but rather fundamentally changing the human's role in the process:

  • Old Role: Code Reviewer (Did you write this correctly?)
  • New Role: Intent Definer (Are we solving the right problem with the right constraints?)

Human judgment should not be wasted on questions like "Is the boundary condition of this for-loop correct?" which machines can verify perfectly. Instead, it should focus on higher-order questions only humans can answer: "Are we solving the right problem?", "Are the constraints complete?", "Do the acceptance criteria cover key scenarios?"

The core argument in the Latent Space article aligns perfectly with this:

"Human-in-the-loop approval moves from 'Did you write this correctly?' to 'Are we solving the right problem with the right constraints?' The most valuable human judgment is exercised before the first line of code is generated, not after."

Spec as the Control Plane

Within the Agent Software Engineering framework, we need to redefine the roles of three planes:

  • Control Plane: Spec, defining "what to do" and "what counts as correct"
  • Data Plane: Code, the compiled product of the Spec
  • Execution Plane: Agent, the executor transforming Spec into Code

In this architecture, the object of review shifts from code to Spec. Why is this reasonable? Because Specs possess several key characteristics that code lacks:

Specs have far higher information density than code. A single Spec rule like "All monetary amounts must use the Money type with two decimal places of precision" might correspond to implementations scattered across dozens of files and hundreds of lines of code. Reviewing one Spec rule is vastly more efficient than reviewing all its corresponding code implementations.

Spec verification can be automated. Once a Spec is formalized (even semi-formally), it can become an automated verification rule. Whether code satisfies the Spec can be verified through deterministic means like testing, static analysis, and contract checking, without requiring human subjective judgment.
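To make the monetary-amount rule concrete, here is one way to turn it into a machine-enforced check; the `Money` class below is an illustrative sketch (its API is my assumption, not a prescribed design) that rejects violations at construction time, everywhere, with no reviewer involved:

```python
from decimal import Decimal

class Money:
    """Encodes the Spec rule 'all monetary amounts use the Money type
    with two decimal places' as a check the machine enforces everywhere."""
    __slots__ = ("amount",)
    CENTS = Decimal("0.01")

    def __init__(self, amount):
        # Floats silently lose precision, so the Spec forbids them outright
        if isinstance(amount, float):
            raise TypeError("floats are forbidden for money; pass str or Decimal")
        d = Decimal(amount)
        if d != d.quantize(self.CENTS):
            raise ValueError(f"{d} exceeds two decimal places")
        self.amount = d.quantize(self.CENTS)

    def __add__(self, other: "Money") -> "Money":
        return Money(self.amount + other.amount)

    def __eq__(self, other) -> bool:
        return isinstance(other, Money) and self.amount == other.amount
```

One Spec rule, one type, and every one of the "dozens of files" it governs is verified for free.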

Specs change far less frequently than code. Business rules and architectural constraints evolve much slower than implementation code. This means human review workload maintains a manageable ratio relative to system complexity.

The Second Spring of BDD

Here lies an interesting historical loop. When Dan North proposed Behavior-Driven Development (BDD) in 2003, its philosophy was ahead of its time: describe expected behaviors in natural language, then automate them as tests. Yet BDD never truly gained widespread adoption, and the core reason is that in the era of manual coding, writing Specs was seen as "extra work." You were already writing the code; why write a Spec first?

This was actually one of the reasons I was initially skeptical about Spec-driven AI programming. But I realized this was a misunderstanding born from judging new changes with past experiences.

In the Agent era, the equation flips completely:

  • Old Equation: Write Spec (extra cost) + Write Code (main work) = Increased total cost
  • New Equation: Write Spec (sole human work) + AI generates code (near-zero marginal cost) = Drastically reduced total cost

The Spec is no longer "extra work"; it is the only human work. The question BDD couldn't answer, "Who writes those Specs?" now has a clear answer: The core function of human engineers is to write Specs.

Moreover, writing Specs in natural language is exactly what LLMs excel at understanding and executing. This creates a perfect division of labor:

Given the user enters the wrong password more than 5 times during login
When the user attempts to log in again
Then the account should be locked for 30 minutes
And the system should send a security notification email

Humans write the Spec. Agents implement it. The BDD framework verifies it. You don't need to read the implementation code unless verification fails.
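The lockout scenario above can be bound to executable checks. Here is a framework-free sketch (a real project would let a BDD runner such as behave or pytest-bdd map the Given/When/Then steps); `LoginService` and its return values are hypothetical stand-ins for whatever the Agent generates:

```python
import datetime as dt

class LoginService:
    """Hypothetical implementation (imagine it was AI-generated);
    only its observable behavior has to satisfy the Spec."""
    MAX_FAILURES = 5
    LOCK_MINUTES = 30

    def __init__(self):
        self.failures = 0
        self.locked_until = None
        self.notifications = []

    def login(self, password_ok: bool, now: dt.datetime) -> str:
        if self.locked_until is not None and now < self.locked_until:
            return "locked"
        if password_ok:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures > self.MAX_FAILURES:
            self.locked_until = now + dt.timedelta(minutes=self.LOCK_MINUTES)
            self.notifications.append("security-email")
        return "denied"

# Given the user enters the wrong password more than 5 times during login
svc, t0 = LoginService(), dt.datetime(2026, 3, 1, 12, 0)
for _ in range(6):
    svc.login(password_ok=False, now=t0)
# When the user attempts to log in again
result = svc.login(password_ok=True, now=t0)
# Then the account should be locked for 30 minutes
assert result == "locked"
assert svc.login(True, t0 + dt.timedelta(minutes=31)) == "ok"
# And the system should send a security notification email
assert "security-email" in svc.notifications
```

If these assertions pass, the Spec is satisfied; nobody opened the diff.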

AI Native Code Review: A Five-Layer Trust Model

Having understood the direction of the paradigm shift, what is the concrete alternative?

The answer is not finding a single "silver bullet" to replace Code Review, but building a multi-layer deterministic verification system: the Swiss Cheese Model. No single layer is perfect, but when you stack enough imperfect layers, the holes don't align.

Swiss Cheese Model diagram for AI Code Review

(This image is from a latent.space blog post, though my five-layer model differs slightly from theirs.)

Layer 0: Compile-Time Guardrails: Type Systems and Static Analysis

This is the cheapest, fastest, and most reliable layer. Type systems don't fatigue, don't miss things, and aren't intimidated by diff size.

In the AI Coding era, the value of type systems hasn't diminished; it has greatly increased. This is because the most common issues in AI-generated code are exactly what type systems excel at catching: interface mismatches, type conversion errors, and missed null handling.

Some argue that since AI can write code, programming languages don't matter anymore. I believe the opposite: programming languages are even more important because some languages come with built-in compile-time guardrails. This is one reason I choose Rust.

Practical Advice: If your project still uses dynamic typing without type annotations, now is the best time to migrate. Besides Rust, TypeScript for JavaScript and mypy for Python all provide the first line of deterministic defense for AI-generated code.
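As a small illustration of the kind of bug this layer catches (names here are mine, purely illustrative): with the `None` branch deleted from `badge_label`, a checker like `mypy --strict` rejects the function for using an `Optional[int]` as an `int`, even though the code still "looks fine" to a tired reviewer.

```python
from typing import Optional

def find_user_id(users: dict[str, int], name: str) -> Optional[int]:
    """The Optional return type makes the 'not found' case explicit,
    so the type checker forces every caller to handle it."""
    return users.get(name)

def badge_label(users: dict[str, int], name: str) -> str:
    uid = find_user_id(users, name)
    if uid is None:           # delete this branch and the type checker
        return "guest"        # rejects the function before any human review
    return f"user-{uid:04d}"
```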

Layer 1: Contract Verification: Preconditions, Postconditions, Invariants

# Contract sketch: the preconditions, postconditions, and invariant are
# written by humans; the body between them is the AI-generated implementation.
# (Shown here with plain asserts; contract libraries offer decorator syntax.)
from dataclasses import dataclass

@dataclass
class Account:
    balance: int  # minor units (cents) to sidestep float precision issues

def transfer_money(from_account: Account, to_account: Account, amount: int) -> None:
    # Preconditions
    assert amount > 0, "Amount must be positive"
    assert from_account.balance >= amount, "Insufficient funds"
    # Invariant input: total balance must be conserved
    old_from, old_to = from_account.balance, to_account.balance
    # ... AI-generated implementation code ...
    from_account.balance -= amount
    to_account.balance += amount
    # Postconditions
    assert from_account.balance == old_from - amount
    assert to_account.balance == old_to + amount
    assert from_account.balance + to_account.balance == old_from + old_to

The core value of contract verification is that it transforms "correctness" from a fuzzy concept requiring human subjective judgment into a formalized attribute that machines can precisely verify. AI can generate any implementation; as long as it satisfies the contract, we don't care about the specific implementation details.

Layer 2: BDD Acceptance Testing: Humans Define "What is Correct"

This layer is the core anchor point for human review. Humans no longer review code; they review and write acceptance criteria:

Feature: Payment Risk Control
  Scenario: Unusually large transaction triggers risk control
    Given user's historical monthly average spending is 5000 yuan
    When user initiates a single transaction of 50000 yuan
    Then the transaction should be temporarily frozen
    And the system should send an SMS verification
    And the transaction should appear in the manual approval queue
  Scenario: Normal spending does not trigger risk control
    Given user's historical monthly average spending is 5000 yuan
    When user initiates a single transaction of 3000 yuan
    Then the transaction should complete immediately

These acceptance criteria are written by humans; they are the core artifacts that need "Review." However, reviewing Specs is an order of magnitude more efficient than reviewing Code. This is because Specs are in business language understandable by humans, not implementation details.

Layer 3: Adversarial Multi-Agent Verification

This layer introduces a unique advantage of Agent Software Engineering. A core problem with traditional Code Review is that the person writing the code and the person reviewing it share the same cognitive biases. When the reviewer knows the implementation intent, they often unconsciously "fill in" the code's correctness.

Multi-agent adversarial verification eliminates this problem through architectural design:

┌─────────────────────────────────────────────────┐
│                  Arbiter Agent                  │
│      (Synthesizes all signals, makes final call)│
└──────────┬───────────────┬────────────────┬─────┘
           │               │                │
     ┌─────▼──────┐  ┌─────▼──────┐  ┌──────▼──────┐
     │ Blue Agent │  │ Red Agent  │  │ Audit Agent │
     │(Implements)│  │  (Breaks)  │  │(Independent)│
     └────────────┘  └────────────┘  └─────────────┘
  • Blue Agent: Implements functional code
  • Red Agent: Attempts to break the code, generating adversarial test cases and boundary conditions
  • Audit Agent: Independently checks security, performance, and compliance without knowledge of the implementation process
  • Arbiter Agent: Synthesizes all signals to make the final judgment

The key design principle is separation of concerns + mutual distrust. This mirrors traditional financial auditing logic: the entity doing the books and the entity auditing them must be different. In the Agent world, we use different model instances, different system prompts, and different context windows to ensure review independence.

Here's an interesting technique mentioned in a Latent Space Engineering blog post: set up a competitive framework for Review Agents, telling each sub-agent that the one finding the most legitimate issues gets a "reward." This leverages the characteristic of LLMs being more meticulous under competitive prompting.
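An arbiter over competing agents can then reduce to a small, auditable decision policy. The sketch below is illustrative (the severity scale, thresholds, and verdict names are my assumptions, not a prescribed design); what matters is that findings arrive from agents with separate prompts and contexts, so no single bias can approve itself:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    agent: str        # "blue", "red", "audit", ...
    severity: int     # 1 (nit) .. 5 (blocker)
    description: str

def arbitrate(findings: list[Finding],
              block_at: int = 4, human_at: int = 10) -> str:
    """Illustrative arbiter policy: any blocker-level finding from any
    independent agent rejects the change outright; an unusually long
    list of minor findings escalates to a human; otherwise approve."""
    if any(f.severity >= block_at for f in findings):
        return "reject"
    if len(findings) >= human_at:
        return "needs-human"
    return "approve"
```

The policy itself, a dozen lines, is what a human reviews, not the code the agents argued over.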

Layer 4: Permission Sandboxing: Architecting the Principle of Least Privilege

Most Agent frameworks handle permissions in an all-or-nothing manner. An Agent either has shell access or it doesn't. But granularity is crucial.

# Task-level permission definition
task: fix-date-parsing-bug
permissions:
  files:
    read: ["src/utils/dates.py", "tests/test_dates.py"]
    write: ["src/utils/dates.py", "tests/test_dates.py"]
  network: deny
  env_vars: deny
  escalation_triggers:
    - pattern: "auth|authentication|authorization"
      action: require_human_review
    - pattern: "schema.*migration|ALTER TABLE"
      action: require_human_review
    - pattern: "dependency.*add|requirements.*txt"
      action: require_human_review

The value of the permission sandbox is that even if every previous layer fails, the Agent cannot touch what it shouldn't. This is the last substantive barrier in defense in depth.
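Enforcing such a file amounts to a small gate evaluated before every proposed write. Here is a Python mirror of the permission definition above (the function name and return values are illustrative):

```python
import re

# Mirrors the task-level permission file: allowed paths and
# escalation triggers copied from the YAML above
ALLOWED_WRITES = {"src/utils/dates.py", "tests/test_dates.py"}
ESCALATION_TRIGGERS = [
    (re.compile(r"auth|authentication|authorization", re.I), "require_human_review"),
    (re.compile(r"schema.*migration|ALTER TABLE", re.I), "require_human_review"),
    (re.compile(r"dependency.*add|requirements.*txt", re.I), "require_human_review"),
]

def gate_write(path: str, diff_text: str) -> str:
    """Decide the fate of one proposed write before it happens."""
    if path not in ALLOWED_WRITES:
        return "deny"  # outside the task's sandbox, regardless of content
    for pattern, action in ESCALATION_TRIGGERS:
        if pattern.search(diff_text):
            return action  # sensitive content: a human must sign off
    return "allow"
```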

Layer 5: The Final Line of Defense in Production: Observability + Fast Rollback

Even if the previous five layers all fail, the system still needs to quickly detect and fix issues in the production environment:

  • Canary Releases: Validate new changes on 1% of traffic first
  • Real-time Observability: Real-time monitoring of error rates, latency, and resource consumption
  • Automatic Rollback: Anomalous metrics trigger second-level automatic rollbacks
  • Feature Flags: Any new feature can be turned off without deploying new code

The philosophy of this layer acknowledges a reality: No matter how perfect your verification system is, bugs will enter production. Traditional Code Review attempts to eliminate all bugs before deployment, which is unrealistic in the AI era.

The correct strategy is: Minimize the impact scope of bugs and maximize repair speed.
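The canary-plus-rollback loop reduces to a simple gate evaluated continuously against live metrics. The thresholds below are illustrative, not recommendations:

```python
def should_rollback(baseline_error_rate: float,
                    canary_error_rate: float,
                    canary_requests: int,
                    min_requests: int = 200,
                    tolerance: float = 2.0) -> bool:
    """Roll back automatically when the canary's error rate exceeds
    the baseline by more than `tolerance`x, but only once enough
    traffic has been observed to trust the signal."""
    if canary_requests < min_requests:
        return False  # not enough data yet; keep watching
    return canary_error_rate > baseline_error_rate * tolerance
```

Running this check every few seconds against the 1% canary slice is what turns "bugs will reach production" from a catastrophe into a blip.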

Estimation Paradigm for AI Native Review

In previous articles on AI Native Task Estimation, we proposed two core measurement units:

  • Rounds: Iteration count for single-agent sequential execution
  • Waves: Batch count for multi-agent concurrent execution

For Code Review, we can redefine the cost metrics of Review using the same framework:

Traditional Metrics (Obsolete)

Review Cost = Number of People × Time per Person Reviewing
            ≈ 2 people × 1.5 hours = 3 person-hours / PR

This metric assumes Review is a labor-intensive activity whose cost is proportional to code volume, an assumption that completely fails in the AI era.

AI Native Metrics

Verification Cost = Σ (Verification Waves per Layer × Compute Cost per Wave)
Wave 0: Static Analysis + Type Check      → ~10s, zero human
Wave 1: Contract Verification + Unit Test → ~30s, zero human
Wave 2: BDD Acceptance Test               → ~2min, zero human
Wave 3: Multi-Agent Adversarial Review    → ~5min, low token cost
Wave 4: Human Spec/Result Acceptance      → Only when necessary, ~15min

Note the key characteristics of this structure:

  1. Increasing Cost: Later layers cost more but trigger less frequently
  2. Determinism First: The first three layers are completely deterministic, involving no human judgment
  3. Human Fallback: Humans intervene only at the last layer, and only when previous layers cannot confirm
  4. Decoupled from Code Volume: Whether AI generates 100 lines or 10,000 lines, the cost of Wave 0-3 remains almost unchanged

This solves the fundamental contradiction described in the Latent Space article: "if the code review takes longer than the AI took to write the feature, the math doesn't make sense to higher-ups." In the new paradigm, verification cost maintains a reasonable ratio to generation cost.
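The structure above can be written down as a toy cost model (wave durations copied from the breakdown above; the fallback rate is the fraction of changes escalated to a human):

```python
# Machine wave costs in seconds of wall-clock time; token costs omitted
MACHINE_WAVES = {
    "static-analysis": 10,
    "contracts-and-unit-tests": 30,
    "bdd-acceptance": 120,
    "multi-agent-review": 300,
}
HUMAN_FALLBACK_SECONDS = 15 * 60  # ~15 min, only when escalated

def expected_cost_seconds(human_fallback_rate: float) -> float:
    """Expected verification cost per change. Note that no term depends
    on lines of code: a 100-line and a 10,000-line diff cost the same
    through Waves 0-3."""
    machine = sum(MACHINE_WAVES.values())
    return machine + human_fallback_rate * HUMAN_FALLBACK_SECONDS
```

At a 10% fallback rate the expected cost is about 9 minutes per change, regardless of diff size, which is what restores a sane ratio between generation and verification.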

Migration Path: From Traditional to AI Native

Paradigm shifts don't happen overnight. Here is a gradual migration path:

Phase 1: Augment

Introduce automation layers into existing Code Review processes:

  • Deploy AI Code Review tools (CodeRabbit, Graphite Diamond, etc.) as the first screening layer
  • Add static analysis and contract checks in CI
  • Reviewers focus only on high-level issues not covered by AI tools

Human Role: Still the final approver of code, but workload reduced by 30-50%.

Phase 2: Separate

Explicitly separate the five functions of Review, handling each with optimal means:

  • Compliance & Standards → Automated linters (zero human)
  • Correctness Verification → Contract checks + BDD tests (zero human)
  • Security Checks → Dedicated security scanning + adversarial testing (zero human)
  • Architectural Consistency → AI Agent review + architectural constraint config (low human)
  • Knowledge Transfer → Auto-generated codebase docs + change summaries (low human)

Human Role: Shift from "reviewing all code" to "reviewing architectural decisions and exceptions."

Phase 3: Elevate

Completely shift the human review point from the code layer to the Spec layer:

  • Establish a Spec-driven development process
  • All new features start with BDD Specs, not code
  • AI Agents generate code under Spec constraints with multi-layer automatic verification
  • Humans only review Specs and final acceptance results

Human Role: Definer of intent and acceptor of final results, no longer touching code diffs.

Phase 4: Autonomize

Build a fully autonomous verification pipeline:

  • Multi-agent adversarial verification replaces all human review
  • Humans intervene only under specific trigger conditions (security-sensitive, architectural changes, anomalous metrics)
  • The system possesses self-healing capabilities—automatically generating fixes, verifying, and deploying upon detecting issues

Human Role: Architect of the system and handler of exceptions, absent from daily workflows.

Unresolved Issues

To be honest, this paradigm shift still has several key issues without good answers yet:

Issue 1: Who Reviews the Spec?

We shifted the review point from code to Spec, but Specs themselves can contain errors. Writing a complete Spec is arguably no simpler than writing code. As a reader pointed out in the Latent Space comments, "advocates of spec-driven development are too naive about the difficulty of writing complete Specs." This is a real challenge.

However, there is a key difference between Spec errors and code errors: Spec errors usually expose themselves during acceptance testing (because system behavior doesn't match expectations), whereas code-level bugs might look perfectly correct from a Spec perspective but introduce subtle issues in implementation. The feedback loop for Spec errors is shorter, which partially mitigates this problem.

Issue 2: Knowledge Boundaries of Agents

Current AI Agents lack deep understanding of business context. They don't know what decisions you made in last week's client meeting or that your product roadmap just pivoted. As Greg Foster of Graphite said, "Real code review demands domain expertise."

The solution direction might be better context injection mechanisms, structuring business decisions, Architectural Decision Records (ADR), and product roadmaps into context consumable by Agents. However, current context windows and retrieval-augmented technologies are not yet sufficient to fully solve this. As memory systems mature, I believe this will soon cease to be a problem.

Issue 3: Accountability, Who Takes the Blame?

If AI-generated code causes a security incident, who is responsible? Traditional Code Review has a simple but effective mechanism: behind the Approve button is a real engineer who takes responsibility for that decision.

In a fully automated verification pipeline, accountability becomes blurred. This is not just a technical issue but an organizational and legal one. Before this is resolved, completely eliminating human approval is impossible in many organizations, especially in regulated industries like finance, healthcare, and aviation.

However, if the answer to Issue 1 is that human developers are responsible for reviewing Specs, then Issue 3 is naturally resolved. And Issue 2 will be solved quickly as the AI ecosystem evolves.

Conclusion: Not Reading Faster, But Not Needing to Read

Let's return to the opening data: PR merge rates surged 97.8%, and review time skyrocketed 91.1%.

Faced with this data, the wrong reaction is "we need faster Review tools." The correct reaction is "we need to rethink the very form of Review itself."

Code Review is not even a decades-old tradition. The Latent Space article reminds us that code review only became truly widespread between 2012 and 2014. Before that, many software teams shipped successfully without line-by-line review. What did they rely on? Testing, incremental releases, and fast rollbacks: mechanisms that remain effective today.

Code Review checkpoints have shifted before. We moved from waterfall sign-offs to continuous integration. We can shift again: from reviewing code to reviewing intent, from after-the-fact inspection to built-in quality, from human eyeball scanning to deterministic verification.

The core formula of Agent Software Engineering is validated once again here:

Spec is the Control Plane, Code is the Data Plane, Agent is the Execution Plane.

The future of Review is not "AI helping humans read code," but "humans not needing to read code at all."

Humanity's most precious cognitive resources—judgment, creativity, and deep business understanding—should be used to define "what is correct" (Spec), not to check "is this code written correctly" (Review).

The future is: Fast release, comprehensive observability, ultra-fast rollback.

Not: Slow review, missed bugs, debugging in production.

We cannot surpass machines in reading speed. We must surpass them in the quality of thinking, upstream, where decisions truly matter.


References

