Agent Software Engineering #4 | When the Agent Writes the Code, Who Says 'Merge Ready'?

In the first two articles of this series, we deconstructed two critical issues: traditional Code Review workflows collapse when Agents generate dozens of PRs daily (#2); and Git's mental model becomes a bottleneck when multiple Agents work concurrently (#3).

These two problems point to the same fundamental gap: after an Agent finishes coding but before the code is merged into the main branch, there is no deterministic, machine-executable quality gate.

In traditional workflows, this gate is a human reviewer. They open the PR, read hundreds of lines of diff, simulate the code's behavior mentally, judge "Is this code correct?", and then click Approve. This process relies on human experience, attention, and time—three resources that are not scalable in the Agent era.

CI tests serve as a machine gate, but they only check "whether existing tests pass," not "whether the Agent's implementation fulfills the intent of this specific task." An Agent can make all existing tests pass while implementing a completely wrong function. This happens because no one told the CI "what the intent of this task is."

This is the problem agent-spec solves.

agent-spec is not a "better Code Review tool"; it is a different paradigm. It shifts the object of review from code to contracts, moves the review timing from after coding to before coding, and changes the executor of verification from humans to machines.

I. Displacement of the Review Point

The core philosophy of agent-spec[1] can be summarized in one sentence: Shift the human review point from "after the code is written" to "before the code is written."

In traditional workflows, human attention is allocated roughly as follows:

Write Issue (10%) → Agent Codes (0%) → Read Code Diff (80%) → Click Approve (10%)

Spending 80% of human time on "reading code diffs" is inefficient. What reviewers look for in diffs essentially falls into two categories: "Did the Agent do it right?" (functional correctness) and "Did the Agent do something it shouldn't have?" (overstepping boundaries). The former requires understanding business intent, and the latter requires understanding code boundaries. Both are implicit in the natural language description of the Issue and are not formalized.

agent-spec changes this allocation to:

Write Contract (60%) → Agent Codes (0%) → Read Explain Summary (30%) → Click Approve (10%)

Human work shifts from "reading code" to "writing Contracts." This involves defining the task's intent, technical decisions, file boundaries, and acceptance criteria in a structured way. This is a higher-value activity: You are defining "what is right" rather than guessing "is this right?" within a code diff.

After the code is written, humans no longer see hundreds of lines of diff, but a Contract-level execution summary: What intent did the Contract define? Which files did the Agent modify? Was it within the allowed scope? Did all bound tests pass? If all answers are "yes," approval becomes a 5-second operation.

II. Task Contract: The Four Elements of the Control Plane

The core abstraction of agent-spec is the Task Contract. It is a structured task specification containing four elements.

First Element: Intent

Not a vague Issue title, but a focused description. It explains what the task does, why it's being done, and in what context. Intent provides direction for the Agent and gives the reviewer a basis for judging "Is the Contract defined correctly?" at the end.

## Intent
Add a user registration endpoint to the existing auth module.
New users register with email + password; a verification email
is sent on success. This is the first step of the user system.

Second Element: Decisions

Technical choices that are already made. These are not "suggestions" or "considerations," but non-negotiable decisions. The Agent follows them without needing to choose a technical solution. This eliminates the stage where Agents are most prone to deviation: technology selection.

## Decisions
- Route: POST /api/v1/auth/register
- Password hash: bcrypt, cost factor = 12
- Verification token: crypto.randomUUID(), stored in DB, 24h expiry
- Email: use existing EmailService, do not create a new one

Third Element: Boundaries

What the Agent can and cannot modify. Path-level boundaries are mechanically enforced by agent-spec's BoundariesVerifier. If an Agent modifies a file not in the Allowed Changes list, verification fails immediately.

## Boundaries
### Allowed Changes
- crates/api/src/auth/**
- crates/api/tests/auth/**
- migrations/
### Forbidden
- Do not add new dependencies
- Do not modify the existing login endpoint

The significance of this design is: The Agent's behavioral boundaries are no longer constrained by a probabilistic prompt like "Please do not modify other files," which the Agent may or may not follow. Boundary verification is deterministic; modifying a forbidden file results in a fail, with no gray area.
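To make the determinism concrete, here is a minimal sketch of such a boundary check. This is not agent-spec's actual implementation: the matcher below only understands a trailing `/**` glob and directory prefixes, where a real verifier would use a full glob library.

```rust
// Simplified boundary matching: supports only a trailing "/**" glob,
// a directory prefix ending in "/", or an exact path.
pub fn matches_pattern(pattern: &str, path: &str) -> bool {
    if let Some(prefix) = pattern.strip_suffix("/**") {
        path.starts_with(&format!("{}/", prefix))
    } else if pattern.ends_with('/') {
        path.starts_with(pattern)
    } else {
        path == pattern
    }
}

/// Returns every modified file that falls outside all Allowed Changes
/// patterns. A non-empty result means a deterministic fail.
pub fn boundary_violations<'a>(allowed: &[&str], changed: &[&'a str]) -> Vec<&'a str> {
    changed
        .iter()
        .filter(|path| !allowed.iter().any(|pat| matches_pattern(pat, path)))
        .copied()
        .collect()
}
```

With `allowed = ["crates/api/src/auth/**", "migrations/"]`, a change to `crates/core/src/lib.rs` is reported as a violation, with no gray area for the Agent to argue about.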

Fourth Element: Completion Criteria

BDD-style acceptance scenarios, where each scenario explicitly binds to a test function. This is not ordinary BDD; ordinary BDD requires a step definition layer to map natural language to code. agent-spec skips this intermediate layer, binding directly to existing test names:

## Completion Criteria
Scenario: Successful registration
  Test: test_register_returns_201_for_new_user
  Given no user with email "alice@example.com" exists
  When client submits the registration request
  Then response status should be 201
Scenario: Duplicate email rejected
  Test: test_register_rejects_duplicate_email
  Given a user with email "alice@example.com" already exists
  When client submits the same email for registration
  Then response status should be 409

There is a key writing principle here: the number of exception path scenarios should be greater than or equal to the number of normal path scenarios. In the example above, if only "Successful registration returns 201" were written, the Agent could implement it any way it liked: no input validation, no handling of duplicate emails, no password strength checks. Those edge cases were simply never defined as acceptance criteria in the Contract.

Every additional exception scenario covers one more boundary condition the Agent might ignore. The process of writing a Contract forces you to think through all necessary handling cases before the Agent starts working.
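The direct test binding is what makes these scenarios mechanically checkable. As a rough illustration of the idea (not agent-spec's real parser, which builds a full AST and also understands the Chinese keywords), extracting the bound test names can be as simple as:

```rust
/// Collect the test names bound via "Test:" lines in a Completion
/// Criteria block. Sketch only: English keyword, no AST, no validation.
pub fn bound_tests(criteria: &str) -> Vec<String> {
    criteria
        .lines()
        .filter_map(|line| line.trim().strip_prefix("Test:"))
        .map(|name| name.trim().to_string())
        .collect()
}
```

Each extracted name is then handed to the test runner; a scenario whose bound test does not exist cannot silently count as passed.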

agent-spec's DSL supports keywords in both Chinese and English:

场景:重复邮箱被拒绝
  测试:test_register_rejects_duplicate_email
  假设 已存在邮箱为 "alice@example.com" 的用户
  当 客户端提交相同邮箱的注册请求
  那么 响应状态码为 409

This is not merely decorative bilingualism; it allows Chinese-speaking teams to write and read Contracts in their native language, lowering the cognitive threshold for Spec writing.

III. The Seven-Step Workflow

Once the Contract is written, the entire process becomes a clear assembly line where three roles perform their respective duties:

Step 1: Human Writes Contract

agent-spec init --level task --lang en --name "User Registration"

Generates a template for humans to fill in the four elements. This is the step where human attention is most concentrated: roughly 60% of human time is spent here.

Of course, AI can assist in writing the Contract here, but the Contract's DSL is designed to be easy for human developers to review and modify.

Step 2: Contract Quality Gate

agent-spec lint specs/user-registration.spec --min-score 0.7

Before handing the Contract to the Agent, check the quality of the Contract itself. agent-spec has 8 built-in linters that detect:

  • Vague verbs ("handle", "manage")
  • Unquantified constraints ("should be fast" instead of "< 200ms")
  • Assertions that cannot be mechanically verified ("interface should be beautiful")
  • Non-deterministic phrasing ("approximately")
  • Scenarios lacking test bindings
  • Language that might induce Agent Sycophancy (phrasing like "find all bugs" encourages the Agent to fabricate problems to please you)

If the quality score is below the threshold, fix the Contract before proceeding. This is the first step of "shifting the review point upstream." Ensure the Contract itself is high-quality before the Agent starts working.

Step 3: Agent Reads Contract and Codes

agent-spec contract specs/user-registration.spec

Outputs a structured prompt fragment. The Agent is constrained by the Contract in three ways: Decisions tell it "how to do it," Boundaries tell it "what it can touch," and Completion Criteria tell it "what counts as done."

Step 4: Lifecycle Verification (Automatic Retry Loop)

agent-spec lifecycle specs/user-registration.spec \
  --code . --change-scope worktree --format json

This is the core of the entire process. The lifecycle runs four layers of verification:

L1 Structural Verifier: Performs code pattern matching against the Contract's "Must Not" constraints. If the Contract says "Forbidden to use .unwrap()", the verifier scans all source files for .unwrap() calls. Zero token cost, deterministic result.

L2 Boundaries Verifier: Checks if the files actually modified by the Agent are within the Allowed Changes scope of the Contract. Path glob matching, zero token cost, deterministic result.

L3 Test Verifier: Runs the tests bound in the Completion Criteria. Each Test: test_register_returns_201 is actually executed. This is the most important layer. It directly answers "Did the Agent's code meet the acceptance criteria defined in the Contract?"

L4 AI Verifier: For scenarios not covered by the first three layers, it provides AI analysis (currently supports stub mode and caller mode). The AI's judgment is not pass or fail, but uncertain; it does not claim certainty and requires final human judgment.

If the lifecycle fails, the Agent receives a structured failure_summary with precise reasons and evidence for each failed scenario. The Agent fixes the code based on this and re-runs the lifecycle. This retry loop can proceed automatically without human intervention:

Code → lifecycle → FAIL (2/5) → failure_summary → fix → lifecycle → PASS (5/5) ✓

agent-spec's run log records this process: "This Contract took 3 runs to pass."
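The shape of that retry loop can be sketched as follows. `run_lifecycle` and `apply_fix` are stand-ins (assumptions for illustration) for the real CLI invocation and the Agent's fix step:

```rust
/// Outcome of one lifecycle run: how many scenarios failed out of the total.
pub struct RunReport { pub failed: usize, pub total: usize }

/// Drive lifecycle -> failure_summary -> fix -> lifecycle until everything
/// passes or attempts run out. Returns the number of runs it took, mirroring
/// the run log's "this Contract took N runs to pass".
pub fn retry_until_pass<L, F>(mut run_lifecycle: L, mut apply_fix: F, max_attempts: usize) -> Option<usize>
where
    L: FnMut() -> RunReport,
    F: FnMut(&RunReport),
{
    for attempt in 1..=max_attempts {
        let report = run_lifecycle();
        if report.failed == 0 {
            return Some(attempt);
        }
        apply_fix(&report); // Agent reads the failure_summary and fixes code
    }
    None
}
```

The important property is that the loop is driven entirely by deterministic signals, so it needs no human in the middle.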

Step 5: Guard Gate

agent-spec guard --spec-dir specs --code . --change-scope staged

A pre-commit hook or CI check. Runs lint + verify on all spec files under specs/. If any spec fails, the commit is blocked or the PR fails.

(Screenshot: CI failure interception. This project also uses agent-spec for self-bootstrapping.)

Step 6: Contract Acceptance

agent-spec explain specs/user-registration.spec --format markdown

This is the final human review point. The Reviewer sees not a code diff, but a Contract-level execution summary: What was the intent? What decisions were made? Which files were modified (and was it within allowed scope)? What is the pass/fail status of all acceptance criteria?

The Reviewer needs to answer two questions: Is the Contract definition correct? Did all verifications pass? If both are "yes," approve.

If more context is needed, you can view the execution history:

agent-spec explain specs/user-registration.spec --history

"This task took the Agent 3 tries to pass" indicates that the Contract's acceptance criteria are indeed working.

Step 7: Stamp Archiving

agent-spec stamp specs/user-registration.spec --dry-run

Establishes a traceability chain from Contract → Commit via Git commit trailers. In the future, if someone asks "Why was this code written this way?", they can trace back from the trailer to the Contract, and see the complete intent and decisions within the Contract.

IV. Four Verdicts, Zero Ambiguity

agent-spec verification results have exactly four semantic meanings, no more, no less:

pass: The scenario is confirmed passed by a deterministic verifier or AI verifier.

fail: The verifier found a specific violation in the scenario, accompanied by evidence (code snippets, test output, pattern matching results).

skip: No verifier covered this scenario. This usually means the bound test does not exist or the selector configuration is incorrect.

uncertain: The AI verifier analyzed the code but could not make a deterministic judgment, accompanied by structured reasoning and confidence levels.

The most important rule is: skip ≠ pass. A scenario that was not verified is not equal to a scenario that passed verification. is_passing requires failed == 0 AND skipped == 0 AND uncertain == 0. Only when all scenarios are deterministically verified as pass does the entire Contract count as passed.

This four-value semantics is more honest than the pass/fail binary system of most tools. It acknowledges a reality: Some things we can verify deterministically, some things we can only judge probabilistically, and some things we haven't covered at all. Distinguishing these three states, rather than lumping them all into "pass," is the foundation of building trust.
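The rule stated above is simple enough to write down directly. A minimal sketch of the four-value semantics and the `is_passing` gate:

```rust
/// Exactly four verdicts, no fifth state, no collapsing.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Verdict { Pass, Fail, Skip, Uncertain }

/// The rule from the text: is_passing requires
/// failed == 0 AND skipped == 0 AND uncertain == 0.
/// skip != pass, and uncertain blocks the gate until a human resolves it.
pub fn is_passing(verdicts: &[Verdict]) -> bool {
    let failed = verdicts.iter().filter(|v| **v == Verdict::Fail).count();
    let skipped = verdicts.iter().filter(|v| **v == Verdict::Skip).count();
    let uncertain = verdicts.iter().filter(|v| **v == Verdict::Uncertain).count();
    failed == 0 && skipped == 0 && uncertain == 0
}
```

A report of `[Pass, Pass, Skip]` does not pass: an unverified scenario is treated as a hole in the gate, not as a success.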

V. AI Verifier: Two Modes, One Protocol

In the four-layer verification pyramid, the first three layers (Structural, Boundaries, Test) are deterministic—zero token cost, zero false negatives. However, many quality dimensions (race conditions, resource leaks, implicit security vulnerabilities) fall outside the scope of these three layers.

The AI Verifier fills this gap, but its design philosophy differs from traditional AI Code Review tools: It does not pretend to make deterministic judgments. Its output is always uncertain, accompanied by structured analysis and confidence. Humans (or higher-level verification systems) make the final ruling.

agent-spec supports two AI verification modes:

Caller Mode: The calling Agent (e.g., Claude Code) performs the AI verification itself. This is the default path for tool-first scenarios. After the lifecycle runs mechanical verification, it generates structured AiRequests for uncovered scenarios, writing them to .agent-spec/pending-ai-requests.json. The calling Agent (Claude Code, Codex, etc.) reads the request, analyzes the code with its own reasoning capabilities, generates an AiDecision, and merges it back into the final report via the resolve-ai command.

# Step 1: lifecycle discovers skipped scenarios, generates AI request
agent-spec lifecycle specs/task.spec --code . --ai-mode caller
# Step 2: Agent reads request, analyzes code, generates decision file
# Step 3: Merge AI decision into final report
agent-spec resolve-ai specs/task.spec --decisions decisions.json

In this mode, the AI verification output flows to humans rather than into the automated process. The Agent's retry loop is based on deterministic signals (test pass/fail), leaving the AI's analysis for humans to judge during the Contract Acceptance phase.

Backend Mode: Injects an independent AI backend via the Rust API. Suitable for orchestration systems (Symphony-like orchestrators) using different models for independent verification. The host program implements the AiBackend trait, injects it into spec-gateway, and AI verification is completed internally within agent-spec.

let report = gw.verify_with_ai_backend(code, Arc::new(my_backend)).unwrap();

Both modes share the same data structures (AiRequest / AiDecision), keeping agent-spec provider-agnostic.
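The injection pattern of Backend Mode can be sketched as follows. The trait surface and field names here are assumptions for illustration; agent-spec's real `AiBackend` trait and `AiRequest` / `AiDecision` schemas may differ:

```rust
use std::sync::Arc;

/// Hypothetical shapes illustrating the shared protocol.
pub struct AiRequest { pub scenario: String }
pub struct AiDecision { pub scenario: String, pub confidence: f64, pub analysis: String }

/// The host program implements this trait and injects it; agent-spec
/// stays provider-agnostic because it only sees the trait.
pub trait AiBackend: Send + Sync {
    fn analyze(&self, req: &AiRequest) -> AiDecision;
}

/// A stub backend that always answers with zero confidence: the honest
/// default when no real model is wired in.
pub struct StubBackend;
impl AiBackend for StubBackend {
    fn analyze(&self, req: &AiRequest) -> AiDecision {
        AiDecision { scenario: req.scenario.clone(), confidence: 0.0, analysis: "no model attached".into() }
    }
}

pub fn run_backend(backend: Arc<dyn AiBackend>, reqs: &[AiRequest]) -> Vec<AiDecision> {
    reqs.iter().map(|r| backend.analyze(r)).collect()
}
```

Caller Mode uses the same request/decision shapes, but serialized to JSON and resolved by the calling Agent instead of an injected trait object.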

VI. Integration with jj: Optional Accelerator, Not a Dependency

In the third article of the Agent Software Engineering series, we detailed the Agent-friendly features of jj (Jujutsu): everything is a commit (no staging area rituals), stable change IDs (invariant across amends), conflicts as data (doesn't block rebase), and operation logs (atomic rollback).

The integration point between agent-spec and jj is natural, but the design principle is jj as an optional accelerator, not a necessary dependency. All core functions work fully in a pure Git environment.

Change Scope: Unified Change Discovery

# Git environment
agent-spec lifecycle specs/task.spec --code . --change-scope staged
agent-spec lifecycle specs/task.spec --code . --change-scope worktree
# jj environment
agent-spec lifecycle specs/task.spec --code . --change-scope jj

In Git, agent-spec needs three commands (staged + unstaged + untracked) to get the complete list of changed files. In jj, one command suffices: jj diff --name-only, because jj has no staging area to distinguish.

The BoundariesVerifier doesn't care where the file list comes from—it only checks "Are these modified files within the Contract's Allowed Changes scope?"

Stamp: Stable Change ID Tracking

In a jj environment, the stamp command additionally outputs a Spec-Change trailer:

Spec-Name: User Registration API
Spec-Passing: true
Spec-Summary: 4/4 passed
Spec-Change: kkmpptqz

kkmpptqz is jj's change ID, which remains stable across amends. Even if the Agent subsequently modifies this commit (formatting code, fixing typos), the change ID remains unchanged, so the Contract → Change traceability chain does not break. Git's commit hash, by contrast, changes on every amend, which severs that chain.

Run History: Cross-Run Diff

agent-spec explain specs/task.spec --history

If the run logs between two lifecycle runs both contain jj operation IDs, agent-spec will call jj op diff to show which files the Agent modified between the two runs. This answers a valuable debugging question: "Last time it failed, this time it passed; what exactly did the Agent change in between?"

Detection, Not Configuration

agent-spec automatically detects the presence of the .jj/ directory. If present, it utilizes jj's capabilities; if not, it falls back to Git. Users do not need to declare "I am using jj" in any configuration file. In colocated repositories (where both .git/ and .jj/ exist), agent-spec prioritizes jj.

All jj interactions are invoked via std::process::Command calling the jj CLI, without linking to jj-lib. This is entirely consistent with agent-spec's integration method for Git: calling commands rather than linking libraries. If other Agent-Native VCSs emerge in the future, only a new detection branch needs to be added.
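The detection precedence described above is a pure rule plus a thin filesystem wrapper. A minimal sketch (the function names are mine, not agent-spec's):

```rust
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
pub enum Vcs { Jj, Git, Plain }

/// Pure precedence rule: jj wins, including in colocated repos where
/// both .jj/ and .git/ exist; git is the fallback.
pub fn choose(has_jj: bool, has_git: bool) -> Vcs {
    if has_jj { Vcs::Jj } else if has_git { Vcs::Git } else { Vcs::Plain }
}

/// Detection, not configuration: nothing is declared in a config file.
pub fn detect_vcs(repo_root: &Path) -> Vcs {
    choose(repo_root.join(".jj").is_dir(), repo_root.join(".git").is_dir())
}
```

Because the rule is a pure function of two directory checks, adding support for a future Agent-Native VCS really is just one more detection branch.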

VII. Composability: agent-spec Does Not Solve All Problems

This is a boundary that needs honest discussion.

agent-spec guarantees contract compliance: whether the code meets the acceptance criteria defined in the Contract. However, "code quality" is far broader than contract compliance. A piece of code can pass all acceptance criteria yet still have race conditions, resource leaks, unreasonable abstraction levels, or fail to conform to the team's coding style.

This is not a design flaw in agent-spec; it is a conscious boundary choice. If agent-spec attempted to solve both contract verification and all dimensions of code quality simultaneously, it would become a bloated tool that does none of them deeply enough.

agent-spec's positioning is as an orchestrator. It provides a framework to turn the output of other tools into Contract acceptance criteria. Code quality improvement is achieved through composition:

Encode Quality Rules at the Spec Level

Many code quality rules can be formalized:

spec: project
name: "Project Rules"
---
## Constraints
### Must NOT
- Forbidden to use `.unwrap()` and `.expect()`
- Forbidden to use `panic!` and `todo!`
- Forbidden to use `f32` or `f64` for monetary amounts

Constraints in project.spec are inherited by all task specs. agent-spec's StructuralVerifier mechanically detects code patterns within them. This requires no additional lint tools; agent-spec itself can check "whether .unwrap() appears in the source code."
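As a rough sketch of what such a structural check does (not agent-spec's StructuralVerifier, which is smarter about where it looks), naive substring scanning already demonstrates the mechanism:

```rust
/// Scan source text for forbidden patterns from "Must NOT" constraints,
/// reporting (pattern, line number) pairs. Naive by design: a real
/// verifier would skip comments and string literals.
pub fn structural_violations<'a>(forbidden: &[&'a str], source: &str) -> Vec<(&'a str, usize)> {
    let mut hits = Vec::new();
    for (lineno, line) in source.lines().enumerate() {
        for pat in forbidden {
            if line.contains(pat) {
                hits.push((*pat, lineno + 1));
            }
        }
    }
    hits
}
```

A hit is evidence attached to a fail verdict: pattern, file, line, with zero token cost.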

Integrate External Tools into Completion Criteria

A more powerful composition method is to turn the output of existing tools like clippy, coverage tools, and security scanners into acceptance criteria:

Scenario: Code passes clippy strict check
  Test: test_clippy_passes_with_deny_warnings
  Given the implementation is complete
  When running `cargo clippy -- -D warnings`
  Then exit code should be 0
Scenario: Test coverage above threshold
  Test: test_coverage_above_80_percent
  Given the implementation is complete
  When running coverage tool on new code
  Then line coverage should be >= 80%

Thus, the Agent not only needs to pass functional tests but also needs to pass clippy and meet coverage thresholds. agent-spec does not replace these tools. It gives them a unified framework for judging "whether standards are met."

AI Verifier Covers Residual Dimensions

Dimensions that mechanical tools cannot detect (race conditions, design rationality, security vulnerability reasoning) are heuristically checked by the AI Verifier. Its output is uncertain, not a verdict, but supplementary information for humans.

Combined: project.spec encodes general rules (L0/L1 constraints), task.spec defines task-specific acceptance criteria (L2 criteria), Completion Criteria integrate output from external tools, and the AI Verifier covers residual probabilistic dimensions. Each link uses the tool best suited for it, with agent-spec providing the composition framework.

VIII. Three-Layer Spec Inheritance

agent-spec supports three-layer Spec inheritance:

org.spec → project.spec → task.spec

org.spec (Organization Level) defines security policies and coding standards across projects. For example: "No hardcoded credentials," "All user input must be validated," "Authentication-related code must have security tests." These rules apply to all projects within the organization.

project.spec (Project Level) defines the project's technology stack decisions and conventions. For example: "Use PostgreSQL," "All APIs return structured errors," "Use thiserror for unified error types." These rules apply to all tasks within the project.

task.spec (Task Level) defines the intent, boundaries, and acceptance criteria for a single task. This is the Contract directly consumed by the Agent.

Constraints and decisions are automatically inherited downwards. The Agent in a task spec does not need to know about org.spec and project.spec. The Contract it sees has already merged constraints from all ancestor levels. However, humans can manage in layers: the security team maintains org.spec, the project lead maintains project.spec, and developers write task specs. Each performs their duty, with mechanical merging.
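The mechanical merge down the chain can be sketched as a simple ordered, de-duplicated union. The conflict-resolution policy here is an assumption; agent-spec's Resolver may define its own precedence rules:

```rust
/// Merge constraint lists down the org -> project -> task chain,
/// preserving order and dropping duplicates. Illustrative only.
pub fn merge_constraints(layers: &[&[&str]]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for layer in layers {
        for c in *layer {
            if !merged.iter().any(|m| m == c) {
                merged.push((*c).to_string());
            }
        }
    }
    merged
}
```

The task-level Agent only ever sees the merged output, never the three source files.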

IX. Skills: Teaching Agents How to Work

agent-spec is not just a CLI tool. It also needs to teach Agents how to use itself. This is the role of Skills.

Installation

npx skills add ZhangHanDong/agent-spec

One command installs agent-spec Skills into the project.

Division of Labor Between Two Skills

agent-spec-tool-first is the default workflow Skill. It teaches the Agent the complete seven-step workflow: when to read the Contract, when to run lifecycle, how to read failure_summary and fix after failure, when to generate explain for humans, and when to run stamp. It also contains a critical guideline: Fix code after lifecycle failure, do not modify spec files, preventing the Agent from "cleverly" modifying acceptance criteria to force a pass.

agent-spec-authoring is the Spec writing Skill. It teaches the Agent how to write high-quality Contracts: the structure of the four elements, bilingual keywords, test binding formats, step table syntax, and the principle of "exception paths ≥ normal paths." When humans ask the Agent to help write a Contract, this Skill ensures the Agent's output conforms to best practices.

Multi-Agent Support

Skills are not just for Claude Code. The agent-spec project also includes Codex's AGENTS.md, Cursor's .cursorrules, and Aider's .aider.conf.yml. Although these files are simpler than the Claude Code Skill, they contain core command references and workflow steps, enabling different Agent tools to work with agent-spec.

The design principle of agent-spec is CLI-first, Agent-agnostic. Core functions are exposed via CLI commands, so any Agent capable of calling shell commands can use it. Skill files are the adaptation layer; the CLI is the universal layer.

X. A Real Self-Hosting Loop

agent-spec uses itself to validate itself. The project's specs/ directory contains a project.spec that defines agent-spec's own development constraints:

spec: project
name: "agent-spec Project Rules"
---
## Constraints
### Must
- Public CLI and gateway behavior must have regression tests
- DSL syntax changes must update AST, parse output, and regression tests simultaneously
- Verification results must distinguish pass, fail, skip, uncertain
...

Every commit is checked via agent-spec guard. Every new feature development has a corresponding task spec. The specs/roadmap/ directory contains the complete roadmap specs from Phase 0 to Phase 6, which are themselves Task Contracts in agent-spec format.

This is not for show—it is the best test. If agent-spec cannot use itself to manage its own development, it is not qualified to manage other projects.

XI. Current Status and Honest Boundaries

agent-spec is currently strongest in scenarios where Contracts can be checked by:

  • Explicit tests selected from Completion Criteria
  • Code pattern matching by StructuralVerifier
  • Path glob checks by BoundariesVerifier
  • Boundary verification against explicit or staged change sets

Its current limitations are also clear:

  • TestVerifier is Rust/Cargo specific (executed via cargo test). Non-Rust projects can use agent-spec's Contract and Boundaries capabilities, but test execution requires custom extensions.
  • The real backend for AI Verifier is not yet connected (stub and caller modes are available). Adversarial multi-Agent verification (Bug Finder / Skeptic / Referee tripartite game) is a reserved capability for the --adversarial flag, implementation pending the connection of a real AI backend.
  • The Resolver currently only inherits Constraints and Decisions, not Boundaries. If project.spec defines Forbidden: tests/golden/**, the task spec will not automatically inherit this boundary.
  • Quality score only measures three dimensions (determinism, testability, coverage) and does not reflect lint warnings like vague-verb, unquantified, or sycophancy. These warnings appear in the diagnostic list but do not affect the numerical score.

These limitations are known, documented, and have clear improvement paths. They do not prevent agent-spec from providing real value to Rust projects in its current state.

XII. Collaboration Models for AI Coding Teams

When a team has multiple developers and multiple AI Agents working simultaneously, the core challenge of collaboration is no longer "code conflicts" (which jj and Git can handle at the file level). The real challenge is intent conflicts: two Agents each complete their own tasks, the code compiles, the tests pass, but the combined behavior is inconsistent because the two Agents' understandings of the system diverge.

Traditional teams rely on Code Review to discover such issues: an experienced reviewer reads the diffs of two PRs, simulates the merged behavior in their head, and spots the contradiction. But when the number of PRs grows from 5 per day to 50 per day, this human-brain cross-verification becomes infeasible.

agent-spec's three-layer Spec inheritance provides a structured coordination mechanism here.

org.spec and project.spec are the Team's Consensus Anchors

In an AI Coding team, the primary work of humans is not writing code, nor reviewing PRs one by one, but maintaining Spec layers. The Tech Lead maintains project.spec, defining the project's technology stack decisions, API conventions, and error handling norms. The security team maintains org.spec, defining security bottom lines that cannot be violated. These files are common constraints for all Agents in the team. Regardless of which Agent is executing which task, the project.spec constraints it inherits are the same.

This means that when two Agents develop in parallel, they will not diverge on technology selection (because project.spec has already decided between bcrypt or argon2), will not produce inconsistencies in API style (because project.spec has defined error code formats), and will not violate security rules (because org.spec constraints are mechanically enforced). Spec layers transform "team consensus" from implicit knowledge in human brains into explicit, machine-executable constraints.

Task Contract is the Isolation Boundary for Tasks

The Boundaries paragraph of each task spec defines the scope of files the Agent can modify. When a team allocates tasks using agent-spec, a natural practice is to ensure that the Allowed Changes of different tasks do not overlap. Or, if overlap is necessary, explicitly declare a coordination strategy for the shared area in the Contract.

agent-spec's lint --cross-check can detect such conflicts: if task-A's Allowed Changes and task-B's Allowed Changes have overlapping paths, cross-check will report a warning. This won't block development—sometimes two tasks indeed need to modify the same file—but it lets the team realize potential concurrency conflicts at the time of task allocation, rather than discovering them only when two PRs are merged.
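A simplified form of that overlap warning can be sketched as follows. Real glob intersection is more subtle; this version (my simplification, not agent-spec's algorithm) treats two patterns as overlapping when one pattern's root directory contains the other's:

```rust
/// Strip a trailing "/**" glob and any trailing slash to get the
/// pattern's root directory.
fn root(pattern: &str) -> &str {
    pattern.strip_suffix("/**").unwrap_or(pattern).trim_end_matches('/')
}

/// Report pairs of patterns from two tasks' Allowed Changes whose roots
/// nest inside each other. A warning, not a block.
pub fn overlapping(a: &[&str], b: &[&str]) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for pa in a {
        for pb in b {
            let (ra, rb) = (root(pa), root(pb));
            if ra == rb || ra.starts_with(&format!("{}/", rb)) || rb.starts_with(&format!("{}/", ra)) {
                out.push((pa.to_string(), pb.to_string()));
            }
        }
    }
    out
}
```

`crates/api/src/auth/**` and `crates/api/src/**` overlap; `crates/api/**` and `crates/core/**` do not, so the two tasks can safely run in parallel.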

A Typical Team Day

Imagine the daily workflow of a four-person AI Coding team.

In the morning, the Tech Lead spends 30 minutes reviewing and updating project.spec. Last week, the team decided to standardize on structured logging; this decision needs to be written into the Decisions paragraph of project.spec so that all future Agents will follow this convention.

Then, the Tech Lead creates three task specs, assigning them to three developers. Each task spec defines clear intent, technical decisions (inherited from project.spec plus task-specific ones), file boundaries (non-overlapping or consciously overlapping), and 4-6 acceptance scenarios (including exception paths).

The three developers each start their own AI Agent (could be Claude Code, Codex, or Cursor; agent-spec doesn't care). The Agent reads the Contract, codes within Boundaries, runs lifecycle verification, and automatically retries on failure. Developers can do other things while the Agent works: write the Contract for the next task, review the explain output of the previous task, or update project.spec.

Once the Agent finishes, it creates a PR. The guard in CI automatically runs mechanical verification. The Contract summary from explain is automatically posted in the PR comments. The Tech Lead performs Contract Acceptance, needing to read not the code diffs of three PRs (which might total 1500 lines), but only three explain summaries (about 30 lines each), judging whether the Contract definition is correct and if all verifications passed.

If all three PRs pass Contract Acceptance and the guard's cross-check reports no boundary conflicts, they can be merged. Throughout this process, the Tech Lead's review time shifts from "reading 1500 lines of diff" to "viewing 90 lines of Contract summary." This saves approximately 80% of review time, while actual quality assurance is stronger because each PR has undergone deterministic four-layer verification.

When the Agent is Also a Contributor

In open-source scenarios, the situation is more interesting. External contributors might use AI Agents directly to contribute code. In this case, maintainers face a trust issue: you don't know the contributor, you don't know what prompt their Agent used, how many rounds of modification occurred, or what the intermediate process looked like.

What agent-spec provides here is not "trust" but "verifiability." Maintainers do not need to trust the contributor or their Agent; they only need to check two things: whether the intent and boundaries defined in the Contract are reasonable (which humans can judge quickly), and whether all deterministic checks in the lifecycle passed (which machines have already verified). The object of trust shifts from "people" to "process." Regardless of whether the code was written by a human or an AI, if it passes the same verification pipeline, that is a sufficient condition for merging.

This is why agent-spec emphasizes in its contribution guidelines that "Contract quality is as important as code quality." If an external contributor submits a well-written Contract—clear intent, explicit decisions, sufficient exception paths, and all scenarios bound to tests—even if their code style is inconsistent with the team's, the contribution can be safely accepted. This is because correctness is guaranteed by the Contract and verification pipeline, not by the reviewer's subjective judgment.

Conclusion: Not Better Code Review, But Different Code Review

Returning to the core argument of this series.

In the second article, we said: Code Review in the Agent era should not be "humans reading more diffs"; it should become "humans defining intent, machines verifying compliance."

In the third article, we said: Version control in the Agent era should not be "Agents learning git add / git commit." It should be "VCS automatically capturing the Agent's work process."

This article provides a concrete implementation. agent-spec is not a "better Code Review tool"; it is a different paradigm. It shifts the object of review from code to contracts, moves the review timing from after coding to before coding, and changes the executor of verification from humans to machines.

The role of humans has not disappeared; it has been upgraded, from "reading code to find bugs" to "defining what is right." This is a higher-value activity and a more scalable one: a good Contract can be executed and verified by any number of Agents, whereas a good reviewer can only read a limited number of diffs per day.

# Start your first Contract
cargo install agent-spec
agent-spec init --level task --name "my-first-task"

Project Address: github.com/ZhangHanDong/agent-spec[2]

References:

[1] agent-spec: https://github.com/ZhangHanDong/agent-spec

[2] github.com/ZhangHanDong/agent-spec: https://github.com/ZhangHanDong/agent-spec

