Agentic Software Engineering #6 | buffa: A Methodological Sample from Anthropic AI Coding (Rust)

buffa is not just a protobuf library; it is a paradigm. It demonstrates what humans and AI should and should not do when AI writes system-level code, and what documentation and tooling structures are needed to sustain high-quality collaboration.


Why buffa Deserves a Separate Breakdown

Among the three Rust projects open-sourced by Anthropic, buffa is the most worthy of study for AI coding practices because it occupies a unique position: It is the only AI-written code explicitly designated for production environments.

claudes-c-compiler is a capability showcase, with its README stating, "I do not recommend you use this code." connect-rust has no AI generation declaration; its core code is human-written. Only buffa is marked with both "Written by Claude ❣️" and "Running in Anthropic production."

This means buffa's engineering practices are not academic discussions; they have been tested in production. The methodology extracted from buffa is "proven effective," not merely "sounds reasonable."

Furthermore, buffa implements a protobuf runtime + codegen. It involves complex state machines (wire format), numerous edge cases (unknown fields, enums, recursion), performance sensitivity (decode loops), and high API design requirements (Rust type system + ergonomics), requiring long-term maintenance. This is precisely the type of code where AI is most prone to failure. buffa did not fail.


I. Constraints First: DESIGN.md as the AI's Decision Framework

1.1 Eight Design Principles = Hard Constraint Set for AI

The beginning of buffa's DESIGN.md lists eight design principles:

1. Pure Rust, zero C dependencies.
2. Editions-first.
3. Correct by default.
4. Idiomatic Rust API.
5. Zero-copy read path.
6. Linear-time serialization.
7. no_std capable.
8. Descriptor-centric.

These principles are not project visions for humans to read; they are decision arbitration rules for the AI. When the AI faces a conflict between two implementation schemes, these principles provide immediate arbitration based on implicit priority:

  • "Should we introduce a C dependency to speed up UTF-8 validation?" → Rejected by Principle 1.
  • "Who wins when zero-copy conflicts with correctness?" → Principle 3 ranks before Principle 5; correctness wins.
  • "Should we add proto parsing functionality to buffa?" → Principle 8 says to use protoc/buf descriptor, do not parse it yourself.

AI struggles most with ambiguous requirements and excels at optimizing within constraints. These eight principles narrow buffa's design space from "infinite possibilities" down to a finite space the AI can search efficiently.

1.2 Competitor Comparison Table = Preventing AI from Suggesting "Use Existing Libraries"

DESIGN.md starts with a competitor comparison table:

| Library | Pure Rust | Editions | Maintained |
|----------------|-----------|----------|------------|
| prost v0.13 | Yes | No | Passive |
| protobuf v4 | No (upb) | Yes | Active |
| rust-protobuf | Yes | No | Maintenance|

The function of this table is not market research for humans; it is a decision anchor for the AI. Without this table, when asked to "investigate whether we can fork prost to support editions," the AI might burn a large number of tokens on analysis. With this table, the answer is instant: prost is passively maintained and does not support editions, so forking is not feasible. This table is loaded at the start of every Claude Code session, eliminating an entire class of pointless "why not use X" discussions.

1.3 Module Boundaries = Telling AI Which File to Modify

About 30% of DESIGN.md is dedicated to describing the boundaries of each crate: what it does, what it doesn't do, and which code resides in which file:

WKT wire format is in src/generated/; JSON and stdlib conversions are in *_ext.rs.

The value of this statement lies not in telling humans that "WKT is pre-generated" (humans can see the src/generated/ directory), but in telling the AI: go to generated/ to modify the wire encoding of Timestamp, and go to timestamp_ext.rs to modify RFC3339 formatting. In a project with dozens of files, one of the most common mistakes AI makes is modifying the wrong file. Describing module boundaries reduces this error rate to near zero.


II. Parameterization Over Branching: Shifting Complexity Upstream to Codegen

2.1 Editions as the Core Abstraction

There are no if proto2 {} else if proto3 {} branches inside buffa's codegen. All behavioral differences stem from different value combinations of resolved features:

proto2 file → proto2 feature defaults
proto3 file → proto3 feature defaults
edition N → edition N defaults + overrides

The codegen only looks at resolved features (FieldPresence::Explicit vs Implicit, EnumType::Open vs Closed), not syntax versions.
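A toy sketch makes the difference concrete. The enum name FieldPresence and its Explicit/Implicit variants come from the text above; the generator function and its signature are assumed for illustration:

```rust
// Toy sketch: the generator branches only on resolved features,
// never on "is this proto2 or proto3".
#[derive(Clone, Copy, PartialEq)]
enum FieldPresence {
    Explicit, // proto2 default, or an editions override
    Implicit, // proto3 default
}

fn emit_field_type(rust_ty: &str, presence: FieldPresence) -> String {
    match presence {
        // Explicit presence: an Option wrapper tracks set-vs-unset.
        FieldPresence::Explicit => format!("Option<{rust_ty}>"),
        // Implicit presence: the zero value means "absent".
        FieldPresence::Implicit => rust_ty.to_string(),
    }
}

fn main() {
    assert_eq!(emit_field_type("i32", FieldPresence::Explicit), "Option<i32>");
    assert_eq!(emit_field_type("i32", FieldPresence::Implicit), "i32");
}
```

A proto2 file and an `edition = "2023"` file with `features.field_presence = EXPLICIT` hit the same match arm; the syntax version never appears in the generator.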

The implications for AI coding are profound:

| Bad Pattern | Good Pattern |
|------------------------------------|----------------------------------------------|
| AI writes if/else runtime branches | AI writes codegen that generates static code |
| AI manages complex state judgments | AI compiles state into code |
| Dynamic behavior | Static expansion |

Let AI generate "deterministic code"; do not let AI write runtime branching logic. This is because the correctness of branching logic depends on understanding all branch combinations, whereas the correctness of static expansion depends only on the local correctness of each expanded instance—the latter is where AI excels.

2.2 Descriptor-centric = Not Letting AI Do Parsing

buffa does not write its own .proto parser. It relies on protoc/buf to parse .proto files into FileDescriptorProto (a structured protobuf message), and then the codegen directly consumes this structured input.

This is a key capability boundary judgment: Parsing .proto files involves ambiguous syntax boundaries, complex import resolution, and feature inheritance chains. These are areas where AI is prone to errors. However, generating Rust code from a parsed descriptor is highly structured: clear input, clear output, and automatable verification.

buffa assigns the "part AI is not good at" (parsing) to existing reliable tools (protoc) and keeps the "part AI is good at" (generating code from structured input) for itself. This is not laziness; it is precise capability allocation.
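The division of labor can be illustrated with toy descriptor types (these are not buffa's actual structures, which come from protoc's FileDescriptorProto): structured input in, generated Rust source out, trivially verifiable.

```rust
// Toy descriptor: the generator never sees .proto text, only structure.
struct FieldDescriptor {
    name: &'static str,
    number: u32,
    rust_ty: &'static str,
}

fn generate_message(msg_name: &str, fields: &[FieldDescriptor]) -> String {
    let mut out = format!("pub struct {msg_name} {{\n");
    for f in fields {
        // Clear input, clear output: each line is locally checkable.
        out.push_str(&format!("    pub {}: {}, // field {}\n", f.name, f.rust_ty, f.number));
    }
    out.push_str("}\n");
    out
}

fn main() {
    let fields = [
        FieldDescriptor { name: "name", number: 1, rust_ty: "String" },
        FieldDescriptor { name: "id", number: 2, rust_ty: "i32" },
    ];
    let code = generate_message("Person", &fields);
    assert!(code.contains("pub name: String, // field 1"));
}
```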


III. Dual-Type System: A Structural Template for AI Writing Performance Code

3.1 Separation of Owned + View

buffa generates two Rust types for every protobuf message:

// Owned: Heap allocated, used for building and modifying
pub struct Person {
    pub name: String,
    pub id: i32,
    pub address: buffa::MessageField<Address>,
}
// View: Zero-copy, used for high-performance reading
pub struct PersonView<'a> {
    pub name: &'a str,
    pub id: i32,
    pub address: buffa::MessageFieldView<AddressView<'a>>,
}

If you ask AI to directly write a type that is "both easy to use and fast," it will struggle, as "easy to use" (String) and "fast" (&'a str) are contradictory in Rust. But if you give the AI a structural template—"generate two types for each message, one owned and one borrowed"—the AI can do the right thing within each type.

This is a structural prompt: instead of telling the AI "write faster code," you give it an architectural pattern, allowing performance to emerge naturally from the structure.
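The pattern reduces to a minimal, buffa-independent sketch: the owned type allocates, the view type only borrows, and a conversion bridges the two.

```rust
// Minimal sketch of the dual-type pattern (names mirror the example
// above, but this is not buffa's generated code).
struct Person {
    name: String, // owned: heap-allocated, freely mutable
    id: i32,
}

struct PersonView<'a> {
    name: &'a str, // view: borrows, zero allocation
    id: i32,
}

impl Person {
    fn as_view(&self) -> PersonView<'_> {
        // Zero-copy: no String is cloned, only a borrow is taken.
        PersonView { name: &self.name, id: self.id }
    }
}

fn main() {
    let p = Person { name: "Ada".to_string(), id: 7 };
    let v = p.as_view();
    assert_eq!(v.name, "Ada");
    assert_eq!(v.id, 7);
}
```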

3.2 OwnedView: A Self-Referential Wrapper Crossing Async Boundaries

The lifetime 'a of PersonView<'a> prevents it from satisfying the 'static bound, yet Tower services, tokio::spawn, and BoxFuture<'static, _> all require 'static. OwnedView<V> binds the Bytes buffer and the decoded view together, using transmute to extend the lifetime to 'static.

The safety argument for this transmute in DESIGN.md is threefold:

  1. Bytes is reference-counted; the heap data pointer remains stable after a move.
  2. Bytes is immutable; the data borrowed by the view will not be modified.
  3. Manual Drop implementation ensures the view is always dropped before the buffer.

This argument is not for human code reviewers; it is for the next AI session modifying OwnedView. When modifying any related code, the AI must re-verify whether these three conditions still hold. If one only writes "used transmute" without explaining "why it is sound," the AI might break these invariants in future modifications without realizing it.
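The mechanism can be sketched with std-only types, using Arc<String> to stand in for Bytes. This is an illustration of the three invariants, not buffa's implementation (which uses a manual Drop impl; here field declaration order suffices):

```rust
use std::sync::Arc;

struct View<'a> {
    name: &'a str,
}

// Rust drops fields in declaration order, so `view` is always dropped
// before the buffer it borrows from (condition 3 of the argument).
struct OwnedView {
    view: View<'static>,
    _buf: Arc<String>,
}

impl OwnedView {
    fn new(buf: Arc<String>) -> Self {
        let borrowed: &str = &buf;
        // SAFETY sketch, mirroring the three conditions: Arc's heap data
        // has a stable address while `_buf` is held (1), the String is
        // never mutated through this wrapper (2), and drop order
        // guarantees `view` goes first (3).
        let extended: &'static str = unsafe { std::mem::transmute(borrowed) };
        OwnedView { view: View { name: extended }, _buf: buf }
    }

    fn name(&self) -> &str {
        self.view.name
    }
}

fn main() {
    let ov = OwnedView::new(Arc::new("hello".to_string()));
    assert_eq!(ov.name(), "hello");
}
```

Break any one of the three conditions (say, swap the field order) and the sketch becomes unsound with no compiler diagnostic, which is exactly why the argument must live next to the code.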


IV. Decision Evolution History: Preventing AI Regression

4.1 AtomicU32 vs Cell<u32>: The Path Already Traveled

The most valuable content in DESIGN.md is the history of decision evolution. Taking CachedSize as an example:

An earlier design used Cell<u32> on the assumption that avoiding atomics would be faster... In practice, Relaxed-ordered atomic load/store compiles to identical machine instructions as a plain memory access on every major platform... Switching to AtomicU32 makes messages Sync, enabling Arc<Message>...

The structure is:

  1. We previously used Cell<u32> (seemed more efficient).
  2. The reason was "to avoid atomic overhead."
  3. But in reality, Relaxed is zero-cost on all major platforms.
  4. After switching to AtomicU32, messages became Sync (can be placed in Arc).
  5. Furthermore, the DefaultInstance trait requires Sync—using Cell would directly cause a compilation failure.

The core function of this passage is to prevent regression. Without this history, a new Claude session seeing AtomicU32 might "optimize" it back to Cell<u32>, thinking serialization is single-threaded and atomics seem wasteful. But this passage tells it: this path has been walked and led to a dead end; it's not just a performance issue, it causes compilation failures.
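A minimal sketch of the winning design (method names assumed) shows the property that matters, that AtomicU32 keeps the containing type Sync:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Relaxed atomic load/store compiles to a plain memory access on
// major platforms, so this costs no more than Cell<u32> did.
#[derive(Default)]
struct CachedSize(AtomicU32);

impl CachedSize {
    fn get(&self) -> u32 {
        self.0.load(Ordering::Relaxed)
    }
    fn set(&self, v: u32) {
        self.0.store(v, Ordering::Relaxed)
    }
}

fn main() {
    // Sync in action: shareable via Arc across threads.
    // With Cell<u32> inside, this program would not compile.
    let size = Arc::new(CachedSize::default());
    let s2 = Arc::clone(&size);
    thread::spawn(move || s2.set(42)).join().unwrap();
    assert_eq!(size.get(), 42);
}
```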

4.2 Rejected Pre-scan Scheme: A Data-Backed "Do Not Do"

Two approaches were benchmarked:

  • Per-field scanning: 20-97% regressions
  • Single-pass counting: 5-40% regressions

Vec's doubling strategy produces at most log2(n) allocations, and for typical protobuf maps/repeated fields (2-20 entries), that's only 2-5 allocations — cheaper than a full buffer scan.

Pre-scanning seems like an "obviously correct" optimization: count the elements first, then allocate once. Any experienced programmer (or AI) might propose this. But the data says no: both implementations were slower.

This record of "rejected solutions + benchmark data" is designed specifically for AI collaboration. Traditional documentation does not record "what we didn't do," but for AI, "what not to do" is as important as, if not more important than, "what to do," because AI has no memory of "we tried this last time."
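The doubling claim is easy to check empirically. Exact counts depend on the standard library's growth policy, so the sketch below only bounds them:

```rust
// Count how many times a Vec reallocates while pushing n elements,
// by watching for capacity changes.
fn realloc_count(n: usize) -> usize {
    let mut v: Vec<u32> = Vec::new();
    let mut count = 0;
    let mut last_cap = v.capacity();
    for i in 0..n {
        v.push(i as u32);
        if v.capacity() != last_cap {
            count += 1;
            last_cap = v.capacity();
        }
    }
    count
}

fn main() {
    // For a typical repeated field of 20 entries, only a handful of
    // (re)allocations occur, far cheaper than scanning the whole buffer.
    let c = realloc_count(20);
    assert!(c >= 1 && c <= 6, "got {c} reallocations");
}
```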

4.3 Investigation Logs: The Complete Thought Process

docs/investigations/e0477-owned-view-send/ is a complete investigation log: it tried trait_variant, RTN, various drop() variants, and finally found the issue was at the signature level, not in the function body. This file records the thought process, not just the final conclusion.

If a future AI session wants to "optimize" OwnedView's Send impl or add back the V: 'static bound, this investigation log will prevent regression. "We tried it; the problem lies in the Rust compiler's RPITIT desugaring bug (rust-lang/rust#128095), not something solvable at the code level."


V. Review Agent: Separation of Doing and Evaluating

5.1 Design of rust-code-reviewer.md

buffa's CLAUDE.md instructs Claude to run the rust-code-reviewer agent before submission. The metadata of this agent reveals three key design decisions:

model: opus: Use the strongest model for review, and Sonnet for coding. Judgment is more valuable than generation speed.

tools: Read, Glob, Grep: Read-only permissions. The reviewer cannot modify code. This enforces separation of duties, avoiding the closed loop of "modifying code and then approving it oneself."

16 Review Dimensions: Ranging from API Design to Observability, covering the full spectrum of Rust engineering. API Design ranks first (usability is the most important quality dimension), and Unsafe ranks eighth (safety is a result of good design, not an independent goal).

5.2 Why Not Let AI Self-Review?

Anthropic's harness-design blog post found that when AI evaluates its own work, it systematically gives itself inflated scores. Tuning an independent evaluator to be picky is far easier than making the generator self-critical.

buffa's reviewer agent is the engineering implementation of this finding. The coding Claude (possibly Sonnet) writes the code, and the reviewing Claude (Opus, read-only) evaluates it. The two are completely separated.

5.3 Two-Layer Documentation Overlay = Complete Review

rust-code-reviewer.md provides general Rust knowledge (16 dimensions), while CLAUDE.md provides project-specific rules ("regenerate types if codegen is modified"). The overlay of these two layers produces a review that has both depth and project awareness. This is a reusable pattern: take rust-code-reviewer.md to your own Rust project, and you only need to write your own CLAUDE.md.


VI. Code Comments Oriented Towards Future AI Collaborators

6.1 "Why" Comments > "What" Comments

The comment style in buffa's source code differs significantly from ordinary open-source projects:

// An unbounded `loop` is used intentionally: a bounded
// `for _ in 0..10` adds loop-counter overhead that LLVM cannot eliminate

This does not explain "this is a loop"; it explains "why we don't use a bounded loop that looks safer"—because LLVM cannot prove the loop definitely ends within 10 iterations, retaining loop counter overhead, which causes encoding throughput to plummet by 40%. It prevents AI from introducing performance regressions during future "improvements."

6.2 Inline Safety Arguments

Every unsafe block has a // SAFETY: comment. It is not a formal one-liner but a complete argumentation chain:

// SAFETY: `Bytes` is StableDeref — its heap data never moves or is
// freed while we hold the `Bytes` value. We hold it in `self.bytes`,
// and drop order guarantees `view` drops first.

Three conditions (StableDeref + holding relationship + drop order), none missing. This style feels more like a format required by a reviewer—AI's spontaneous SAFETY comments tend to be nonsense like "safe because we know it's correct."

6.3 Test Names as Specifications

fn explicit_presence_with_zero_value()
fn has_extension_returns_false_on_extendee_mismatch()
fn extension_or_default_zero_is_present_not_default()

These test names are specifications themselves. When modifying code, AI can understand expected behavior from test names without needing extra specification documents.
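The style is easy to imitate. A stand-in using Option<i32> for explicit presence shows how the name alone carries the specification (toy types, not buffa's API):

```rust
// Stand-in types: Option models explicit field presence.
struct Msg {
    id: Option<i32>,
}

impl Msg {
    fn has_id(&self) -> bool {
        self.id.is_some()
    }
}

// The test name states the spec: a field explicitly set to its zero
// value is still "present", not "absent".
fn explicit_presence_with_zero_value() {
    let m = Msg { id: Some(0) };
    assert!(m.has_id());
}

fn main() {
    explicit_presence_with_zero_value();
}
```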


VII. The Taste Boundary of Performance Engineering

7.1 Intentional Performance Overhead

DESIGN.md lists the reason for every performance overhead:

| Overhead | Reason | Can it be optimized away? |
|-------------------------|---------------------------------|----------------------------------------------|
| Unknown field Vec::push | Round-trip fidelity | Should not—this is a core feature |
| EnumValue wrapper | Type-safe open enum | Should not—otherwise degrades to prost's i32 |
| Recursion depth check | Support recursive message types | Should not—otherwise triggers E0275 |
| Box per nested message | Standard Rust ownership | Could use arena, but violates Principle 4 |

This table is a "do not optimize away" list for the AI. The AI's first reaction might be to eliminate these overheads, but the table explicitly states: each item is intentional, corresponding to a non-negotiable feature.

7.2 The Readability Red Line

Readability line we hold: fast-path/slow-path splits with a "why" comment are fine. Manual unrolling, #[inline(always)] sprinkled defensively, SIMD intrinsics, or likely()/unlikely() workarounds are not. The test: can a new contributor read the code, understand the fast path, and safely modify the slow path?

This is the philosophy of performance engineering in the AI era: Reasonability > Extreme Micro-optimization.

AI naturally tends towards aggressive optimization (it doesn't care about readability); this rule pulls the AI back into the "human-maintainable" range. The boundary between allowed optimizations (fast-path separation + why comments) and disallowed optimizations (manual unrolling, SIMD, likely/unlikely) is precise.

This deserves further elaboration.

After recording three profile-guided optimizations, buffa's DESIGN.md specifically adds this "readability red line" declaration. It is not a vague code style guide; it explicitly draws the line on what optimizations cannot be done after concretely demonstrating what optimizations are permissible.

This order is important. First, look at what it allows.

Allowed: Three Profile-Guided Optimizations

These three optimizations all come from pprof data during the connect-rust integration process (LogRecord view-decode benchmark, approx. 350 string fields and 450 varints per request). Each is a "small, commented, readability-preserving" change.

Optimization 1: Unbounded Loop in encode_varint

// Before (after some refactoring)
for _ in 0..10 {
    if value < 0x80 {
        buf.put_u8(value as u8);
        return;
    }
    buf.put_u8((value as u8 & 0x7F) | 0x80);
    value >>= 7;
}
// After (restored to unbounded loop)
loop {
    if value < 0x80 {
        buf.put_u8(value as u8);
        return;
    }
    buf.put_u8((value as u8 & 0x7F) | 0x80);
    value >>= 7;
}

A certain refactoring changed loop to for _ in 0..10, which looked "safer" because it had explicit bounds. However, LLVM could not prove that the internal return would definitely trigger within 10 iterations, so it retained the loop counter maintenance overhead (comparison, increment, conditional jump). Since value >>= 7 monotonically decreases, termination is mathematically guaranteed; the unbounded loop allows LLVM to see this and generate more compact machine code.

Impact: **~40% recovery in encoding throughput**.

This optimization fits the red line: it is a fast-path choice (unbounded vs. bounded loop), it carries a complete "why" comment explaining why the seemingly safer form is not used, and new contributors can read and safely modify it.

The significance for AI is particularly great: AI is very likely to "improve" this code by changing loop back to for _ in 0..10, because bounded loops "look safer." The record in DESIGN.md directly prevents this regression.
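For completeness, a self-contained, runnable version of the restored encoder, with Vec<u8> standing in for the buffer abstraction the snippet above writes to:

```rust
fn encode_varint(mut value: u64, buf: &mut Vec<u8>) {
    // An unbounded `loop` is used intentionally: `value >>= 7` strictly
    // decreases, so termination is guaranteed without a loop counter.
    loop {
        if value < 0x80 {
            buf.push(value as u8);
            return;
        }
        buf.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_varint(300, &mut buf);
    // 300 = 0b1_0010_1100: low 7 bits with continuation bit set, then 2.
    assert_eq!(buf, vec![0xAC, 0x02]);

    buf.clear();
    encode_varint(1, &mut buf);
    assert_eq!(buf, vec![0x01]);
}
```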

Optimization 2: Single-Byte Fast Path in Tag::decode

In protobuf, field number 1-15 plus any wire type encodes to a single byte. This is the most common case (protobuf style guides recommend placing high-frequency fields in this range). The original code went through the generic decode_varint path. Although decode_varint also has a single-byte fast path internally, because the #[inline] hint was not strong enough, LLVM often failed to inline it into the per-field decode loop.

The solution was to explicitly add a single-byte check in Tag::decode:

// Fast path: field 1-15 only needs one byte
if buf.remaining() > 0 {
    let byte = buf.chunk()[0];
    if byte < 0x80 {
        buf.advance(1);
        return Ok(Tag { field_number: (byte >> 3) as u32, wire_type: ... });
    }
}
// Slow path: multi-byte varint
let v = decode_varint(buf)?;

Impact: **+12-29% view decode, +9-16% owned decode**.

This also fits the red line: fast-path/slow-path separation, with a "why" comment, and the slow path is completely unaffected. New contributors can safely modify the slow path without understanding the optimization motivation of the fast path.
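The same fast/slow split can be sketched as a self-contained function, with a plain byte slice replacing the Buf trait and wire-type handling simplified:

```rust
// Returns (field_number, wire_type, bytes_consumed), or None on bad input.
fn decode_tag(buf: &[u8]) -> Option<(u32, u8, usize)> {
    // Fast path: tags for fields 1-15 fit in a single byte (< 0x80).
    let b0 = *buf.first()?;
    if b0 < 0x80 {
        return Some(((b0 >> 3) as u32, b0 & 0x07, 1));
    }
    // Slow path: general multi-byte varint, untouched by the fast path.
    let mut v: u64 = 0;
    for (i, &b) in buf.iter().enumerate().take(10) {
        v |= u64::from(b & 0x7F) << (7 * i);
        if b < 0x80 {
            return Some(((v >> 3) as u32, (v & 0x07) as u8, i + 1));
        }
    }
    None
}

fn main() {
    // Field 1, wire type 2 (length-delimited): tag byte 0x0A.
    assert_eq!(decode_tag(&[0x0A]), Some((1, 2, 1)));
    // Field 16, wire type 0: tag value 128 needs two varint bytes.
    assert_eq!(decode_tag(&[0x80, 0x01]), Some((16, 0, 2)));
}
```

Note that the slow path re-reads the buffer from byte zero, so modifying it cannot break the fast path's assumption; the two paths only share the input.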

Optimization 3: strict_utf8_mapping Opt-in

pprof showed that core::str::from_utf8 accounted for 11% of decode CPU. Rust's &str type has a compile-time UTF-8 invariant; one cannot skip validation and still maintain the &str type.

The solution was not to skip validation (which would break Rust type safety), but to provide an option at the codegen level: when a proto's utf8_validation = NONE, map string fields to Vec<u8> / &[u8] instead of String / &str. The caller chooses whether to use from_utf8 (check) or from_utf8_unchecked (trust input).

Impact: **~2× RPS** in trusted input services of connect-rust.

The brilliance of this optimization lies in: it is not a runtime optimization (no code in the decode loop was changed), but a type selection at the codegen level. By changing the type signature of the generated code (&str → &[u8]), the decision of "whether to validate UTF-8" is moved from runtime to compile time.

Moreover, it is default-off; proto2's default is NONE, and enabling it automatically would break the types of all proto2 string fields. This prudence itself reflects the readability red line: no default aggressive optimization; require explicit user opt-in.
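On the caller side, the &[u8] mapping makes the validation decision explicit. The function names here are illustrative, not buffa's API:

```rust
// Validating path: pay the UTF-8 check, get a safe Option<&str>.
fn message_checked(raw: &[u8]) -> Option<&str> {
    std::str::from_utf8(raw).ok()
}

// Trusting path: skip the check for inputs known to be valid UTF-8.
fn message_trusted(raw: &[u8]) -> &str {
    // SAFETY: only sound if the producer guarantees valid UTF-8;
    // that is exactly the contract utf8_validation = NONE pushes
    // from the decode loop onto the caller.
    unsafe { std::str::from_utf8_unchecked(raw) }
}

fn main() {
    assert_eq!(message_checked(b"hello"), Some("hello"));
    assert_eq!(message_trusted(b"hello"), "hello");
    assert_eq!(message_checked(&[0xFF]), None); // invalid UTF-8 rejected
}
```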

Disallowed: Optimizations Across the Red Line

DESIGN.md explicitly lists four categories of prohibited optimization methods:

Manual Unrolling: Expanding for i in 0..4 { process(data[i]); } into four lines of process(data[0]); process(data[1]); .... While this can indeed reduce branch prediction overhead in the inner loop of protobuf decode, the expanded code loses the semantic meaning of "this is a loop." New contributors seeing four lines of repeated code won't know they are an expansion of the same logic and might only modify one line when making changes.

Defensive #[inline(always)]. Rust's #[inline] is a hint; #[inline(always)] is a mandate. Sprinkling #[inline(always)] everywhere on the decode hot path can indeed avoid function call overhead, but the cost is code bloat (function body expansion at every call site) and increased compile time. More importantly, it conveys an attitude of "I don't trust the compiler." This is reasonable in human-maintained code (humans might be more accurate than LLVM's judgment), but dangerous in AI-maintained code because the AI does not understand the context of "why this function needs always inline while that one doesn't."

SIMD Intrinsics: Using SSE/AVX instructions like _mm_cmpeq_epi8 for varint decoding or UTF-8 validation. Performance gains are significant (simdjson follows this route), but SIMD code is essentially "writing assembly in Rust syntax," modifiable only by those who understand the target architecture's instruction set. This also conflicts with buffa's design principle 1 (Pure Rust, zero C dependencies). Although SIMD intrinsics are Rust code, their behavior is inconsistent in no_std and cross-platform scenarios.

likely()/unlikely() Workarounds. Rust stable does not have likely/unlikely hints (core::intrinsics::likely is nightly-only), so the community invented various workarounds: wrapping slow paths in #[cold] functions, using core::hint::black_box to trick the optimizer, and so on. These tricks rely on knowledge of LLVM's internal behavior and may stop working when the LLVM version upgrades—complete black magic for AI.

The Judgment Standard of the Red Line

DESIGN.md provides a precise test:

can a new contributor read the code, understand the fast path, and safely modify the slow path?

Three conditions, none missing:

  • read the code. Is the code's intent understandable upon reading, without needing to consult LLVM documentation or CPU manuals?
  • understand the fast path. Is the trigger condition for the fast path obvious ("first byte < 0x80 means single-byte varint")?
  • safely modify the slow path. Will modifying the slow path accidentally break the assumptions of the fast path?

All three allowed optimizations pass this test. Unbounded loop: reading the code shows value >>= 7 must terminate. Single-byte fast path: the meaning of byte < 0x80 is obvious to anyone familiar with varint encoding. UTF-8 type mapping: changes the type signature of codegen output; runtime code is unaffected.

The four prohibited optimizations fail. Manual unrolling: doesn't know the rationale for the unroll factor. Always inline: doesn't know why certain functions need forced inlining. SIMD: doesn't know the semantics of instructions and platform constraints. Likely/unlikely: doesn't know the compiler behavior the workaround relies on.

What This Red Line Means for AI Coding

Returning to the core question. buffa is code written by Claude. Claude's optimization tendencies differ from humans:

Humans tend to under-optimize. Human engineers usually write readable code first, perform targeted optimizations only after profiling shows bottlenecks, and hesitate ("will this change make it harder to maintain?").

AI tends to over-optimize. AI has no intuition for "this change will be hard to maintain later." If you ask it to "optimize decode performance," it might indiscriminately add #[inline(always)], unroll loops, and insert SIMD intrinsics, because these appear frequently in "high-performance code" samples in its training data.

The true function of this red line is: when AI is asked to optimize performance, tell it "to what extent optimization should stop." Without this red line, AI might turn buffa's decode loop into a pile of SIMD intrinsics; performance would indeed be faster, but the next AI session (or human maintainer) would no longer understand or be able to modify it.

This also explains why buffa's performance (view decode 1,772 MiB/s, 156% faster than prost) is already very good, but still not as good as Google's upb (C implementation with SIMD acceleration). buffa chose to achieve the best possible performance within the red line rather than breaking the red line to pursue the extreme. This is a conscious trade-off: AI maintainability > the last 20% of performance.

In the era of AI coding, this may be a more important design question than "how to write faster code": how much performance are you willing to sacrifice for maintainability? buffa's answer is: sacrifice optimizations at the level of SIMD and manual unrolling, but retain optimizations at the level of fast-path separation and codegen type selection. This line is drawn very precisely.


VIII. Infrastructure Constraints, Not Prompt Preaching

8.1 Two Mandatory Rules in CLAUDE.md

buffa's CLAUDE.md has only two core rules:

Rule One: After modifying codegen output, checked-in code must be regenerated (task gen-wkt-types), or CI fails.

Rule Two: Before submission, the rust-code-reviewer agent must be run, and all Critical/High/Medium findings must be resolved.

These are not "suggestions"; CI checks the first, and the reviewer agent checks the second. This is infrastructure constraint: using tools to enforce AI behavior, rather than relying on "please pay attention to quality" in prompts.

8.2 The Rust Compiler Itself = The Zeroth Review

buffa chose Rust not just for performance. Rust's type checker acts as an unbypassable quality gate when AI generates code:

  • Memory safety issues → Compilation fails
  • Lifetime errors → Compilation fails
  • Send/Sync violations → Compilation fails

The zero unsafe blocks in claudes-c-compiler prove this point. The type checker cannot be persuaded, is not lenient, and does not think "close enough is fine." For the few places in buffa that must use unsafe (the transmute in OwnedView), the complete safety argument in DESIGN.md ensures the AI will not break invariants in future modifications.

8.3 Conformance Tests = Ultimate Verification

buffa passed the full suite of Google's protobuf binary and JSON conformance tests. This test suite was not written by the buffa team; it is maintained upstream by the protobuf project, an authoritative verification covering all edge cases.

This means: whether the AI-generated encoding/decoding code is correct does not depend on the AI's self-evaluation or human code review, but on an external, independent, authoritative verification system. This is the purest embodiment of the principle "test quality determines code quality."


IX. Restoring the Real Workflow

Combining all evidence, buffa's development process is likely as follows:

Step 1: Human (McGinniss) defines design constraints → Writes DESIGN.md design principles, competitor comparison, module boundaries.

Step 2: AI (Claude) generates codegen + runtime code → Implements encoding/decoding logic within the constraint space.

Step 3: Automated verification → Conformance test suite checks correctness.

Step 4: AI reviews AI → rust-code-reviewer agent (Opus, read-only) reviews code quality.

Step 5: Human reviews trade-offs → Decides AtomicU32 vs Cell, whether to pre-scan, etc.

Step 6: AI iterates optimization → Performs profile-guided optimization based on pprof data.

Step 7: Human updates DESIGN.md → Records decision evolution, rejected schemes, performance causality.

Step 8: Loop Step 2-7.

Note the division of labor between human and AI: Humans define and narrow the design space (Step 1, 5, 7), while AI implements and optimizes within the defined space (Step 2, 4, 6). Verification (Step 3) is fully automated, relying on neither party's judgment.


X. Five Transferable Practices

AI coding practices distilled from buffa can be directly migrated to any project requiring AI to write "long-term maintenance code":

Practice 1: Write DESIGN.md Before Writing Code

Not retroactively adding documentation, but before the AI starts, humans define: design principles (with priority sorting), module boundaries (which file does what), and competitor analysis (why not use existing solutions). This file is the AI's decision framework, not background introduction for humans.

Practice 2: Write Decision History and Rejected Schemes into DESIGN.md

Traditional documentation records "what we did"; AI collaborative documentation must also record "why we didn't do others." Every rejected scheme must include benchmark data; "pre-scan scheme regressed performance by 20-97%" is ten thousand times more useful than "pre-scan scheme is bad."

Practice 3: Review Agent Uses Strongest Model + Read-Only Permissions

Use Sonnet (fast, cheap) for generating code, and Opus (accurate, expensive) for reviewing code. The review agent cannot modify code, only report, enforcing separation of duties.

Practice 4: Let AI Write Code Generated by Codegen, Not Runtime Branching Logic

Shift complexity upstream to the code generation phase. AI generates deterministic code from structured input (descriptor, spec, schema); runtime performs no complex judgments. This aligns perfectly with the paradigms of Spec-driven development and DSL → Code.

Practice 5: Validate with External Authoritative Test Suites, Not AI Self-Evaluation

buffa uses Google's conformance test, claudes-c-compiler uses GCC torture test. AI-written code is only trustworthy when the verification system is complete and independent of the AI itself.


Final Judgment

The reason buffa can achieve "AI-written code in production" is not because Claude is particularly strong, but because a complete structure of constraints, feedback, and verification was built around Claude:

  • DESIGN.md constrains the decision space
  • rust-code-reviewer provides external review
  • conformance test provides authoritative verification
  • Rust compiler provides type safety guarantees
  • CLAUDE.md ties these into a mandatory process

Remove any layer, and buffa's code quality would significantly degrade. Model capability is a necessary condition, but far from sufficient. The sufficient condition is the constraint structure designed by humans.

This is the true value of buffa as a paradigm of "AI writing infrastructure code"—it is not demonstrating "how good code AI can write," but rather "what kind of structure humans need to build to enable AI to write sufficiently good code."

Appendix

Appendix A: buffa-design-annotated.md — a line-by-line annotated (Chinese-English) reading of buffa's DESIGN.md[1]

Appendix B: rust-code-reviewer-annotated.md — a line-by-line annotated (Chinese-English) reading of connect-rust's rust-code-reviewer.md[2]

[1] buffa-design-annotated.md — buffa DESIGN.md line-by-line annotated (Chinese-English) Gist: https://gist.github.com/ZhangHanDong/f4ac670f2fdd939bc4355dad93df92b0
[2] rust-code-reviewer-annotated.md — connect-rust rust-code-reviewer.md line-by-line annotated (Chinese-English) Gist: https://gist.github.com/ZhangHanDong/ebc9577991d0a5ce94f1d91c5d64fe40
