In early 2026, Anthropic publicly released a series of Rust projects and engineering experiences in rapid succession:
In February, they released a C compiler (claudes-c-compiler) built by 16 parallel Claude agents. In March, they simultaneously open-sourced the protobuf library buffa and the RPC framework connect-rust[1], while publishing two in-depth engineering blogs: one on the agent-team experience behind the C compiler, titled Building a C compiler with a team of parallel Claudes[2], and another on harness design for long-running applications, titled Harness design for long-running application development[3]. Additionally, Anthropic forked two critical Rust infrastructure libraries: the async runtime anthropics/tokio[4] and the concurrent cache anthropics/moka[5], making targeted modifications that exposed their real technical demands in production environments.
While these materials may appear independent on the surface, a comprehensive analysis reveals they encode a complete, reusable methodology for AI coding. This article starts with Anthropic's Rust application layout, deconstructs these five projects and two blogs layer by layer, and ultimately distills systematic best practices for AI coding.
I. Why Anthropic Uses Rust
Anthropic is an AI company, but the weight of Rust in its engineering system far exceeds external perception.
A job posting for Anthropic's Build Systems team (March 2026) revealed a critical detail: the company's monorepo spans three languages: Python, Rust, and Go, with build targets covering multiple accelerator platforms including TPU, Trainium, and GPU. The position explicitly required "familiarity with the Rust build toolchain (cargo, maturin) and Python-Rust interoperability."
This indicates that Rust is the "load-bearing wall" of Anthropic's systems.
An analysis of the 14+ open job postings (as of March 2026) shows Rust appearing in at least the following teams:
Inference Services: Serving Claude inference for millions of users, including intelligent routing across thousands of accelerators and LLM inference optimization.
Build Systems: Rust is a core required language, used for hermetic build infrastructure and remote build execution.
Sandbox (two separate roles): Secure sandbox execution environments, kernel optimization, virtualization, and lightweight VM solutions.
Security Engineering (multiple roles in London, San Francisco, Seattle): Authentication architecture, cryptographic foundations, CI/CD hardening, and experimental security R&D in the "Security Lab."
Agent Infrastructure: Autonomous agent execution environments, state management, and safety boundaries.
Additionally, Rust is present in teams for Inference Deployment, Reinforcement Learning Research (Horizons), Interpretability Research, and Core Infrastructure.
Five Open-Source Rust Repositories: Three Original Projects + Two Deep Forks
Anthropic's GitHub organization currently hosts three original Rust repositories and two forks with substantial modifications:
claudes-c-compiler — original, a C compiler written 100% by AI
buffa — original, an AI-led protobuf library, running in production
connect-rust — original, a human-led RPC framework with AI code review via .claude/agents
anthropics/tokio — fork, adds scheduler stall detection
anthropics/moka — fork, fixes a TOCTOU race in the concurrent cache
The three original projects constitute an AI participation gradient (100% AI → AI-led → Human-led), while the two forks expose Anthropic's deep-water production requirements—runtime-level issues only encountered in high-concurrency production services.
II. Technical Interpretation of Five Anthropic Open-Source Rust Projects
claudes-c-compiler: Stress Test and Capability Showcase of 100% AI Autonomous Coding
In February 2026, Anthropic security researcher Nicholas Carlini used 16 parallel Claude Opus 4.6 agents to build a Rust C compiler from scratch. Over two weeks, approximately 2,000 Claude Code sessions cost about $20,000 and produced 100,000 lines of Rust code.
The claudes-c-compiler supports x86-64, i686, AArch64, and RISC-V backends, includes a built-in assembler and linker, and can compile the unmodified Linux 6.9 kernel, joining the ranks of GCC, Clang/LLVM, and Intel oneAPI. It can also compile QEMU, FFmpeg, SQLite, PostgreSQL, and Redis, passing 99% of the GCC torture test cases. It also passed the developer's ultimate litmus test: it can compile and run Doom. Anthropic used compiling and running Doom to prove that AI-generated code can not only compile but also correctly execute complex real-time logic at runtime.
This is an interesting tradition in programmer culture. Released by id Software in 1993, Doom was one of the first true 3D first-person shooters. The engine written by John Carmack was extremely efficient for its time, running smoothly on 386 processors, with source code open-sourced under GPL in 1997. Since then, "Can it run Doom?" has become a semi-joking technical validation standard. Community members have run Doom on printers, pregnancy test screens, ATMs, and in Minecraft. It has become a meme, but it carries serious technical implications. Although Doom's source code is only about 40,000 lines of C, it thoroughly tests a compiler: it covers a vast array of C language features, offers immediate correctness verification, has long dependency chains, and imposes actual performance pressure.
This is a clean-room implementation; Claude had no internet access during development, relying solely on the Rust standard library. The code contains zero unsafe blocks.
However, it has clear limitations: it lacks 16-bit x86 support (needed for the kernel's real-mode boot code, for which the build still calls GCC), the generated machine code is less efficient than GCC's output with optimizations disabled, and the Rust code quality is "reasonable but far from expert-level." Carlini admitted, "The compiler is approaching the limits of Opus's capabilities."
The architecture is extremely simple, an infinite Bash loop:
while true; do
claude --dangerously-skip-permissions \
-p "$(cat AGENT_PROMPT.md)" \
--model claude-opus-X-Y > "$LOGFILE" 2>&1
done

Sixteen Docker containers each run such a loop, synchronizing via git. There is no orchestrator agent or complex orchestration framework. The coordination mechanism is file locking: agents write placeholder files into the current_tasks/ directory, and git's conflict mechanism naturally resolves contention.
Although this project was mocked by the technical community after its release, it is worth noting that it serves only as a capability showcase for Opus 4.6, not as a production-grade replacement. Obviously, it cannot replace GCC. I previously referred to this project as an "actor."
If we view this project from the correct angle, we can learn from it, turning noise into signal.
Looking at the claudes-c-compiler source code, I was curious: why is there no CLAUDE.md?
I then realized that CLAUDE.md is an operation manual for humans to guide Claude, suitable for human-machine collaboration modes.
The mode of claudes-c-compiler is entirely different; it is autonomous. Returning to this Bash loop (running only in containers):
#!/bin/bash
while true; do
COMMIT=$(git rev-parse --short=6 HEAD)
LOGFILE="agent_logs/agent_${COMMIT}.log"
claude --dangerously-skip-permissions \
-p "$(cat AGENT_PROMPT.md)" \
--model claude-opus-X-Y > "$LOGFILE" 2>&1
done

This Bash code is essentially a basic form of a "Ralph loop." The Ralph loop (named after Ralph Wiggum from The Simpsons, an innocent and tireless character) is a method popular in the Claude Code community for keeping AI agents working autonomously.
--dangerously-skip-permissions here means skipping all permission confirmations. While a Ralph loop running on a personal machine might still get occasional human oversight, Carlini's version is true "set it and forget it," which necessitates skipping permission prompts (otherwise Claude would stall on confirmations like "Allow running cargo build?"); the trade-off is relying on container isolation for safety.
Carlini also mentioned a funny "Claude Code suicide" incident in his blog: Claude accidentally executed pkill -9 bash, killing its own bash process and terminating the loop.
In each loop, Claude is thrown into a fresh Docker container, reads AGENT_PROMPT.md, and runs autonomously until a task is complete. There is no human interaction, no concept of "pre-commit checks"; Claude decides what to do, how to do it, and when to push.
In this mode, the two functions of CLAUDE.md are replaced by other mechanisms:
Operational instructions: handled by AGENT_PROMPT.md (which contains instructions like "break problems into small pieces, track progress, persist until perfect"; not open-sourced, but described in the blog).
Long-term memory: handled by the filesystem maintained by Claude itself.
To enable concurrent execution, a directory current_tasks/ is used as a distributed task lock. This directory serves as the coordination bus for the 16 agents. Each .txt file is a task lock:
current_tasks/fix_arm_asm_caspal_instruction.txt
current_tasks/fix_x86_standalone_kernel_link_errors.txt
current_tasks/fix_macro_param_prefix_substitution.txt
...

The structure of each file is highly consistent. Looking at the content, you'll find that Claude invented a semi-structured format on its own:
Fix ARM assembler: add support for CASP/CASPA/CASPL/CASPAL instructions
[Problem Description]
[Technical Details: Instruction Encoding Format]
Files to modify: [List of specific files]
Started: 2026-02-05
Locked by: [commit hash] at [timestamp]

Although not every file strictly follows this structure, the format exposes several important AI coding practices:
Problem description as specification: Each task file is not just a title; it contains a complete bug analysis. For instance, fix_arm_asm_quad_prel64_relocation.txt lists three interrelated bugs (the parser not decomposing symbol+offset, the ELF writer using the wrong relocation type, a missing Prel64 variant) along with a concrete kernel code example (the .quad directive in the __jump_table section). This lets the next agent picking up the task skip re-analyzing the problem.
Pre-declaration of modification scope: The "Files to modify" list isn't just documentation; it's a signal to other agents: "I'm modifying these files, stay away." With 16 parallel agents, this is far more efficient than handling git conflicts after the fact.
Locking mechanism: Locked by: [commit hash] at [timestamp] is the on-disk form of the file lock Carlini describes in his blog. Note that it uses the git commit hash, not a PID or hostname: since each agent runs in an independent Docker container, PIDs are meaningless across containers.
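The claim-by-creation idea behind these lock files can be sketched locally in Rust. This is a hypothetical illustration, not the project's actual code: the real system arbitrates contention through git push conflicts across containers, while here `create_new(true)` makes "check for the lock file and create it" a single atomic filesystem operation, so exactly one agent wins a given task.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

// Try to claim a task by atomically creating its lock file in current_tasks/.
// create_new(true) fails with AlreadyExists if another agent got there first,
// so the check and the creation cannot race.
fn try_claim_task(dir: &Path, task: &str, owner: &str) -> std::io::Result<bool> {
    let path = dir.join(format!("{task}.txt"));
    match OpenOptions::new().write(true).create_new(true).open(&path) {
        Ok(mut f) => {
            writeln!(f, "Locked by: {owner}")?; // mirrors the "Locked by:" line
            Ok(true)
        }
        Err(e) if e.kind() == std::io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join(format!("current_tasks_demo_{}", std::process::id()));
    std::fs::create_dir_all(&dir)?;
    let task = "fix_arm_asm_caspal_instruction";
    // First agent wins the lock; the second sees the existing file and moves on.
    println!("first claim: {}", try_claim_task(&dir, task, "agent-1")?);
    println!("second claim: {}", try_claim_task(&dir, task, "agent-2")?);
    std::fs::remove_dir_all(&dir)?;
    Ok(())
}
```

The same property is what git provides in the distributed setting: whichever agent pushes the placeholder file first wins, and the loser hits a conflict instead of silently double-claiming.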
Claude also maintains a knowledge base or memory: ideas/.
The ideas/ directory is the most valuable sample of AI coding practice in the entire project. It was created and maintained entirely by Claude, functionally equivalent to buffa's DESIGN.md, but it is dynamically generated at runtime.
buffa itself is introduced later in this article.
ideas/ contains project status tracking (new_projects.txt / new_projects_myasm.txt).
These two files combined exceed 400 lines, tracking compilation test results for over 150 open-source projects. The format is extremely standardized:
redis: PASS (all 4 backends: x86, i686, arm, riscv)
zstd: PASS x86, FAIL arm (runtime crash in fullbench)

new_projects_myasm.txt is even more granular, with four columns per project (x86/i686/arm/riscv), where FAIL entries include specific causes and fix records. For example, the zstd entry records the root cause of the RISC-V failure: "get_expr_type returns U64 for UIntLiteral on 64-bit targets (should be U32), causing the narrow pass to miss LShr narrowing → sign-extended result of mulw fed into 64-bit srl → DeBruijn array out-of-bounds segfault."
This is not a bug report written by a human; it is an investigation record left by Claude itself during the bug-fixing process. Its function is identical to the "Rejected Solutions" chapter in buffa's DESIGN.md: preventing future agents from stepping into the same pit.
Files under ideas/ are categorized by HIGH/MEDIUM/LOW priority:
high_codegen_runtime_perf.txt — Runtime performance bottlenecks (with profiling data)
high_compile_speed_improvements.txt — Compilation speed bottlenecks (with callgrind instruction counts)
high_sema_expansion_typed_ast.txt — Type system refactoring plan (7 steps, 5 completed)
low_structured_error_infrastructure.txt — Error reporting infrastructure

The internal structure of each file is astonishing. Take high_compile_speed_improvements.txt as an example:
Profiled on sqlite3.c (callgrind, 20.2B instructions after fixes).
Previous profile: kernel/softirq.c (2.31B instructions).
FIXED: Peephole loop trampoline O(n*labels) quadratic scan was 19.76%
of total compile time. Pre-built reverse index reduces it to 0.21%.
Overall 22.7% instruction count reduction (26.1B -> 20.2B).

Claude used callgrind for profiling, discovered that an O(n×labels) quadratic scan accounted for 19.76% of total compile time, and after fixing it reduced the overall instruction count by 22.7%. It then recorded the data from both before and after the fix. This is isomorphic to the performance causal analysis in buffa's DESIGN.md.
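The shape of that fix (replace a per-use linear scan with a pre-built reverse index) can be sketched generically. The names and data below are illustrative, not the compiler's actual code:

```rust
use std::collections::HashMap;

// Quadratic version: every branch target is resolved by scanning the whole
// label list, O(n * labels) across a function's instruction stream.
fn resolve_scan(labels: &[&str], targets: &[&str]) -> Vec<usize> {
    targets
        .iter()
        .map(|t| labels.iter().position(|l| l == t).expect("unknown label"))
        .collect()
}

// Indexed version: one pass builds a label -> position map, then each branch
// resolves in O(1). This is the same shape of fix recorded in the ideas/ file.
fn resolve_indexed(labels: &[&str], targets: &[&str]) -> Vec<usize> {
    let index: HashMap<&str, usize> =
        labels.iter().enumerate().map(|(i, l)| (*l, i)).collect();
    targets.iter().map(|t| index[t]).collect()
}

fn main() {
    let labels = [".L0", ".L1", ".L2", ".L3"];
    let targets = [".L2", ".L0", ".L3"];
    // Both strategies agree; only the asymptotic cost differs.
    assert_eq!(resolve_scan(&labels, &targets), resolve_indexed(&labels, &targets));
    println!("resolved: {:?}", resolve_indexed(&labels, &targets));
}
```

The interesting part is less the trick itself than that Claude found it through profiling data rather than guessing, and then preserved the before/after instruction counts as evidence.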
I suspect buffa's AI practices also originated from this project.
docs_verified_2026_01_29.txt is one of the most astonishing files: Claude audited its own documentation, recording inaccuracies found in each README and their fixes:
Accuracy Audit (2026-02-03):
src/frontend/README.md:
- Fixed `$` in identifiers claim (always permitted, not gated by gnu_extensions)
- Fixed Token described as tuple -> struct with named fields
- Fixed Parser::new() signature (takes only Vec<Token>)

This directly echoes Carlini's warning in his blog: "The docs may be wrong and make claims that are false." Claude realized its documentation might be inaccurate, so it built an audit mechanism. Note, however, that this audit was performed by Claude itself, with no external evaluator. That is why Carlini still says "docs may be wrong": self-audit reliability is limited, the same issue as "AI self-evaluation being systematically too lenient" in the harness design blog.
projects/cleanup_code_quality.txt is Claude's code quality improvement tracker. It lists "issues a Rust expert would flag":
Items a Rust expert would flag, roughly ordered by impact:
1. [DONE] GVN pass parameter threading -> GvnState struct
- gvn_dfs had 14 params, process_block had 10 params
2. [DONE] Functions with 10-20+ parameters
- parse_declaration_rest: 22 params -> DeclContext struct
3. [PARTIAL] Visibility: was 836 pub + 18 pub(crate), now improved

Comparing this to the 16 dimensions of buffa's rust-code-reviewer.md (introduced later), Claude here invented a simplified review framework on its own. Its coverage, however, is far narrower than rust-code-reviewer (see the appendix for a detailed interpretation): it focuses only on "structural" issues like too many parameters, overly broad visibility, string allocation, and code nesting, and completely lacks judgment-heavy dimensions like unsafe safety arguments, concurrency safety, and API design taste.
This precisely validates the core finding of Anthropic's official harness design blog: AI self-review lacks the dimensions and depth of external review.
From the claudes-c-compiler project, we can distill the following AI coding practices:
Claude naturally tends toward structure. No one told Claude to write a "Files to modify" list or a "Started" timestamp in the current_tasks/ files; it developed this semi-structured format on its own. This shows that in long-running projects, AI will naturally invent information management structures, but their quality depends on prompt guidance: Carlini's AGENT_PROMPT.md likely contained "track progress" instructions, which led Claude to evolve this filesystem.
Tracking 150+ projects is an extreme form of "testing as navigation." new_projects_myasm.txt tracks compilation results for 150+ projects across 4 architectures, with root-cause analysis and fix records for every FAIL. This is an extreme implementation of Carlini's principle that "test quality determines code quality": Claude treats the entire open-source ecosystem as a test suite.
DESIGN_DOC.md serves as global memory for the project architecture. The DESIGN_DOC.md of claudes-c-compiler contains complete pipeline diagrams, source-tree structures, data-flow descriptions, and a "design philosophy" chapter (written by Claude itself). Compared to buffa's DESIGN.md (detailed in the appendix), however, it lacks two key elements: decision evolution history (why another approach wasn't taken) and benchmark data for rejected solutions. That content is scattered across the ideas/*.txt files but was never integrated into DESIGN_DOC.md.
This indicates that while Claude can write good architecture documents, it does not spontaneously distill decision history into anti-regression knowledge. In buffa, it was a human (McGinniss) who decided to record the evolution from Cell<u32> to AtomicU32 and the failure data of pre-scanning in DESIGN.md. Claude also recorded similar information (fix records under ideas/), but did not elevate them to the level of design documents that "prevent future AI regression."
As for why Rust was chosen over C/C++, there is no explicitly argued official answer, but the fact that 100,000 lines of Rust contain zero unsafe blocks is itself the most powerful one.
If a C compiler were written in C/C++, every pointer operation, array access, and memory allocation in AI-generated code could hide bugs. These bugs wouldn't be caught at compile-time, only discovered upon runtime crashes or erroneous results.
With Rust, the compiler itself is the first line of review. Sixteen parallel AI agents ran unsupervised for two weeks; if the generated code had memory safety issues, cargo build would directly refuse compilation. No human review, no sanitizer, no fuzzer needed. Rust's type system transforms a large class of bugs from "runtime discovery" to "compile-time prevention."
For a fully autonomous scenario with --dangerously-skip-permissions, this isn't a "bonus" but a "prerequisite."
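A minimal, generic example of that compile-time prevention (not code from the project): the commented-out line is a use-after-reallocation bug that C or C++ would compile silently, while rustc rejects it before any test ever runs.

```rust
// The borrow checker as unpersuadable reviewer: mutating a Vec while a
// reference into it is still live is rejected at compile time, because a
// push may reallocate and leave the reference dangling.
fn main() {
    let mut buf = vec![1u8, 2, 3];
    let first = &buf[0]; // shared borrow into buf's heap allocation

    // buf.push(4); // ERROR: cannot borrow `buf` as mutable while `first`
    //              // is still used below; in C this is a silent dangling read

    println!("first byte: {first}");
    buf.push(4); // fine here: the shared borrow ended at its last use above
    assert_eq!(buf, vec![1, 2, 3, 4]);
}
```

For an unsupervised agent, this matters precisely because the error arrives as a hard build failure the loop must fix, not as a crash weeks later in production.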
buffa: AI-led Production-grade protobuf Library
buffa is the protobuf foundation layer for connect-rust (introduced later), with architecture design led by Anthropic security engineer Iain McGinniss and most coding completed by Claude. It fills a gap in the Rust protobuf ecosystem: prost (the de facto standard) is in passive maintenance and doesn't support protobuf editions; Google's official protobuf-v4 supports editions but requires a C compiler. buffa is a pure Rust, no_std-compatible implementation centered on editions, passing the full suite of protobuf binary and JSON conformance tests.
The core innovation is a dual-layer type system. Each protobuf message generates two Rust types: MyMessage (owned, heap-allocated) and MyMessageView<'a> (zero-copy borrow). View mode decoding reaches 1,772 MiB/s, which is 156% faster than prost. OwnedView<V> extends the view's lifetime to 'static via transmute, allowing it to safely cross async boundaries.
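A heavily simplified sketch of what such an owned/view pair looks like in Rust. These are hypothetical types for illustration only; buffa's real generated types and wire-format handling are far more involved:

```rust
// Owned message: heap-allocated, independent of the input buffer.
#[derive(Debug, PartialEq)]
struct Greeting {
    name: String,
}

// View message: borrows directly from the encoded bytes, zero copies.
#[derive(Debug)]
struct GreetingView<'a> {
    name: &'a str,
}

impl<'a> GreetingView<'a> {
    // "Decode" by borrowing a slice of the wire bytes (the format here is
    // plain UTF-8, standing in for real protobuf decoding).
    fn decode(wire: &'a [u8]) -> Option<Self> {
        Some(GreetingView { name: std::str::from_utf8(wire).ok()? })
    }

    // Promote the view to an owned message when it must outlive the buffer,
    // e.g. to cross an async boundary.
    fn to_owned_msg(&self) -> Greeting {
        Greeting { name: self.name.to_string() }
    }
}

fn main() {
    let wire = b"claude".to_vec();
    let view = GreetingView::decode(&wire).unwrap(); // no allocation
    let owned = view.to_owned_msg();                 // allocates, buffer-independent
    drop(wire); // the view's backing buffer is gone; the owned value survives
    // println!("{}", view.name); // would not compile after drop(wire)
    assert_eq!(owned.name, "claude");
}
```

The view type is why decoding can be fast (no copies, no allocations on the hot path), while the owned type exists for the cases where the lifetime discipline would otherwise infect every caller.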
Unlike the C compiler, buffa is explicitly positioned as production-grade code, "running in Anthropic production environments," not just a capability showcase. The difference lies in verification completeness: buffa has an authoritative, comprehensive test suite (Google's protobuf conformance suite), whereas the C compiler's verification, while broad, is incomplete.
Among Anthropic's three open-source Rust projects, buffa is the most worthy of study for AI coding practices because it occupies a unique position: it is the only AI-written code explicitly used in production environments.
claudes-c-compiler is a capability showcase, with the README stating "I do not recommend you use this code." connect-rust has no AI generation claim, with core code written by humans. Only buffa is marked both "Written by Claude ❣️" and "Running in Anthropic production environments."
This means buffa's engineering practices are not academic discussions; they have been validated by production environments. The methodology extracted from buffa is "proven effective," not just "sounds reasonable."
Furthermore, buffa implements a protobuf runtime plus code generator, a problem domain that is complex (the wire format), full of edge cases (unknown fields, unknown enums, recursion), performance-sensitive (the decode loop), demanding in API design (Rust's type system plus ergonomics), and in need of long-term maintenance. This is exactly the type of code where AI is currently most likely to fail. buffa did not fail.
Due to space constraints, I will deconstruct buffa's AI engineering practices in a separate article.
connect-rust: Human-led RPC Framework
connect-rust is the Rust implementation of the ConnectRPC[6] protocol, also developed by McGinniss. Built on the Tower middleware framework, it supports Connect, gRPC, and gRPC-Web protocols simultaneously on the same set of handlers. Anthropic has submitted RFC 007[7], applying for it to become the standard Rust implementation of ConnectRPC.
ConnectRPC is an RPC protocol created by Buf Technologies, currently a CNCF (Cloud Native Computing Foundation) Sandbox project. Its core positioning is as a simplified alternative to gRPC, addressing several pain points of gRPC in practical use:
gRPC's problems: mandatory HTTP/2 (unsupported by much infrastructure), a custom binary framing format (cannot be debugged with curl), an additional grpc-web proxy required for browsers, and error messages encoded as binary protobufs (unreadable).
ConnectRPC's solution: works on both HTTP/1.1 and HTTP/2, unary RPCs are standard POST requests (callable directly with curl), error messages are JSON, browsers are supported natively (no proxy needed), and both the gRPC and gRPC-Web protocols remain compatible (one set of handlers, three protocols).
Official implementations exist for Go, TypeScript, Swift, Kotlin, and Python. Rust was the missing piece.
Current advantages of connect-rust:
Stunning performance data: latency is about half that of tonic (1.95× lower), and throughput under high concurrency (c=256) is 33% higher than tonic (112K vs 84K req/s). CPU profiling shows allocator pressure accounts for only 3.6% of CPU (tonic+prost: 9.6%).
connect-rust passed all 12,800+ conformance tests and fixed multiple security vulnerabilities, including decompression bombs and unbounded TLS handshake timeouts.
The key difference from the previous two projects is that connect-rust has no "Written by Claude" marker. The repository has a .claude/agents directory (containing rust-code-reviewer.md), indicating AI participation in auxiliary tasks like code review, but the core code is human-written.
Although connect-rust is human-led, it also employs AI for code review. Therefore, I have added a detailed interpretation of rust-code-reviewer.md in the appendix.
Production Requirements Exposed by Two Other Forked Rust Projects
Beyond the three original projects, Anthropic also forked two critical infrastructure libraries in the Rust ecosystem and made substantial modifications. These forks are not mere version-pinning insurance; each one fixes a bug or adds a capability that only matters in high-concurrency production environments.
tokio fork: Scheduler Stall Detection
Anthropic's tokio fork has 6 merged PRs. Five are for CI/release infrastructure (Artifactory publishing pipelines, OIDC token authentication, etc.), indicating the forked tokio is published as an internal crate via a private Artifactory repository. The only code modification is:
feat: add scheduler stall detection — Adds asynchronous scheduler stall detection functionality.
In normal operation, tokio's work-stealing scheduler rapidly circulates tasks across multiple worker threads. However, if a task blocks a worker thread (unexpected synchronous IO, holding locks too long, CPU-intensive computation without yield), the entire scheduler's throughput degrades silently. The original tokio lacks a built-in mechanism to detect this.
Considering Anthropic's usage scenarios, where connect-rust decodes 5.6 million logs per second under high concurrency and inference services route across thousands of accelerators, even a few milliseconds of scheduler stall can cause request latency spikes. This feature allows operations teams to identify in real-time which task is stalling the scheduler. It directly echoes the "Observability" dimension (item 16: tracing spans, metrics exposure) in rust-code-reviewer.md (see appendix); scheduler stall detection is essentially runtime-level observability.
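In plain std Rust (no tokio), the idea behind stall detection can be sketched with a heartbeat counter and a watchdog thread. This is an illustrative analogy, not the fork's implementation: a real async scheduler would track per-worker poll progress, but the detection principle (a progress counter that stops advancing) is the same.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// A worker bumps a heartbeat each time it finishes a unit of work; a watchdog
// samples it and flags a stall when the heartbeat stops advancing for longer
// than `stall_after`. Returns whether a stall was ever observed.
fn run_with_watchdog(stall_after: Duration) -> bool {
    let heartbeat = Arc::new(AtomicU64::new(0));

    let hb = Arc::clone(&heartbeat);
    let worker = thread::spawn(move || {
        for _ in 0..5 {
            thread::sleep(Duration::from_millis(5)); // well-behaved, fast tasks
            hb.fetch_add(1, Ordering::Relaxed);
        }
        // One task that blocks the worker thread: the failure mode a stall
        // detector exists to surface (sync IO, long lock hold, CPU spin).
        thread::sleep(Duration::from_millis(400));
        hb.fetch_add(1, Ordering::Relaxed);
    });

    let mut last = heartbeat.load(Ordering::Relaxed);
    let mut last_change = Instant::now();
    let mut stalled = false;
    while !worker.is_finished() {
        thread::sleep(Duration::from_millis(10));
        let now = heartbeat.load(Ordering::Relaxed);
        if now != last {
            last = now;
            last_change = Instant::now();
        } else if last_change.elapsed() > stall_after {
            stalled = true; // in production: also record which task/worker is stuck
        }
    }
    worker.join().unwrap();
    stalled
}

fn main() {
    println!("stall detected: {}", run_with_watchdog(Duration::from_millis(100)));
}
```

The hard part in a real runtime is attribution (which task is hogging the worker), which is why this lands at the tokio level rather than in application code.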
moka fork: TOCTOU Race Fix in Concurrent Cache
moka is the most popular high-performance concurrent cache library in the Rust ecosystem, comparable to Java's Caffeine. Anthropic forked it and submitted a PR fixing a severe concurrency bug:
fix: TOCTOU race in and_compute_with under multi-threaded tokio
The issue lies in the and_compute_with method—this API's semantics are "return if found, compute and cache if not," which is the core operation of concurrent caching. In try_compute and try_compute_if_nobody_else, the waiter was removed from the waiter map before the cache was actually written, creating a TOCTOU (Time-of-Check to Time-of-Use) window: "concurrent callers can insert their own waiter → read old value → both callers execute writes based on stale data." The consequence is silent data loss under multi-threaded tokio runtimes; in counter tests with 8 concurrent writers, 10-15% of increments were lost.
The fix delays waiter removal until after the cache write is complete, ensuring the waiter blocks concurrent callers during the update.
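The essence of the fix (keep the synchronization in place across the entire read-compute-write sequence) can be sketched with a plain Mutex-guarded map. This is an analogy to moka's waiter mechanism, not its actual code; moka uses per-entry waiters rather than one global lock, but the invariant being restored is the same:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Compute-and-store while the lock is held: no concurrent caller can read a
// stale value between our read and our write, which is exactly the TOCTOU
// window the moka fix closes by delaying waiter removal.
fn and_compute_with(
    cache: &Mutex<HashMap<String, u64>>,
    key: &str,
    f: impl FnOnce(Option<u64>) -> u64,
) -> u64 {
    let mut map = cache.lock().unwrap(); // held across read + compute + write
    let new = f(map.get(key).copied());
    map.insert(key.to_string(), new);
    new
}

fn main() {
    let cache = Arc::new(Mutex::new(HashMap::new()));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                for _ in 0..1000 {
                    and_compute_with(&cache, "counter", |old| old.unwrap_or(0) + 1);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // With the buggy remove-waiter-before-write ordering, some of these
    // 8 * 1000 increments would be silently lost.
    println!("final count: {}", cache.lock().unwrap()["counter"]);
}
```

The buggy ordering is the classic check-then-act race: releasing the synchronization after the read but before the write lets two callers both compute from the same old value, and one write overwrites the other.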
What Do the Forks Expose? Some Inferences
Anthropic's fork of moka indicates extensive use of concurrent caching in its internal services. Combined with the tech stack, the most likely usage scenarios include:
Hot-path caching for inference services. The biggest performance optimization in LLM inference is KV cache reuse (sharing attention results for common prompt prefixes across requests). The index table tracking "which prefixes are computed on which GPU" requires extremely high-concurrency reads and writes. moka's and_compute_with (the very method with the fixed TOCTOU) follows exactly the "return if found, compute and cache if not" pattern. Model routing tables (which request goes to which accelerator cluster) are also a typical read-heavy, write-light cache scenario.
Token counting and rate limiting. Request counting and quota tracking per user or API key: every request increments a counter. A TOCTOU bug in which "8 concurrent writers lose 10-15% of increments" means users can exceed limits in rate-limiting scenarios, a direct revenue loss.
Authentication token caching. Caching validation results for JWT/OIDC tokens. Validation requires cryptographic operations (signature verification), and results can be cached until expiration. Combined with the zeroize requirement in rust-code-reviewer's security dimension (secure zeroing of sensitive data), token caches need secure memory clearing upon expiration.
Schema caching at the RPC layer. buffa's JSON registry and Any registry look up message descriptors by type URL, a lookup that occurs millions of times per second in high-throughput RPC.
The TOCTOU bug itself is also telling. A 10-15% loss rate might be negligible under low concurrency, but at Anthropic's scale (millions of users, thousands of accelerators), it means counts or states for a vast number of requests are silently lost every second. This bug must have been discovered by observing data inconsistencies in production (inaccurate rate limiting, anomalous cache hit rates) and then constructing a minimal reproduction.
Taken together, the two forks paint a clear picture: Anthropic is not "experimenting with Rust"; they are running Rust production services in deep water, encountering runtime-level issues only exposed by high-concurrency production environments, and possessing the capability to fix them at the tokio/moka level.
Differences in AI Participation Levels and Their Implications
The comparison of the three original projects reveals Anthropic's judgment on the applicable boundaries of AI coding:
C Compiler is suitable for 100% AI. It has mature test infrastructure (GCC torture test, numerous open-source projects as validation targets), tasks are highly parallelizable (different tests, files, backends), but code quality does not require production-grade standards.
buffa is suitable for AI-led. It has complete specifications (protobuf wire format spec), automated verification (conformance test suite), is highly localized (encoding an int32 doesn't require understanding the whole system), and has a limited decision space. Humans are responsible for architecture design (DESIGN.md) and taste judgments.
connect-rust requires human leadership. It faces combinatorial explosion in protocol interactions (Connect × gRPC × gRPC-Web × HTTP/1.1 × HTTP/2), cross-library integration complexity (Tower × hyper × rustls × tokio), security judgments requiring attacker mindset, and API design being a matter of taste rather than correctness.
Commonality: All five projects chose Rust. For AI coding, Rust's compiler acts as the "zeroth line of review." The type checker cannot be persuaded, is not lenient, and doesn't think "close enough is fine." The zero unsafe blocks in claudes-c-compiler prove this. Meanwhile, the two forks (tokio's scheduler stall detection, moka's concurrency race fix) prove that Anthropic's Rust services have entered the deep water requiring runtime-level customization.
III. Core Insights from the Official Harness Design Engineering Blog
Prithvi Rajasekaran (Labs team, published March 2026) addressed a question not directly touched upon in the C compiler and buffa projects: How does AI evaluate the quality of its own work?
Why Naive Approaches Fail
The blog begins by describing an experiment where a single Claude agent was tasked with completing a full application independently. The result: the application interface looked okay, but core functions didn't work at all. For example, in a 2D retro game maker, the sprite editor worked, but during actual gameplay, characters didn't respond to any input; the wiring between entity definitions and the game runtime was broken, with no UI hints indicating the problem.
The problem exists on two levels. First is context degradation: as the context window fills up, the model's coherence declines. Some models (like Sonnet 4.5) also exhibit "context anxiety," wrapping up work prematurely because they think they are nearing the context limit. The second, deeper problem is systematic bias in self-evaluation. When asked to evaluate code it just generated, AI confidently praises work that humans see as clearly mediocre. This bias is worst in subjective tasks (frontend design), but even in tasks with verifiable results (functional correctness), AI will "convince itself" that a bug isn't severe.
This forms an interesting contrast with the buffa project. buffa validates correctness via Google's protobuf conformance test suite, an external, independent, authoritative test suite, with no involvement from AI self-evaluation. claudes-c-compiler uses GCC torture tests and 150+ open-source projects for validation, similarly bypassing AI self-evaluation. However, in domains without such authoritative external test suites (like "does this frontend design look good?"), self-evaluation bias becomes a core obstacle.
Engineering Application of GAN Structures
Rajasekaran borrowed the core idea from GAN (Generative Adversarial Network): quality improvement driven by the adversarial relationship between Generator and Discriminator.
GAN (Generative Adversarial Network), proposed by Ian Goodfellow in 2014, is a deep learning architecture. Its core idea is extremely simple: let two neural networks compete against each other, one faking and one verifying, improving together through the adversarial process. The Generator aims to produce fake data (e.g., fake images) that is as realistic as possible: starting from random noise, it learns to output things that look like "real data." The Discriminator aims to distinguish real data from the Generator's fakes: it receives an image and outputs a probability, "is this real or fake?" If the Generator is a counterfeiter, the Discriminator is the bill verifier. As the counterfeits get better, the verifier gets sharper, ultimately pushing the counterfeiter to the extreme.
He designed a three-agent system:
Planner: Expands the user's 1-4 sentences into a full product specification. This agent is deliberately constrained to "only write product context and high-level technical design, no specific implementation details." Reasoning: if the planner tries to specify fine-grained technical details upfront, then once a single detail is wrong, the error cascades down the pipeline to all downstream implementations. Constrain the deliverables; let the executor find the path. This shares the same philosophy as buffa's design principle ("descriptor-centric": give codegen only a structured input and let it decide how to generate Rust code).
Generator: Implements features sprint by sprint, using the React + Vite + FastAPI + SQLite stack, with git for version control. Before each sprint, the generator and evaluator negotiate a sprint contract, agreeing on what to do in this round and what counts as completion. The reason for this contract is that the product spec is deliberately kept high-level (to avoid planner detail errors), requiring an intermediate step to bridge "user stories" and "testable implementations." The generator proposes a solution, the evaluator reviews if it aligns with the spec, they iterate until consensus, and only then does the generator start coding.
This sprint contract mechanism is similar to buffa's CLAUDE.md. CLAUDE.md mandates "regenerate types if codegen changes" and "run reviewer agent before commit," essentially embedding mandatory checkpoints before or after code writing. The difference is that buffa's checkpoints are pre-defined by humans, while the harness's contract is dynamically negotiated by two agents.
Evaluator: This is the most critical role in the entire architecture. The evaluator interacts with the running application like a real user via Playwright MCP: navigating pages, taking screenshots, clicking buttons, filling forms, checking API responses and database states. It does not score based on static code or screenshots but actually operates the application before judging. After each sprint, the evaluator scores item by item against pre-defined criteria (detailed later); if any item falls below the threshold, the sprint fails, and the generator receives specific feedback to redo the work.
The idea of the Evaluator using Playwright for end-to-end testing is essentially the same pattern as Carlini using GCC as a "known correct oracle" to verify compilation results in the C compiler project: using external, objective verification to replace AI self-evaluation. buffa uses Google conformance tests, the C compiler uses GCC + 150 open-source projects, and harness design uses Playwright end-to-end interaction. Different media, same principle.
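The sprint-contract negotiation between generator and evaluator can be sketched as a small loop. Everything here is invented for illustration; the blog describes the mechanism, not a schema, so `SprintContract`, `negotiate`, and the bounded round budget are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SprintContract:
    """Illustrative schema for a generator/evaluator sprint contract."""
    sprint_id: int
    goals: list           # what the generator commits to build this round
    done_criteria: list   # what the evaluator will check before passing

def negotiate(proposal, review, max_rounds=5):
    """Iterate until the evaluator accepts; `review` stands in for the
    evaluator agent and returns a list of objections (empty == consensus)."""
    contract = proposal
    for _ in range(max_rounds):
        objections = review(contract)
        if not objections:
            return contract
        # A real harness would revise via the model; here we fold the
        # objections into the completion criteria to model convergence.
        contract = SprintContract(
            contract.sprint_id,
            contract.goals,
            contract.done_criteria + objections,
        )
    raise RuntimeError("no consensus within the round budget")

# Example: the evaluator insists on one extra completion criterion.
contract = negotiate(
    SprintContract(1, ["task list UI"], ["tasks render"]),
    lambda c: [] if "tasks persist after reload" in c.done_criteria
    else ["tasks persist after reload"],
)
```

The key property is that coding starts only after the loop exits: the contract, not the spec, defines "done" for the sprint.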
Regarding the engineering application of adversarial thinking in GAN structures, I summarize three directions:
1. Separation of generation, review, and verification into distinct roles.
2. Cross-review by different models, e.g., Claude Code and Codex.
3. Verification against external real-world tests.
These three directions correspond to three different sources of "discriminators," forming an adversarial gradient from weak to strong.
Direction 1: Internal Adversary Generated by Role Separation
The same model (or even models from the same vendor), split into different roles, permissions, and prompts. The adversarial pressure comes from the structural separation of responsibilities.
This is the direction with the most evidence in Anthropic's practice. buffa's rust-code-reviewer (Opus, read-only permissions, 16 dimensions) reviews code written by Claude Sonnet. Harness Design's Generator-Evaluator loop. claudes-c-compiler's "coding agent" vs "code quality agent" vs "deduplication agent."
Adversarial intensity depends on three design parameters:
Model Hierarchy Gap: buffa uses Sonnet for writing and Opus for reviewing. The reviewer is "smarter" than the generator, creating a real gap in judgment. If the same model does both generation and review, the adversarial pressure weakens. The Harness Design blog found that evaluators are, by default, still lenient toward LLM output, requiring extra calibration.
Permission Asymmetry: The reviewer only has Read/Glob/Grep permissions and cannot modify code. This constraint looks like a limitation but is actually a guarantee of the adversarial structure. If the reviewer could modify code, it would tend to "just fix it and pass" rather than strictly reject.
Externalization of Evaluation Criteria: 16 review dimensions, four-dimensional frontend scoring system, sprint contracts. Standards are not in the agent's brain but in files, making them auditable, calibratable, and iterable.
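The permission asymmetry above can be made concrete with a small gating sketch. The agent configs and the `dispatch` function are hypothetical, loosely mirroring the Sonnet-writes/Opus-reviews split described in the text:

```python
READ_ONLY = {"Read", "Glob", "Grep"}

# Hypothetical configs mirroring the asymmetry: the reviewer gets the
# stronger model but a strictly read-only tool set.
GENERATOR = {"model": "sonnet", "tools": {"Read", "Write", "Edit", "Bash"}}
REVIEWER = {"model": "opus", "tools": READ_ONLY}

def dispatch(agent, tool, *args):
    """Gate every tool call: a reviewer without Write/Edit cannot
    'just fix it and pass'; it can only accept or reject."""
    if tool not in agent["tools"]:
        raise PermissionError(f"{agent['model']} may not use {tool}")
    return ("ok", tool, args)  # real tool execution would happen here
```

The constraint is enforced by the harness, not by the prompt: even a reviewer that wants to patch the code has no tool with which to do it.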
The limitation of this direction is that the ceiling of the adversary is the model's own judgment capability. No matter how the adversary between two Claudes is designed, it cannot exceed Claude's depth of understanding of the problem. The Harness Design blog admits that even after calibration, deep bugs remain undetected.
Direction 2: Heterogeneous Adversary Generated by Cross-Model Cross-Review
Models from different vendors have different training data, alignment methods, and blind spot distributions. Using Claude to write and GPT to review (or vice versa), the adversary comes from non-overlapping cognitive blind spots.
This is a direction not directly practiced in Anthropic's public materials; their projects exclusively use their own models. However, from an adversarial theory perspective, this direction theoretically has higher adversarial intensity:
Claude might be systematically lenient on certain patterns (e.g., naturally familiar with its own generated code style); switching to another model for review could break this "homogeneous bias." Just as in human teams, engineers from the same school tend to share blind spots, while introducing people from different backgrounds surfaces more issues.
The difficulty in practice is standard alignment. buffa's rust-code-reviewer.md is written for Claude; the priority sorting of 16 dimensions, output format requirements, and few-shot calibration are all tuned for Claude's behavioral characteristics. Switching to Codex or Gemini to read the same prompt may not yield better review quality, as their response patterns to prompts differ.
However, this direction has a unique advantage: preventing single points of failure. If the entire toolchain relies on a single model, that model's systemic defects become undiscoverable. The code quality of claudes-c-compiler was described by Carlini as "reasonable but far from expert-level"; using a model with different preferences for Rust idioms for review might reveal patterns Claude systematically ignored.
Direction 3: Absolute Adversary Generated by External Real Verification
The discriminator is not another AI, but reality itself. Code either passes the test or it doesn't, either compiles the Linux kernel or it doesn't, either Playwright can click through the flow or it can't.
This is the direction with the highest adversarial intensity among the three, because reality cannot be persuaded, cannot be calibrated, and has no leniency bias.
In Anthropic's practice, this direction has four specific forms:
Authoritative Test Suites: buffa uses Google protobuf conformance tests, claudes-c-compiler uses GCC torture tests (99% pass rate). These tests are maintained by third parties and cover boundary conditions that both humans and AI might miss.
Real Project Compilation: The C compiler uses 150+ open-source projects (SQLite, Redis, PostgreSQL, FFmpeg, Linux kernel) as integration tests. Each project represents the sum of countless implicit constraints in hundreds of thousands of lines of code, more unpredictable than any artificially designed test suite.
End-to-End User Interaction: Harness Design's Evaluator actually operates the application via Playwright MCP. It doesn't check code logic but checks "can the user complete the task?"
Comparison with Known Correct Implementations: The C compiler uses GCC as an oracle. Randomly select files to compile with GCC and with Claude, then compare results. This is the purest adversary—no need to define what is "correct," just compare with a known correct implementation.
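The oracle-comparison pattern can be sketched as generic differential testing. `oracle_compile`, `candidate_compile`, and `run` are injected stand-ins for illustration, not Anthropic's actual harness; the toy demo "compiles" arithmetic strings by evaluating them:

```python
import random

def differential_check(source_files, oracle_compile, candidate_compile,
                       run, sample_size=3, seed=0):
    """For a random sample of files, build with both toolchains, run
    both artifacts, and report any observable divergence."""
    rng = random.Random(seed)
    picked = rng.sample(source_files, min(sample_size, len(source_files)))
    mismatches = []
    for f in picked:
        expected = run(oracle_compile(f))    # known-correct reference
        actual = run(candidate_compile(f))   # implementation under test
        if expected != actual:
            mismatches.append((f, expected, actual))
    return mismatches

# Toy demo: "source files" are arithmetic programs, "compilation" just
# evaluates them, and the candidate has a seeded bug on one input.
files = ["1+1", "2*3", "10-4"]
oracle = lambda src: ("bin", eval(src))
buggy = lambda src: ("bin", eval(src) + (1 if src == "2*3" else 0))
run = lambda binary: binary[1]
bad = differential_check(files, oracle, buggy, run, sample_size=3)
```

Note that the check never defines correctness itself; it only demands agreement with the reference, which is exactly what makes the oracle "the purest adversary."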
The limitation of this direction is that not all quality dimensions have external verification. "Can the code run?" has external verification (tests), but "Is the API design good?" does not (taste). "Is the frontend good-looking?" is partially verifiable (Playwright can verify usability) and partially not (aesthetics are subjective). This is why the three directions need to complement each other.
Complementary Relationship of the Three Directions
The three directions are not substitutes but cover different quality dimensions:
buffa uses both Direction 1 and Direction 3: reviewer agent for role separation review (Direction 1), conformance test for external verification (Direction 3). The C compiler mainly relies on Direction 3 (GCC oracle + 150 projects), supplemented by Direction 1 (code quality agent). Harness Design uses all three: Generator/Evaluator separation (Direction 1), Playwright end-to-end testing (Direction 3), but no cross-model (Direction 2).
The most complete practice should be a three-layer superposition: external test suites guarantee the floor of functional correctness, cross-model review catches the systemic blind spots of a single model, and same-model role separation handles subjective dimensions like taste and style. Anthropic's public practice currently covers Directions 1 and 3; Direction 2 remains a notable gap.
Anatomy of Self-Evaluation Bias
The blog's analysis of self-evaluation bias is worth expanding on. Simply separating "doing" and "evaluating" into different agents is not enough; the separated evaluator is still an LLM, lenient by default toward LLM-generated output. Rajasekaran found he needed to repeatedly read the evaluator's logs, find where the evaluator's judgment deviated from his expectations, and then update the evaluator's prompt to correct these biases. This calibration loop went through several rounds before the evaluator's judgment reached a level he considered reasonable.
Even so, the harness's final output still had minor layout issues, unintuitive interactions, and deep functional bugs that weren't fully tested. The evaluator has an upper limit to its capability—it is most valuable on tasks near the model's capability boundary, but for problems beyond the model's understanding (like "does this game level design make sense?"), it is as powerless as the generator.
This forms a direct contrast with buffa's rust-code-reviewer.md. The reviewer agent uses model: opus (strongest model for review), tools: Read, Glob, Grep (read-only permissions), and 16 clear review dimensions with a structured output format (Executive Summary → Findings by Category → Critical Issues → Recommended Improvements). buffa's review framework is human-predefined, static, and has clear standards; the harness's evaluator is dynamic, based on runtime interaction, and requires iterative calibration. The two complement each other: buffa's mode suits scenarios where "code quality can be judged by static review," while the harness's mode suits scenarios where "quality must be verified by actual use."
claudes-c-compiler's approach is a third variant: the code quality agent (cleanup_code_quality.txt) is a simplified review framework invented by Claude itself, covering only structural issues like too many parameters and overly broad visibility, lacking dimensions requiring deep judgment like unsafe safety arguments and API taste. This validates the harness blog's core argument: the dimensions and depth of AI self-review are naturally inferior to external review.
Frontend Design Scoring: Grading Subjective Quality
Before applying to full-stack development, Rajasekaran experimented in the frontend design domain. He defined four scoring dimensions for both generator and evaluator:
- Design Quality: do color, typography, layout, and imagery combine into a whole with a unified mood and identity?
- Originality: are there custom design decisions, or is it template layouts, library defaults, AI-generated patterns? Explicitly penalizes "AI slop" like purple gradients + white cards.
- Craft: typography hierarchy, spacing consistency, color harmony, contrast; checks technical execution.
- Functionality: usability assessment independent of aesthetics.
The key strategy is to weight Design Quality and Originality higher and de-emphasize Craft and Functionality. Claude is, by default, already competent at Craft and Functionality; technical execution is innate to the model. But on Design Quality and Originality, Claude tends to produce mediocre but "safe" output. This is the same logic as buffa's rust-code-reviewer.md ranking API Design first and Unsafe eighth: it makes no sense to add pressure on dimensions where the model is already strong; pressure should be applied where the model is prone to errors.
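This weighting strategy, combined with the "any item below threshold fails the sprint" rule from the evaluator description earlier, can be sketched as follows. The weights and the 0-10 scale are invented for illustration; the blog states the strategy, not the numbers:

```python
# Invented weights following the strategy above: press hardest where the
# model defaults to safe mediocrity, lightly where it is already strong.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}

def sprint_verdict(scores, floor=6.0):
    """Weighted score on a 0-10 scale, plus a hard per-dimension floor:
    any single dimension below `floor` fails the sprint regardless of
    the average, matching the 'any item below threshold fails' rule."""
    weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    failed = [d for d in WEIGHTS if scores[d] < floor]
    return (not failed, weighted, failed)

# A polished-but-derivative design fails on originality alone,
# even though its weighted average is comfortably high.
ok, score, failed = sprint_verdict(
    {"design_quality": 9, "originality": 4, "craft": 9, "functionality": 9})
```

The hard floor is what keeps the weighted average from papering over exactly the dimension the weights were meant to protect.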
An unexpected finding: the wording of scoring criteria directly shapes the generated output. Including phrasing like "the best designs are museum-level" in the criteria pushed the design to converge towards a specific visual direction. This means scoring criteria are not just evaluation tools; showing the same criteria to both generator and evaluator applies the same taste pressure on both generation and evaluation directions. This reminds one of buffa's DESIGN.md being read by both Claude Code (coding agent) and rust-code-reviewer (review agent); design principles guide both how code is written and define the standards for review.
In a case study of a Dutch art museum website, after 9 iterations, the generator produced a clean dark-themed landing page, visually polished but within expectations. Then in the 10th round, it completely overturned the previous scheme, reimagining the website as a spatial experience: CSS perspective-rendered checkered floors, paintings hanging freely on walls, navigating gallery rooms through doorways instead of scrolling or clicking. This creative leap was something Rajasekaran had never seen in single-round generation before; it came from the continuous accumulation of adversarial pressure.
Harness Evolution from Opus 4.5 to 4.6
The blog records an important evolution process. The initial complete harness was designed for Opus 4.5, including sprint decomposition, context reset, and QA per sprint. When Opus 4.6 was released, Rajasekaran removed components one by one to test which were still bearing weight.
Context Reset vs Compaction: This is a key distinction. Compaction summarizes early dialogue, with the same agent continuing work. Context Reset completely clears the context, starts a new agent, and passes state via structured files. Sonnet 4.5's "context anxiety" (model thinking it's near the limit and wrapping up sloppily) was severe enough that compaction wasn't enough; reset was necessary to give the agent a clean slate. Opus 4.6 basically eliminated this issue, able to work continuously for over two hours without reset.
claudes-c-compiler took a more extreme path: every session was a complete reset, with no compaction or context carry-over across 2,000 sessions. Claude compensated for the lack of session history by writing current_tasks/ and ideas/ in the filesystem, serializing the reasoning process that should exist in dialogue history into the git repository. This is the purest form of "context reset + filesystem handoff."
Cancellation of Sprint Decomposition: Opus 4.6's capability improvement made sprint decomposition unnecessary. The model can work autonomously for over two hours without losing coherence, no longer needing external task chunking.
Change in Evaluator's Role: After sprints were cancelled, the evaluator switched to doing one pass after the entire build. This changes the evaluator's load: for tasks the model can handle independently, the evaluator becomes unnecessary overhead; but for parts still at the model's capability boundary, the evaluator continues to provide real quality improvements.
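The context-reset-plus-filesystem-handoff pattern above reduces to "serialize state at session end, reconstruct it at session start." A minimal sketch follows; the directory name echoes the blog's current_tasks/, but the JSON layout is invented (the blog names the directories, not the format):

```python
import json
import tempfile
from pathlib import Path

def end_session(workdir, state):
    """Serialize what would otherwise live only in dialogue history
    into the repository, so the next session can recover it."""
    tasks = workdir / "current_tasks"
    tasks.mkdir(parents=True, exist_ok=True)
    (tasks / "state.json").write_text(json.dumps(state))

def start_session(workdir):
    """A fresh agent starts with an empty context and reconstructs its
    working state from files, not from prior conversation."""
    path = workdir / "current_tasks" / "state.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"tasks": [], "done": []}

work = Path(tempfile.mkdtemp())
end_session(work, {"tasks": ["fix codegen for oneof"], "done": ["scalars"]})
resumed = start_session(work)  # a brand-new "session" picks up the state
```

In the C compiler's extreme variant, this handoff is the only memory: 2,000 sessions, zero carried context, everything serialized into the git repository.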
The blog thus derives a universal principle: every component of a harness encodes assumptions about "what the model cannot do," and these assumptions become obsolete with model upgrades. When a new model is released, the correct approach is to strip away components that no longer bear weight one by one, while adding new components to achieve previously impossible higher capabilities. As the blog concludes: "The interesting harness combination space does not shrink as models improve. It shifts."
This principle also retrospectively explains the architectural differences between buffa and the C compiler: buffa (Opus 4.6 era) doesn't need context reset and sprint decomposition, with CLAUDE.md having only two concise rules; the C compiler (also Opus 4.6) used extreme per-session full resets but didn't need an evaluator because it had GCC torture tests and 150 open-source projects as external verification. Each project's harness shape is different because the assumptions they encode about "what the model cannot do" are different.
Actual Costs and Output
The blog provides direct cost comparison data:
A 20x cost gap bought a qualitative change in "whether core functions work"—not a linear improvement, but crossing the usability threshold.
The runtime details of the simplified harness's DAW (Digital Audio Workstation) are also noteworthy. The time distribution for three agents is: Planner 4.7 mins ($0.46) → First Build 2h 7m ($71) → First QA 8.8m ($3.24) → Second Build 1h 2m ($37) → Second QA 6.8m ($3.09) → Third Build 10.9m ($5.88) → Third QA 9.6m ($4.06).
Build accounts for the vast majority of costs; QA is cheap but caught real issues every round. The first round found that multiple core DAW functions were display-only (visible but not operable); the second round found audio recording was still a stub, and clips couldn't be dragged or split. Without external checks, the generator tends to "finish the surface" and skip interaction depth, the same failure mode as the single-agent game maker that looked usable but wasn't playable.
Compared to the cost structure of claudes-c-compiler (two weeks, ~$20,000), harness design runs are much cheaper per run, but the complexity of the output differs: one is a 100k-line compiler, the other a fully functional web app. The meaningful commonality is that in both schemes, human time is spent not on writing code but on designing the harness (test environments, scoring criteria, agent architecture).
IV. Synthesis: Engineering Practices for AI Coding
The following practices are distilled from five projects (C compiler, buffa, connect-rust, tokio fork, moka fork) and two blogs (C compiler agent team, harness design).
DESIGN.md is AI's Long-Term Memory
buffa's DESIGN.md is the file with the highest information density in the entire project.
CLAUDE.md states "See DESIGN.md for the architectural overview," meaning Claude Code reads it every time a new session starts.
One might wonder, why isn't this file named Arch.md?
**Arch.md implies "what the system looks like":** module division, data flow, interface definitions. These are static and descriptive. **DESIGN.md implies "why the system looks this way":** decision processes, rejected solutions, trade-offs. These are dynamic and argumentative.
In buffa's DESIGN.md, the actual architecture description (module boundaries, data flow diagrams, type mappings) accounts for only about 35%. The remaining 65% consists of five types of functional content.
The latter two, Decision History and Rejected Solutions, are added specifically for AI collaboration. They answer "why the code isn't different," which is exactly where AI is most prone to errors.
buffa is human-machine collaborative; humans write the decision history into the document, upgrading the document's nature from "architecture description" to "design argumentation."
In the C compiler project, Carlini had Claude maintain README and progress files itself to solve the same problem. But the quality of such self-maintained documentation is obviously inferior to human-prewritten DESIGN.md. This also explains why buffa is production-grade code, while the C compiler's code quality is "far from expert-level."
For a detailed paragraph-by-paragraph interpretation, see Appendix A: buffa-design-annotated.md.
Reviewer and Executor Must Be Separated
This is the most consistent finding across the three information sources:
buffa/connect-rust uses rust-code-reviewer.md (model: opus, tools: Read, Glob, Grep—read-only permissions) for review; the reviewer cannot modify code.
C Compiler uses specialized agent roles for code quality review and deduplication, separated from the coding agent.
Harness Design found AI is systematically too lenient when self-evaluating, necessitating an independent evaluator. Tuning an evaluator to be picky is far easier than making a generator self-critical.
Three sources independently point to the same conclusion: doing and evaluating cannot be the same agent. This is the natural expression of GAN adversarial structure in engineering practice.
The 16 review dimensions of rust-code-reviewer are also noteworthy: API Design ranks first (usability above all), Unsafe ranks eighth (safety is a result of good design, not an independent goal), and Security includes timing-safe and zeroize (exposing the need to handle sensitive data).
For a detailed paragraph-by-paragraph interpretation, see Appendix B: rust-code-reviewer-annotated.md.
Test Quality Determines Code Quality
The C compiler blog puts it most directly: Claude will autonomously go solve the problem you give it, so the task validator must be nearly perfect, otherwise Claude will solve the wrong problem.
The Harness Design blog uses Playwright MCP for end-to-end validation, with the evaluator navigating pages, taking screenshots, clicking buttons like a real user. A single sprint (Sprint 3) had 27 test criteria.
buffa relies on Google's protobuf conformance test suite as the "source of truth."
The common pattern among the three is: human core work shifts from "writing code" to "writing tests and designing verification environments." Tests are not just acceptance tools; they are the AI's navigation system.
But tests must be written for AI, not for humans:
- Do not print thousands of lines of useless output (context pollution).
- Use an ERROR prefix on error messages for easy grepping.
- Pre-calculate aggregate statistics so the AI doesn't have to analyze raw results by hand.
- Add a --fast option for random sampling (solving AI's "time blindness").
- Make test names themselves specifications (explicit_presence_with_zero_value).
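A test runner following these conventions might look like the sketch below. The runner, case format, and sampling ratio are all invented, but each convention (greppable ERROR prefix, pre-aggregated summary, --fast-style sampling, spec-like test names) appears in it:

```python
import random

def run_suite(cases, check, fast=False, seed=0):
    """Each case is (spec_like_name, input, expected). Failures emit one
    greppable ERROR line each; a SUMMARY line carries pre-aggregated
    stats so the agent never has to count for itself."""
    if fast:  # --fast style sampling against "time blindness"
        cases = random.Random(seed).sample(cases, max(1, len(cases) // 10))
    lines, failures = [], 0
    for name, inp, expected in cases:
        got = check(inp)
        if got != expected:
            failures += 1
            lines.append(f"ERROR {name}: expected {expected!r}, got {got!r}")
    lines.append(f"SUMMARY total={len(cases)} failed={failures}")
    return "\n".join(lines)

# Buggy stand-in implementation: treats every zero as "present".
cases = [
    ("explicit_presence_with_zero_value", 0, "present"),
    ("implicit_presence_with_zero_value", 0, "absent"),
]
report = run_suite(cases, lambda v: "present")
```

The whole report for a failing run is two lines; an agent can grep ERROR, read the name, and know the violated spec without opening a single test file.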
Constrain AI Behavior with Infrastructure
All projects demonstrate the same pattern: constrain AI with structure and tools, not by preaching via prompts.
Record Rejected Solutions
The section "Rejected: Pre-scan capacity reservation for view Vecs" in buffa's DESIGN.md fully records benchmark results (20-97% performance regression) and failure reasons for two pre-scan solutions. docs/investigations/e0477-owned-view-send/ records the complete investigation process.
This is the most overlooked yet most valuable documentation practice: preventing AI from regressing to disproven solutions, one of the most common quality failures in AI coding. Seeing Vec growth overhead in code, AI's first reaction is "pre-allocate." Without a record of rejected solutions, every new AI session might re-propose this already disproven "optimization."
Three Strategies for Context Management
Three projects and two blogs showcase three different context management strategies:
Static Document Injection (buffa). Humans pre-write DESIGN.md, read by Claude Code every session. Highest quality, but requires upfront human investment.
AI-Maintained Runtime Documentation (C Compiler). Prompt instructs Claude to frequently update README and progress files. Suitable for autonomous scenarios, but documentation quality is inferior to human-written.
Context Reset + Structured Handoff (Harness Design). Clear context to start a new agent, passing state via sprint contract and scoring reports. Opus 4.5 strongly needed this (had "context anxiety"); Opus 4.6 basically doesn't.
Which strategy to choose depends on task type: static documents for human-machine collaboration, AI self-maintenance for full autonomy, reset + handoff for long-running tasks. But an important finding from the Harness Design blog is: the necessity of these strategies changes with model upgrades. Each component encodes assumptions about "what the model cannot do," which should be re-evaluated when new models are released.
Parallelism Key is "Task Divisibility"
The C compiler blog provides the most profound parallelization experience:
When a test suite has hundreds of independent failing tests, parallelism is simple. Each agent picks a different failing test to fix. But when 16 agents compile the Linux kernel, they all get stuck on the same bug, overwriting each other.
The solution is to create parallelizable subtasks using a known correct implementation. Use GCC to randomly compile most kernel files, letting Claude handle only the remaining files. This allows each agent to fix different bugs in different files.
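The file-routing step can be sketched as a deterministic partition. The hashing scheme below is invented; any stable split would do, and the point is only that the agent slice is disjoint and reproducible so parallel agents hit different bugs in different files:

```python
import hashlib

def split_by_oracle(files, agent_share=0.2):
    """Route most files to the known-correct oracle (GCC in the blog)
    and a disjoint slice to the implementation under test."""
    agent_files, oracle_files = [], []
    for f in files:
        # Stable per-file bucket in [0, 100), independent of run order.
        bucket = int(hashlib.sha256(f.encode()).hexdigest(), 16) % 100
        if bucket < agent_share * 100:
            agent_files.append(f)
        else:
            oracle_files.append(f)
    return agent_files, oracle_files

files = [f"kernel/file{i}.c" for i in range(50)]
agent, oracle = split_by_oracle(files)
```

Because the partition is a pure function of the file name, every agent in the fleet computes the same split without coordination.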
This is a universal pattern: If your task is not parallelizable, find an "oracle" to split it. In Harness Design, the sprint structure plays a similar role, breaking a large application into independent functional sprints.
Agent Specialization
Both C Compiler and Harness Design demonstrate the value of agent specialization:
Review/Evaluation agents should be "more expensive" than coding agents. buffa uses Opus for review (Sonnet for coding); Harness Design's evaluator uses Playwright for end-to-end testing. Judgment is more valuable than generation speed.
V. Reusable Methodology Framework
Five-Layer Pyramid of Documentation
Summarizing the above, we can distill an AI collaboration documentation system:
Traditional projects only need L4 and part of L2. L1, L3, L5 are added specifically for AI collaboration. Sprint contracts and scoring criteria in the Harness Design blog are dynamic versions of L2; README self-maintained by Claude in the C compiler is an automated version of L2.
Five Core Principles
Principle 1: Test Quality Determines Code Quality. AI will solve the problem you give it, so the validator must be nearly perfect. Tests are written for AI (concise output, machine-parsable, sampling modes to solve time blindness).
Principle 2: Doing and Evaluating Must Be Separated. AI self-evaluation is systematically too lenient. Independent evaluator + read-only permissions + high-capability model = reliable quality feedback.
Principle 3: Constrain with Infrastructure, Not Prompt Preaching. CI pipelines, file locks, type checkers, scoring thresholds—these hard constraints are far more effective than "please write high-quality code."
Principle 4: Record Rejected Solutions. Preventing AI from regressing to disproven solutions is the most unique documentation requirement in AI coding.
Principle 5: Harness Simplifies with Model Upgrades. Each component encodes assumptions about "what the model cannot do." When a new model is released, strip away components that no longer bear weight and add new ones to achieve higher capabilities.
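Principle 3's "hard constraints over prompt preaching" can be sketched as a commit gate. The check names and the gate itself are illustrative; in practice the callables would shell out to the test suite, the linter, and the reviewer agent:

```python
def commit_gate(checks):
    """A commit proceeds only if every named check passes; there is no
    prompt the agent can write to talk its way past this function."""
    failures = sorted(name for name, check in checks.items() if not check())
    if failures:
        raise SystemExit("blocked by: " + ", ".join(failures))
    return "committed"

# Passing gate: every check returns True.
result = commit_gate({"tests": lambda: True, "lint": lambda: True})
```

The gate embodies the principle: quality rules live in infrastructure that can refuse, not in instructions the model may ignore.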
Applicable Boundaries of AI Coding
The last row is a dimension added by the two forks. Tokio's scheduler stall detection and moka's TOCTOU fix both belong to "runtime internals"; such work requires deep understanding of async scheduler work-stealing algorithms, memory ordering of concurrent data structures, etc., and remains the domain of human engineers.
VI. Conclusion
These seven information sources—Anthropic's five open-source projects and two engineering blogs—all point to the same underlying philosophy:
The upper limit of AI coding quality does not depend on model capability, but on the constraints, feedback, and verification structures you build around the model.
buffa's DESIGN.md is a static constraint (human-prewritten long-term memory), the C compiler's CI and test harness are dynamic constraints (runtime quality gates), and Harness Design's Generator-Evaluator adversary is an adversarial constraint (engineering application of GAN structure). These three constraint methods face different scenarios, but the underlying logic is consistent.
Meanwhile, the forks of tokio and moka reveal another dimension: AI-written code must eventually run on real infrastructure. When buffa and connect-rust enter production environments and bear the concurrent load of millions of users, the issues exposed are no longer about AI coding, but about underlying runtimes and dependency libraries: scheduler stalls, concurrent cache races. These problems require human engineers to troubleshoot and fix at the depth of tokio/moka, work currently beyond the boundary of AI coding capabilities.
Carlini said at the end of his C compiler blog that he was both excited and uneasy. As a former penetration tester, he said, "Programmers deploying software they have never personally verified really worries me." This concern forms a meaningful tension with buffa's approach: buffa passed all conformance tests and is used in production; the C compiler explicitly notes known defects. The difference lies in the completeness of verification. AI-written code is only trustworthy when the verification system is complete.
Anthropic itself is the best practitioner of this methodology. They know better than anyone where Claude is reliable (buffa's encoding/decoding), where it needs human architecture (connect-rust's RPC design), where it can be fully autonomous but without quality guarantees (C compiler showcase), and where AI should simply not intervene (tokio scheduler internals and moka concurrency primitive correctness fixes).
The last sentence of the Harness Design blog is worth serving as the conclusion of this entire article:
"The interesting harness combination space does not shrink as models improve. It shifts. The interesting work for AI engineers is constantly finding the next new combination."
Appendix
Appendix A: buffa-design-annotated.md — a paragraph-by-paragraph annotated interpretation of buffa's DESIGN.md (Gist)[8]
Appendix B: rust-code-reviewer-annotated.md — a paragraph-by-paragraph annotated interpretation of connect-rust's rust-code-reviewer.md (Gist)[9]
[1] connect-rust: https://github.com/anthropics/connect-rust
[2] Building a C compiler with a team of parallel Claudes: https://www.anthropic.com/engineering/building-c-compiler
[3] Harness design for long-running application development: https://www.anthropic.com/engineering/harness-design-long-running-apps
[4] anthropics/tokio: https://github.com/anthropics/tokio
[5] anthropics/moka: https://github.com/anthropics/moka
[6] ConnectRPC: https://connectrpc.com/docs/protocol/
[7] RFC 007: https://github.com/connectrpc/connectrpc.com/pull/334
[8] buffa-design-annotated.md (annotated DESIGN.md Gist): https://gist.github.com/ZhangHanDong/f4ac670f2fdd939bc4355dad93df92b0
[9] rust-code-reviewer-annotated.md (annotated rust-code-reviewer.md Gist): https://gist.github.com/ZhangHanDong/ebc9577991d0a5ce94f1d91c5d64fe40