Author | Huang Xiaoyi
Email | huangxiaoyi@pingwest.com
Use the same model, switch to a different Harness, and programming benchmark scores double. The industry has argued about this for two months. Now Stanford says: stop arguing.
Harness is Taking Off, But It Is Also Causing Debate
The hottest AI engineering concept at the start of 2026 is Harness.
It refers to everything outside the model itself—prompt templates, context management, retrieval strategies, multi-step reasoning orchestration, and tool-calling logic. To sum it up in one sentence: how you invoke the model is as important as the model itself, or even more so.
After OpenAI's Codex team wrote 1 million lines of Agent code over five months, their biggest takeaway was "Agents aren't hard; Harnesses are." In the SWE-Bench Mobile paper, the same Claude Opus 4.5 achieved success rates ranging from 2% to 12% depending on the Harness—a six-fold difference. LangChain's coding Agent on Terminal Bench 2.0 improved its score from 52.8% to 66.5%, jumping from 30th to 5th place, solely by optimizing the Harness without modifying the underlying model.
The data is convincing enough. The concept of Harness quickly broke out of academic circles and became a buzzword in the industry.
But once a concept gets hot, controversy follows. Those throwing cold water on this Harness fever include OpenAI's Noam Brown, who says Harnesses are essentially crutches that models will eventually outgrow—citing the emergence of reasoning models that rendered countless carefully designed Agentic systems obsolete overnight. The Claude Code team also says, "All the secret weapons are in the model itself; pursue the thinnest possible wrapper."
Anthropic's practice offers a subtle perspective. They initially built a rather heavy Harness for Opus 4.5: a GAN-style adversarial architecture, a three-Agent division of labor, and sprint contracts. But when Opus 4.6 arrived, the Harness was simplified outright: sprint decomposition was removed, the overall structure was streamlined, and cost dropped from six hours and $200 to 3.8 hours and $125. Better performance, lower cost.
This approach is called Build to Delete—the thickness of a Harness depends on the current capability boundaries of the model. As the model gets stronger, the corresponding Harness should be stripped away.
So what is the essence of the debate? It is not whether Harnesses are important, because the data has already answered that. Rather, it is that a Harness is not a static thing—it needs to evolve continuously as models iterate, tasks change, and capability boundaries shift.
Yoonho Lee's team at Stanford and Omar Khattab at MIT saw this contradiction, and offered an unexpected answer:
Stop arguing. Let AI build its own Harness.
Meta-Harness: A Counter-Intuitive Brute-Force Solution
The full paper title is Meta-Harness: End-to-End Optimization of Model Harnesses, with authors including Yoonho Lee, Chelsea Finn (Stanford), Omar Khattab (MIT, creator of the DSPy framework), and others.
The core idea's counter-intuitive quality lies in this: let a sufficiently powerful coding Agent iteratively optimize its own Harness to fit the model. Do not compress anything during the process—store everything. Let the Agent browse, analyze, summarize, and then write a better Harness framework.
Specifically, all content produced in each iteration—including complete source code for candidate Harnesses, per-sample execution traces, and scoring results—is saved as files in a structured directory. No database, no vector retrieval—just the most primitive files and folders.
Then, a coding Agent is placed into this system with only one task: "Write a better Harness based on the experience of all previous attempts."
The outer loop is extremely simple: generate a candidate, evaluate it, save the complete results, have the Agent analyze all history, generate a new candidate, repeat. No fancy search algorithms, no evolutionary strategies, no gradient approximation. All the intelligence of the search comes from the Agent's own code understanding and reasoning.
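That loop can be sketched in a few lines. Everything here is illustrative, not the paper's released code: the function names, the directory layout, and the file schema are all assumptions made for the sketch.

```python
import json
from pathlib import Path

def meta_harness_loop(generate, evaluate, agent_analyze, rounds, root="runs"):
    """Minimal sketch of the Meta-Harness outer loop: generate a candidate,
    evaluate it, persist *everything* to disk as plain files, then let a
    coding Agent read the full history and propose the next candidate."""
    history_dir = Path(root)
    history_dir.mkdir(exist_ok=True)
    candidate = generate(notes=None)            # first candidate, no history yet
    for i in range(rounds):
        score, traces = evaluate(candidate)     # score plus per-sample traces
        run_dir = history_dir / f"round_{i:03d}"
        run_dir.mkdir(exist_ok=True)
        # Store everything in a structured directory: no database, no vectors.
        (run_dir / "harness.py").write_text(candidate)
        (run_dir / "score.json").write_text(json.dumps({"accuracy": score}))
        for j, trace in enumerate(traces):
            (run_dir / f"trace_{j:04d}.json").write_text(json.dumps(trace))
        # The Agent browses the raw files itself and distills its findings.
        notes = agent_analyze(history_dir)
        candidate = generate(notes=notes)
    return candidate
```

In practice `generate` and `agent_analyze` would both be calls into a coding Agent; the point of the sketch is that the loop itself carries no search intelligence at all.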
Why Existing Methods Fall Short
This approach looks simple, but it solves a problem that no previous automatic optimization method could solve: information retention.
Past text optimizers—Google's OPRO, TextGrad, DeepMind's AlphaEvolve—share a fatal flaw: they compress historical feedback too aggressively. Some methods have no memory at all, starting from scratch each round. Some only retain a scalar score, such as "62% accuracy." Some compress execution processes into brief summaries.
This is like asking an engineer to debug a complex system but only telling them "the previous version scored 62"—no logs, no stack traces, no error samples. How would they know what to fix?
Meta-Harness takes the opposite approach. Each evaluation can generate 10 million tokens of diagnostic information—including complete execution traces for each sample's inputs, model outputs, correct answers, and intermediate reasoning steps.
The Agent isn't fed a summary; it does actual research, autonomously deciding which files to read. According to the paper, the Agent reads a median of 82 files per round. It examines the source code of the previous best and worst performing Harnesses, spot-checks execution traces for specific samples, discovers patterns like "the model always classifies type A as type B for this sample category," compares differences between two Harnesses, and infers which design decision caused the performance change.
This process mirrors the workflow of an excellent engineer conducting experimental analysis—except it reads files hundreds of times faster and never gets tired.
Why It Is Only Feasible Now
The authors specifically point out a timing issue: Meta-Harness only became feasible in early 2026. The reason is simple—it completely relies on the qualitative leap in Coding Agent capabilities over the past year. Agents from two years ago couldn't autonomously navigate directories containing hundreds of files, perform meaningful analysis, or write executable code. Today they can.
This isn't just a methodological breakthrough; it is a story of timing. The improvement in Agent capabilities has suddenly made a solution that was previously the right idea but impossible to execute into a reality.
Three Battlefields, Three Dominating Performances
No matter how elegant the theory, numbers must back it up. Meta-Harness was validated on three vastly different tasks.
Battlefield One: Text Classification—Four Iterations Match Others' Forty
In text classification experiments, Meta-Harness achieved 48.6% accuracy, 7.7 percentage points higher than ACE, the previous best manual baseline at 40.9%. More notable is the efficiency: context token usage was 11.4K versus ACE's 50.8K, less than a quarter. Better results, lower costs.
Convergence speed is equally impressive: four evaluation iterations were enough to match what competing methods needed forty to reach. The information density the Agent extracts from complete traces each round far exceeds what optimizers that only see scores or summaries can work with.
The paper also conducted out-of-distribution generalization tests—taking the optimal Harness searched on five datasets and transferring it directly to nine unseen datasets—with results still outperforming ACE. This shows Meta-Harness finds not tricks specific to particular datasets, but better framework designs.
Battlefield Two: Mathematical Reasoning—Automatically Discovering Routing Strategies Humans Never Considered
On IMO-difficulty retrieval-augmented mathematical reasoning tasks, Meta-Harness automatically discovered a four-way routing BM25 retrieval strategy: the system learned to classify math problems into four categories, combinatorics, geometry, number theory, and default, and to use differentiated retrieval parameters for each. No human engineer had specified such a fine-grained routing design in advance.
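The paper does not publish the discovered Harness code, so here is a minimal sketch of what four-way routing over BM25 retrieval parameters could look like. The keyword rules, category boundaries, and parameter values (`k1`, `b`, `top_k`) are all made up for illustration; the discovered Harness learned its own.

```python
# Illustrative router: each category gets its own BM25 retrieval settings.
ROUTES = {
    "combinatorics": {"keywords": ("count", "arrange", "permutation", "subsets"),
                      "k1": 1.2, "b": 0.60, "top_k": 8},
    "geometry":      {"keywords": ("triangle", "circle", "angle", "polygon"),
                      "k1": 1.5, "b": 0.75, "top_k": 5},
    "number_theory": {"keywords": ("prime", "divisible", "modulo", "integer"),
                      "k1": 0.9, "b": 0.40, "top_k": 10},
    "default":       {"keywords": (), "k1": 1.2, "b": 0.75, "top_k": 6},
}

def route(problem: str) -> tuple:
    """Classify a math problem and return (category, BM25 parameters)
    that the retrieval step should use for it."""
    text = problem.lower()
    for name, cfg in ROUTES.items():
        if any(kw in text for kw in cfg["keywords"]):
            return name, cfg
    return "default", ROUTES["default"]
```

The design point is that routing happens before retrieval, so every category can tune BM25's term-saturation (`k1`) and length-normalization (`b`) behavior to its own corpus.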
Transfer capability is equally noteworthy: the optimal Harness searched using GPT-OSS-20B showed zero-shot transfer improvements across five unseen reasoning models. This means good framework designs work for different models—Harness optimization and model selection are orthogonal dimensions. In other words, your investment in Harness engineering will not be wasted when you switch models.
Battlefield Three: Coding Agents—Surpassing Human Harness Solutions
In the Claude Haiku 4.5 category, Meta-Harness achieved 37.6%, taking first place and surpassing all known manual Harnesses such as Goose and the official Claude Code. In the Claude Opus 4.6 category, it scored 76.4%, taking second place.
Meta-Harness also autonomously discovered a key trick on this task: Environment Bootstrapping. Before the Agent executes tasks, it automatically runs shell commands to collect sandbox environment snapshots, including OS version, installed packages, and directory structure, then injects them into the initial prompt.
This eliminates the two to four rounds of environment exploration that Agents typically require—no more wasting inference steps to see what is in the directory. For coding Agents with limited token budgets, saving these rounds directly boosts effective reasoning capability. No one told the system to make this optimization; Meta-Harness discovered it during the search process.
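The exact commands the discovered Harness runs are not published; a minimal sketch of the idea, with the command list and section format as assumptions, might look like this:

```python
import subprocess

def run(cmd):
    """Run one snapshot command, returning its output or a placeholder."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return out.stdout.strip() or "(no output)"
    except (OSError, subprocess.TimeoutExpired):
        return "(unavailable)"

def bootstrap_snapshot(workdir="."):
    """Collect a sandbox snapshot up front and format it for injection
    into the Agent's initial prompt, so the Agent can skip its usual
    rounds of 'ls; uname; pip list' style exploration."""
    sections = {
        "OS": run(["uname", "-a"]),
        "Installed Python packages": run(["python3", "-m", "pip", "list"]),
        "Directory structure": run(["ls", "-la", workdir]),
    }
    return "\n".join(f"## {name}\n{body}" for name, body in sections.items())

# Prepend the snapshot to the task before the Agent's first inference step:
# initial_prompt = bootstrap_snapshot() + "\n\n" + task_description
```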
Ablation Study: Information Volume Is the Key Lever
The paper presents a comparison of three information retention strategies, with clear results:
Retain only scores: median accuracy 34.6%
Scores plus summary: 34.9%
Complete traces (Meta-Harness): 50.0%
Complete traces yield an improvement of more than fifteen percentage points, while summaries provide almost no help and are sometimes even harmful, because compression discards seemingly trivial but crucial diagnostic details.
This is a conclusion worth chewing over for the entire AI-optimizing-AI field: when Agents are powerful enough, human preprocessing and compression do not help; they get in the way. Give the Agent all the raw information and let it decide what to look at and what to ignore. The results are far better than with human intervention.
The Same Bitter Lesson, Two Interpretations
Finally, let us return to the industry debate—placing Meta-Harness into the picture makes things quite interesting.
The industry categorizes Noam Brown's view into the Bitter Lesson camp because he mentions Harnesses are crutches: AI research has repeatedly proven that carefully designed systems relying on human domain knowledge will eventually be flattened by brute compute, so do not waste time on framework engineering—bet on continuous growth in model capabilities.
Meta-Harness also uses the Bitter Lesson: AI research has repeatedly proven that general search defeats careful manual design—so do not manually design Harnesses; let AI find optimal solutions through general search. It does not deny that Harnesses are important, nor does it deny that models will keep getting stronger. It says: since manual Harnesses will eventually be eliminated, let AI take over.
Simply put, Noam Brown's version is "do not bother building Harnesses," while Meta-Harness's version is "do not bother building Harnesses manually."
Meta-Harness essentially redefines the coordinate system of this debate. Model and Harness are not either-or choices. When Harness optimization itself is automated, the two paths naturally converge—as models get stronger, the optimal Harness found by Meta-Harness will also become thinner. Anthropic's manually executed Build to Delete happens automatically under this framework.
This itself is the kind of greater computation that the Bitter Lesson says will always win.
The paper team proposes a further direction at the end: co-evolution of Harness and model weights. Today model training and framework design are still two separate processes. But if Harnesses can be automatically optimized, how should future model training incorporate Harnesses into the optimization loop?
Coincidentally, former Alibaba Qwen technical lead Lin Junyang has been saying similar things recently. In his post-departure essay "From Reasoning Thinking to Agentic Thinking," he pushes the role of the Harness even further upstream: not just the runtime framework for inference, but core infrastructure for training. What kind of Harness environment Agents are trained in determines what they can learn.
Now Stanford has let AI take over the Harness for inference. Is Lin Junyang aiming at the training-time Harness?
There is an intriguing distinction here: the Harness for inference has clear goals, with scores showing clear winners, and AI is faster than humans. The Harness for training defines whether the model's overall capabilities become stronger after training in this environment—a long-term, sparse, difficult-to-attribute process. Building this layer will probably still require humans.
The direction is set. Who moves first? The poker table of the second half of 2026 may have another new question.