Meta-Harness Supercharges Haiku's Performance, Even Rivalling Opus!


Reported by Xinyizhiyuan

Editor: Qingqing

[Introduction] What would happen if, one day, AI agents could tune their own parameters and fix their own bugs?

Recently, Yoonho Lee, a PhD student at Stanford's IRIS Lab, in collaboration with researchers from MIT and the University of Wisconsin, released a new paper that flips the logic of AI agent optimization on its head.

The author lineup is impressive, featuring guidance from robotics learning star Chelsea Finn and collaborator Omar Khattab, the creator of the DSPy framework.

In the past, the industry focused on scaling model parameters, training data, and RLHF. Meta-Harness takes a different approach: the "scaffolding" that supports the model's operation is equally critical to success.

Previously, these elements relied entirely on manual tuning. Now, Meta-Harness lets the AI do the work itself.

Meta-Harness logic

The results are remarkable: Claude Haiku 4.5 achieved a success rate of 37.6%, topping all Haiku agent leaderboards; meanwhile, Claude Opus 4.6 reached 76.4%, second only to the top-ranked ForgeCode.

Performance comparison

Models Are Commodities, the Harness Determines Success

A "harness" refers to the entire infrastructure: system prompts, tool definitions, retry logic, context management, sub-agent coordination, and lifecycle hooks.

If the model is the brain, the harness is the body that allows that brain to execute tasks.
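As a concrete illustration, a minimal harness can be sketched as a configuration object bundling these pieces. The names and structure below are hypothetical, not from the paper:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """Minimal sketch of an agent harness: everything around the model."""
    system_prompt: str                       # instructions framing every run
    tools: dict[str, Callable] = field(default_factory=dict)  # name -> tool fn
    max_retries: int = 3                     # retry logic for flaky tool calls
    context_limit: int = 100_000             # token budget for context management

    def run_tool(self, name: str, *args):
        """Invoke a tool, retrying on transient failures."""
        last_err = None
        for _ in range(self.max_retries):
            try:
                return self.tools[name](*args)
            except RuntimeError as err:      # stand-in for a transient failure
                last_err = err
        raise last_err

harness = Harness(system_prompt="You are a coding agent.",
                  tools={"echo": lambda s: s})
print(harness.run_tool("echo", "hello"))
```

Every field here is a knob that harness engineering tunes; Meta-Harness's point is that none of them need to be tuned by hand.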

This concept has exploded in popularity recently as the industry realizes that the same model can exhibit vastly different performance levels depending on the harness used.

In February, engineer Can Bölük conducted an experiment where he changed only the editing format without touching the model. The coding performance of 15 LLMs increased by 5 to 14 percentage points, while output tokens decreased by approximately 20%.

Format experiment results

Even more striking, GPT-4 Turbo's accuracy soared from 26% to 59% simply by changing the editing format. With the model held constant, the harness alone drove a more-than-twofold difference in performance.

The trend has become: Agent = Model + Harness. While models provide intelligence, the harness makes that intelligence useful. Projects like Claude Code and Codex are focused on this exact goal: designing precise harnesses to compensate for model shortcomings.

However, harness engineering currently relies heavily on human effort. Engineers must manually write prompts, tune tool interfaces, and design retry strategies, then run tests, analyze logs, guess the problem, modify code, and repeat. This cycle is tedious, and many failure modes are too complex for humans to diagnose easily.

Meta-Harness aims to automate this entire loop.

Automation flow

400x More Information: AI-Driven "Review + Iteration"

The core idea of Meta-Harness is to give the optimizer far more data. This sounds simple, but observation bandwidth has been the bottleneck of previous methods.

Context observation comparison

Comparison of context observation between Meta-Harness and mainstream optimization methods:

Self-Refine only looks at the most recent output and self-critique (approx. 1,000 tokens); OPRO looks at a few rounds of solutions and scores (approx. 2,000 tokens); more advanced methods like TextGrad, AlphaEvolve, and GEPA range between 8,000 and 26,000 tokens.

Meta-Harness, however, can handle up to 10 million tokens—a 400-fold difference.

Why is this necessary? Because failure modes in harness engineering are often hidden in the minutiae of the execution trace. If a task fails, it might be because a tool call ten steps prior returned a truncated output, throwing off all subsequent reasoning.

If an optimizer only sees a scalar "failure" score or a compressed summary, it cannot locate the problem. Meta-Harness provides the "proposer" with a complete file system containing source code for all historical candidate harnesses, execution traces for every round, command logs, error messages, timeout behaviors, and scoring results.

The proposer can use standard tools like grep and cat to browse files and search for keywords as needed.

The optimizer is no longer just reasoning over a fixed prompt; it is an agent capable of retrieving information, browsing history, and editing code. The proposer runs on Claude Code, so it can decide for itself what to look at and how to analyze it, with no need for compressed data.

The search loop is straightforward:

  1. Proposer reads historical records from the file system.
  2. Analyzes which tasks failed and why.
  3. Rewrites the harness code in a targeted way.
  4. The new harness is tested, and results are written back to the file system.
  5. The loop repeats.

Core optimization loop
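The five steps above can be sketched as a simple driver loop. All function names and the toy scoring below are illustrative placeholders, not the paper's API:

```python
def meta_harness_search(workspace: dict, iterations: int = 7) -> dict:
    """Sketch of the propose-test-record loop over a shared 'file system'.

    `workspace` plays the role of the file system: it accumulates every
    candidate harness along with its score.
    """
    best = workspace["candidates"][-1]              # start from the baseline
    for _ in range(iterations):
        history = workspace["candidates"]           # 1. read historical records
        failures = [c for c in history if c["score"] < 1.0]  # 2. find failures
        candidate = propose_fix(best, failures)     # 3. targeted rewrite
        candidate["score"] = evaluate(candidate)    # 4. test the new harness
        workspace["candidates"].append(candidate)   #    ...write results back
        if candidate["score"] > best["score"]:
            best = candidate                        # 5. repeat from the best
    return best

# Toy stand-ins so the sketch runs end to end (both are placeholders):
def propose_fix(base, failures):
    return {"prompt": base["prompt"] + " +fix", "score": 0.0}

def evaluate(candidate):
    return min(1.0, 0.1 * candidate["prompt"].count("+fix") + 0.285)

ws = {"candidates": [{"prompt": "baseline", "score": 0.285}]}
print(meta_harness_search(ws)["score"])
```

In the real system the proposer is itself an agent reading traces with `grep` and `cat`; here the diagnosis step is collapsed into a single function call to keep the shape of the loop visible.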

The paper demonstrates the search process on a subset of 19 tasks. Starting from a Terminus-KIRA baseline of 28.5%, the success rate climbed to 46.5% by the 7th iteration.

Iterative optimization process

Each round is based on "counterfactual diagnosis" using specific execution traces—asking "if I had handled it this way, would the result be different?" For example, the 7th iteration's improvement involved running a shell command before the first LLM call to inject environment dependency information into the initial prompt. This single command eliminated unnecessary trial-and-error, a level of diagnostic precision impossible with compressed summaries.
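The 7th-iteration fix described above amounts to front-loading environment facts into the first prompt. A minimal sketch follows; the exact probe command and prompt wording are assumptions:

```python
import subprocess
import sys

def build_initial_prompt(task: str) -> str:
    """Run a shell command before the first LLM call and inject its output.

    Front-loading environment facts (here: the interpreter version) spares
    the agent from discovering them by trial and error mid-task.
    """
    probe = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.version.split()[0])"],
        capture_output=True, text=True, check=True,
    )
    return (f"Environment: Python {probe.stdout.strip()}\n"
            f"Task: {task}")

print(build_initial_prompt("translate this C module to Rust"))
```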

Harness optimization

89 Tasks: Small Models Taking the Lead

Meta-Harness was tested in three scenarios: text classification, mathematical reasoning, and code agents. For code agents, the TerminalBench-2 benchmark was used, featuring 89 Dockerized tasks covering code translation, distributed ML configuration, system programming, bioinformatics, and cryptanalysis.

Each task is binary-scored over 5 runs. These tasks are highly difficult as they require long-term autonomous execution, complex dependency management, and the ability to handle truncated terminal output.
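Binary scoring over repeated runs can be expressed as a mean of per-task success fractions. This is a plausible reading of the protocol, not the benchmark's actual code:

```python
def success_rate(results: dict[str, list[bool]]) -> float:
    """Aggregate per-task pass/fail outcomes into one benchmark score.

    Each task maps to its 5 binary run outcomes; the benchmark score is
    the mean of the per-task success fractions.
    """
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)

runs = {
    "code-translation": [True, True, False, True, True],    # 4/5 passes
    "crypto-analysis":  [False, False, True, False, False], # 1/5 passes
}
print(success_rate(runs))  # -> 0.5
```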

Meta-Harness optimizes the full coding harness, including system prompts, tool definitions, completion detection logic, and context management.

The results show that Claude Haiku 4.5 achieved a 37.6% success rate, ranking first among all Haiku 4.5 agents, surpassing Goose (35.5%). Claude Opus 4.6 achieved 76.4%, ranking second only to ForgeCode (81.8%).

Crucially, Haiku is the lightest version of the Claude series. Traditionally, smaller models are limited by a lower performance ceiling. Meta-Harness proves that by optimizing the harness, the ceiling for small models can be significantly raised.

End-to-end optimization summary

Harness efficiency

Beyond Code: Effective for Text Classification and Math Reasoning

In text classification using LawBench, Symptom2Disease, and USPTO-50k datasets with GPT-OSS-120B, the best harness discovered reached 48.6% accuracy, 7.7 percentage points higher than the previous SOTA method, ACE. Furthermore, it was more cost-effective, using only 45.5K context tokens compared to ACE's 203K.

Text classification results

Compared to other program search methods with the same proposer and budget, Meta-Harness reached equivalent final accuracy with one-tenth the number of evaluations, eventually exceeding them by over 10 percentage points.

Search progress comparison

In mathematical reasoning, Meta-Harness searched for retrieval-augmented reasoning strategies. A single discovered retrieval harness improved performance across five new models by an average of 4.7 percentage points (from 34.1% to 38.8%) without changing the models themselves.

Cross-model transferability

The competition for model capability is entering a new phase. While labs previously competed on parameters and data, the gap between top models like GPT-5, Claude 4, and Gemini 3 is narrowing on many tasks.

The real differentiator is now the harness. The model provides the intelligence; the harness is the amplifier. And now, optimizing the amplifier can be handed over to the AI itself.

Reference: https://x.com/yoonholeee/status/2038640635482456118
