Meta-Harness: Stanford's Latest Harness Paper Earns Praise from Lin Junyong

Hello everyone, I'm PaperAgent, not Agent!

Today I'm sharing the latest Harness paper from Stanford, praised by Lin Junyong (former head of Alibaba's Qwen), who commented directly: "nice work".

"The combination of 'model + Harness' has surpassed the model itself. Agent performance is significantly influenced by the design and quality of the Harness. I firmly believe this is the right research direction. Great job!"


Meta-Harness introduces an outer-loop optimization framework that enables coding agents to automatically search and optimize the "Harness" of large language models (i.e., the code that controls information storage, retrieval, and presentation). By granting agents file system access to complete historical experience (source code, execution traces, scores), the system significantly outperforms human-designed Harness across three domains: text classification, mathematical reasoning, and agentic coding. It achieves a 10x improvement in search efficiency and significant performance breakthroughs.

Why Do We Need to Optimize Harness?

The performance of large language models (LLMs) depends not only on model weights but also heavily on their Harness—the code logic wrapped around the model that determines:

  • What to store: Which historical information is worth retaining
  • What to retrieve: When to extract relevant content from memory
  • What to present: How to construct the context provided to the model
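The paper does not pin down a single interface, but these three responsibilities can be sketched as a minimal harness class. Everything here (class name, method signatures, the keyword-overlap retriever) is illustrative, not the paper's actual code:

```python
class Harness:
    """Minimal sketch of the three harness responsibilities; all names
    and heuristics here are assumptions for illustration only."""

    def __init__(self):
        self.memory = []  # stored historical records

    def store(self, record: dict) -> None:
        """What to store: decide which history is worth keeping."""
        if record.get("score", 0) > 0:  # e.g. keep only successful episodes
            self.memory.append(record)

    def retrieve(self, query: str, k: int = 5) -> list:
        """What to retrieve: pull the most relevant entries from memory."""
        scored = [(sum(w in r["text"] for w in query.split()), r)
                  for r in self.memory]
        return [r for _, r in sorted(scored, key=lambda t: -t[0])[:k]]

    def present(self, query: str) -> str:
        """What to present: build the context handed to the model."""
        examples = "\n".join(r["text"] for r in self.retrieve(query))
        return f"Relevant history:\n{examples}\n\nTask:\n{query}"
```

The point of the sketch is that each of these three methods is ordinary code, which is exactly what Meta-Harness lets the agent rewrite.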

Figure 1: (Left) On text classification tasks, Meta-Harness achieves the same accuracy in just 4 evaluations that other methods require 40 evaluations to reach. (Right) On TerminalBench-2, the Harness discovered by Meta-Harness surpasses all Claude Haiku baselines.

Research shows that changing the Harness of a fixed model can produce performance gaps of up to 6x on the same benchmark [47]. However, current Harness engineering still relies mainly on manual trial-and-error: developers inspect failure cases, adjust heuristic rules, and iterate on limited designs.

Limitations of Existing Text Optimization Methods

Existing text optimizers (such as OPRO, TextGrad, AlphaEvolve) can iteratively improve text, but they over-compress feedback:

  • They rely only on scalar scores
  • They only access the current candidate (no memory)
  • They limit feedback to short templates or LLM-generated summaries

Table 1: Comparison of text optimization methods. Meta-Harness can process up to 10 million tokens of diagnostic information per step, which is 3 orders of magnitude higher than existing methods.

This compression is particularly detrimental in Harness engineering: the impact of a Harness has long-range dependencies—a decision about storage or retrieval may not manifest its effects until many steps later. Compressed feedback often loses the information needed to trace failures back to earlier Harness decisions.

Meta-Harness Core Methodology

The core innovation of Meta-Harness is exposing complete historical experience through a file system, allowing coding agents (rather than fixed optimization algorithms) to decide how to diagnose and improve the Harness.

Search Loop


Figure 2: Meta-Harness search loop. (1) The agent reads a file system containing all previous Harness source code, execution traces, and scores. (2) The newly proposed Harness is evaluated. (3) All logs are stored in a new directory in the file system.

Key Design:

  1. Agentic Proposer: Uses coding agents such as Claude Code rather than raw LLMs; the agent can actively query the file system with tools like grep and cat.
  2. Complete Experience Storage: Each candidate Harness's directory contains:
    • Complete source code
    • Evaluation scores
    • Execution traces (prompts, tool calls, model outputs, state updates)
  3. Selective Diagnosis: The agent reads a median of 82 files per iteration (41% source code + 40% execution traces), rather than loading everything at once.
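The three steps of Figure 2 can be sketched roughly as follows. The `propose_with_agent` and `evaluate` callables stand in for the coding agent and the benchmark runner, and the directory layout is an assumption, not the paper's exact scheme:

```python
import json
from pathlib import Path

def search_loop(workdir: Path, n_iters: int, propose_with_agent, evaluate):
    """Rough sketch of the Meta-Harness outer loop: the agent reads the
    full history on disk, proposes a new harness, and all logs are archived."""
    for i in range(n_iters):
        cand_dir = workdir / f"candidate_{i:03d}"
        cand_dir.mkdir(parents=True)
        # (1) The agent browses workdir itself (grep/cat) and writes new code.
        source = propose_with_agent(workdir)
        (cand_dir / "harness.py").write_text(source)
        # (2) Evaluate the proposed harness; keep full traces, not a summary.
        score, traces = evaluate(source)
        # (3) Store the score and raw execution traces alongside the source.
        (cand_dir / "score.json").write_text(json.dumps({"score": score}))
        (cand_dir / "traces.json").write_text(json.dumps(traces))
    best = max(workdir.glob("candidate_*/score.json"),
               key=lambda p: json.loads(p.read_text())["score"])
    return best.parent
```

Note that nothing is ever deleted or summarized: later iterations can grep through every earlier candidate's traces, which is what enables the selective diagnosis described above.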

Why Does This Work in Code Space?

Harness optimization occurs in code space:

  • Structural Impact: Small changes to retrieval/memory logic can produce large effects after multiple steps.
  • Interpretability: By examining execution traces, agents can infer causes of failure (e.g., "retrieval at step 15 caused subsequent state pollution").
  • Natural Regularization: Code models tend to propose coherent algorithms rather than brittle hard-coded solutions.

Experimental Results

1. Online Text Classification

Tested on three datasets—LawBench, Symptom2Disease, and USPTO—using GPT-OSS-120B as the classifier:


Table 2: Online text classification test results. Meta-Harness surpasses ACE by 7.7 points in average accuracy while reducing context token usage by 4x.

Key Findings:

  • Accuracy Improvement: 48.6% vs ACE's 40.9%, a 7.7-point increase.
  • Context Efficiency: Uses only 11.4K tokens, compared to ACE's 50.8K (a 4x reduction).
  • Speed: Achieves the accuracy of OpenEvolve/TTT-Discover's 40 evaluations in just 4 evaluations (a 10x efficiency improvement).

Figure 3: Pareto frontier of accuracy vs. context tokens. Meta-Harness discovered a broad accuracy-cost trade-off curve.

OOD Generalization: On 9 unseen text classification datasets, Meta-Harness achieved an average accuracy of 73.1%, surpassing ACE's 70.2% (Table 5).

2. Retrieval-Augmented Math

Tested on 200 IMO-level math problems, with a retrieval corpus containing 500,000+ worked solutions:


Table 6: Retrieval-augmented math problem solving. A single discovered Harness improves performance by an average of 4.7 points across 5 held-out models.

Remarkable Findings:

  • The discovered Harness generalizes across models: It consistently improves performance on GPT-5.4-nano, GPT-5.4-mini, Gemini-3.1-Flash-Lite, Gemini-3-Flash, and GPT-OSS-20B.
  • It achieves an average improvement of 4.7 points, surpassing BM25 retrieval (+3.4 points) and Dense Retrieval (+0.3 points).

Discovered Routing Strategies (Figure 8):

  • Combinatorics: BM25 retrieve 20 → deduplicate to 8 → reorder by difficulty → take top 3
  • Geometry: 1 hard NuminaMath reference + 2 BM25 neighbors (no reordering)
  • Number Theory: BM25 retrieve 12 → reorder by vocabulary score, difficulty, and technical explicitness
  • Algebra/Others: Adaptive K-value selection
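The routing above can be expressed as a small dispatch function. The `bm25_search`, `numina_hard`, and `difficulty` callables are placeholders for whatever retriever and scoring the discovered harness actually uses, and the adaptive-K heuristic for algebra is my own illustrative guess:

```python
def route_retrieval(domain: str, query: str, bm25_search, numina_hard, difficulty):
    """Illustrative per-domain routing paraphrasing Figure 8; the helper
    callables are assumed stand-ins, not the harness's real retrievers."""
    if domain == "combinatorics":
        hits = bm25_search(query, k=20)
        deduped = list(dict.fromkeys(hits))[:8]     # deduplicate to 8
        return sorted(deduped, key=difficulty)[:3]  # reorder, take top 3
    if domain == "geometry":
        # 1 hard NuminaMath reference + 2 BM25 neighbors, no reordering.
        return [numina_hard(query)] + bm25_search(query, k=2)
    if domain == "number_theory":
        hits = bm25_search(query, k=12)
        # Simplified: the paper reorders by vocabulary score, difficulty,
        # and technical explicitness; one key shown here.
        return sorted(hits, key=difficulty)
    # Algebra/others: adaptive K (this particular rule is an assumption).
    k = 3 if len(query) < 200 else 5
    return bm25_search(query, k=k)
```

The striking part is that the agent discovered different pipelines per topic on its own, rather than one global retrieval setting.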

3. Agentic Coding: TerminalBench-2

Evaluated on 89 high-difficulty terminal tasks (requiring long-range autonomous execution):


Table 7: TerminalBench-2 pass rates. Meta-Harness ranks 2nd on Claude Opus 4.6 and 1st on Claude Haiku 4.5.

Breakthroughs:

  • Opus 4.6: 76.4% pass rate, surpassing Terminus-KIRA (74.7%), second only to ForgeCode (81.8%, which was not reproducible).
  • Haiku 4.5: 37.6% pass rate, surpassing Goose (35.5%), with even more significant improvements on weaker models.

Key Mechanism Discovered: Environment Bootstrapping—Before the agent loop begins, execute shell commands to collect an environment snapshot (OS, installed languages, package managers, /app directory) and inject it into the initial prompt, saving 3-5 exploration steps.
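The bootstrapping idea can be sketched with a few environment probes. The specific commands, tool list, and prompt wording below are assumptions for illustration, not the discovered harness's code:

```python
import platform
import shutil
import subprocess

def environment_snapshot() -> str:
    """Collect a small environment snapshot before the agent loop starts,
    so the model needn't spend its first steps on discovery commands."""
    lines = [f"OS: {platform.system()} {platform.release()}"]
    # Which common toolchains are installed? (illustrative list)
    for tool in ("python3", "node", "go", "cargo", "pip", "npm", "apt-get"):
        path = shutil.which(tool)
        if path:
            lines.append(f"{tool}: {path}")
    try:  # listing of the task directory (/app in TerminalBench images)
        ls = subprocess.run(["ls", "/app"], capture_output=True,
                            text=True, timeout=5).stdout.strip()
        lines.append(f"/app contents:\n{ls}")
    except OSError:
        pass
    return "\n".join(lines)

def initial_prompt(task: str) -> str:
    """Inject the snapshot into the first prompt, per the discovered mechanism."""
    return f"Environment snapshot:\n{environment_snapshot()}\n\nTask:\n{task}"
```

Because the snapshot is gathered by plain shell probes outside the model loop, it costs no model tokens to produce, only to read.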

In-Depth Analysis

Information Access Ablation Study

What makes Meta-Harness so effective? Comparing three information access methods:


Table 3: Proposer information ablation. Access to scores only: 41.3% best accuracy. Scores + summary: 38.7%. Full access (including execution traces): 56.7%.

Conclusion: Access to raw execution traces is a key ingredient for Harness optimization. Summaries may actually compress away information useful for diagnosis.

Qualitative Analysis: How Does the Agent Learn?

In the TerminalBench-2 search logs (Appendix A.2), the agent demonstrates causal reasoning ability:

  1. Rounds 1-2: Applied a structural fix and a prompt-template change simultaneously → performance regression.
  2. Round 3: Explicitly diagnosed that "the root cause of regression is the prompt template change, not the structural fix" → isolated testing.
  3. Round 7: Shifted to purely additive modifications (environment snapshot) → best candidate.
  4. Round 8: Attempted combination (environment snapshot + early fix) → further optimization.

This ability to identify confounds from failures and adjust strategies is enabled by complete file system access.

Examples of Discovered Harnesses

Draft-Verification Classification Harness (Figure 5)

# Two-stage pipeline
Stage 1: Retrieve 5 similar examples → Generate Draft label D
Stage 2: Retrieve 5 confirmers (=D) + 5 challengers (≠D) → Verify or correct D
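The two-stage pseudocode above can be fleshed out in Python. Here `retrieve` and `llm` are assumed callables standing in for the harness's actual retriever and model call:

```python
def classify(x: str, retrieve, llm) -> str:
    """Sketch of the draft-verification pipeline (Figure 5); the
    retrieve/llm interfaces are assumptions, not the paper's code."""
    # Stage 1: draft a label from the 5 nearest labeled examples.
    neighbors = retrieve(x, k=5)
    draft = llm(f"Examples:\n{neighbors}\n\nLabel this input: {x}")
    # Stage 2: contrast confirmers (same label as the draft) with
    # challengers (other labels), then verify or correct the draft.
    confirmers = retrieve(x, k=5, label=draft)
    challengers = retrieve(x, k=5, exclude_label=draft)
    return llm(
        f"Draft label: {draft}\n"
        f"Supporting examples: {confirmers}\n"
        f"Counter-examples: {challengers}\n"
        f"Verify or correct the label for: {x}"
    )
```

The second retrieval is conditioned on the draft, so the verification prompt deliberately includes evidence against the draft label, not just for it.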

Label-Primed Query Harness (Figure 6)

Constructs a single large prompt containing:

  • Label Primer: Lists all valid labels
  • Coverage Block: Most relevant examples for each label category
  • Contrastive Block: Pairs of examples that are similar but have different labels
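Assembling such a prompt from those three blocks might look like the following sketch. The structure is inferred from the bullet list above; all helper names and the exact formatting are illustrative:

```python
def build_label_primed_prompt(x, labels, retrieve, confusable_pairs):
    """Sketch of the label-primed query harness (Figure 6): one large
    prompt with a label primer, coverage block, and contrastive block.
    `retrieve` and `confusable_pairs` are assumed inputs."""
    # Label Primer: list every valid label up front.
    primer = "Valid labels: " + ", ".join(labels)
    # Coverage Block: the most relevant examples for each label category.
    coverage = "\n".join(
        f"[{lbl}] " + "; ".join(retrieve(x, k=2, label=lbl))
        for lbl in labels
    )
    # Contrastive Block: similar texts that carry different labels.
    contrastive = "\n".join(
        f"Similar texts, different labels: {a!r} -> {la} | {b!r} -> {lb}"
        for (a, la), (b, lb) in confusable_pairs
    )
    return (f"{primer}\n\nCoverage examples:\n{coverage}\n\n"
            f"Contrastive examples:\n{contrastive}\n\nClassify: {x}")
```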
Paper: Meta-Harness: End-to-End Optimization of Model Harnesses (https://arxiv.org/pdf/2603.28052)
Project page: https://yoonholee.com/meta-harness/
Optimized harness: https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact

AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.