Sharing Two Latest Harness Papers: One from Google, One from Microsoft

In today's rapidly evolving landscape of LLM Agents, designing an appropriate Harness (a constraint/guidance system) has become a critical challenge. Today, we share two cutting-edge papers that propose automated Harness evolution methods along two distinct dimensions: memory systems and action constraints.

  • One paper from Microsoft, titled M⋆, focuses on equipping each task with its own exclusive memory Harness structure.
  • The other, from Google, named AutoHarness, is dedicated to automatically generating code-level constraints to prevent illegal actions.

Overview of automated Harness evolution for LLM Agents

To be honest, my immediate reaction after reading these two papers was: AI research really has shifted toward self-evolving Agents.

For those who want to delve deeper into this direction, we have compiled a collection of frontier papers and code covering Self-evolving Skills, Agent Systems, World Models, Context, Harness, and more.

I. M⋆: Every Task Deserves Its Own Exclusive Memory Harness

1.1 Core Problem: Limitations of Fixed Memory Structures

Current LLM Agent memory systems often adopt a "one-size-fits-all" design—whether it's semantic retrieval for conversational agents, skill systems for coding agents, or structured databases for professional domains. The problem is: memory designs optimized for one domain often fail to transfer to others.

Figure 1: Schematic diagram of different memory structures evolved for different tasks, showing unique memory Harness structures for Legal, Conversation, Embodied AI, and Healthcare domains

As shown in Figure 1, conversational tasks (LoCoMo) require entity relationship graphs to track character relations, legal queries (PRBench) need relational databases to store precedents, while embodied intelligence (ALFWorld) requires trajectory lookup tables. These structural differences are vast and cannot be solved by a single generic solution.

1.2 Method: Executable Program Evolution

M⋆ represents the memory Harness as a Python Memory Program, containing three core components:

  • Schema: Defines the data format for storage and retrieval (using Python dataclass).
  • Logic: Defines backend operations (write/read logic, capable of calling vector databases, SQL, or LLMs).
  • Instruction: Defines prompt constants for how the Agent interacts with memory.
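
As a minimal sketch of how these three components could fit together in one Memory Program (all class and function names here are illustrative assumptions, not from the paper), with naive keyword matching standing in for the vector-database or SQL backends the Logic component may call:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:  # Schema: the data format for storage and retrieval
    key: str
    content: str
    tags: list = field(default_factory=list)

class MemoryLogic:  # Logic: backend write/read operations
    def __init__(self):
        self.store: list[MemoryEntry] = []

    def write(self, key: str, content: str, tags=None):
        self.store.append(MemoryEntry(key, content, tags or []))

    def read(self, query: str, top_k: int = 3) -> list[MemoryEntry]:
        # Naive substring match stands in for vector/SQL/LLM retrieval.
        hits = [e for e in self.store if query.lower() in e.content.lower()]
        return hits[:top_k]

# Instruction: a prompt constant telling the Agent how to interact with memory
INSTRUCTION = "Before answering, call read(query) and cite the retrieved entries."
```

Because all three parts live in one Python program, the evolution loop can patch any of them: change the Schema's fields, swap the retrieval backend in Logic, or rewrite the Instruction prompt.
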

Figure 2: M⋆ system overview, showing the iterative process from Seed Memory Program to Program Pool, through Evaluate, Reflect & Mutate, and Quality Checks

The system employs Reflective Code Evolution:

  1. Validation Loop Sampling: Evaluates current programs using static and rotating validation sets.
  2. Coding Agent Iteration: Based on execution traces and failure cases, the LLM analyzes root causes and generates code patches.
  3. Constraint Checking and Auto-Repair: Compilation checks, smoke tests, and runtime constraints (e.g., return no more than 3000 characters).

Simultaneously, it adopts a Population-based Search Strategy to balance exploration and exploitation, selecting high-scoring programs for mutation via softmax temperature sampling.
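
The selection step of that population-based search can be sketched as follows (a hypothetical implementation of softmax temperature sampling, not the paper's code): programs with higher validation scores are more likely to be picked for mutation, and the temperature knob trades exploitation (low values concentrate on the best program) against exploration (high values flatten the distribution).

```python
import math
import random

def softmax_sample(programs, scores, temperature=1.0, rng=random):
    """Pick one program to mutate, weighted by softmax over scores."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(programs, weights=probs, k=1)[0]
```
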

1.3 Experimental Results

Across four distinct benchmarks (LoCoMo conversation, ALFWorld embodied, HealthBench medical, PRBench legal/financial), M⋆ achieved the best performance in 7 out of 8 configurations:

Table 1: Main experimental results comparison (partial data), showing M⋆ significantly surpasses fixed memory baselines on most tasks

Figure 3: Evolution trajectory graph showing validation scores changing over iterations across multiple benchmarks, presenting a three-stage pattern: early structural error fixing, mid-term significant improvement, and late-stage fine-tuning

Key findings:

  • Structural Diversity: Different tasks evolve distinctly different memory structures (see Figure 4 t-SNE visualization). For instance, the best program for ALFWorld uses a simple list + LLM summary, while LoCoMo uses a hybrid design of SQL + ChromaDB.
  • Task Specificity: Cross-task transfer experiments show that applying a memory program evolved for Task A to Task B performs even worse than a generic baseline, confirming that memory structures must be co-optimized with their tasks.

Figure 4: Program embedding space visualization, where different colors represent different benchmarks, showing each task converging to different structural clusters (LLM-Centric, Semantic Search, Hybrid Retrieval, etc.)

II. AutoHarness: Automatically Generating Code Harness to Prevent Illegal Actions

2.1 Core Problem: The "Illegal Action" Dilemma of LLMs

Although LLMs perform excellently in code generation and mathematical reasoning, they frequently propose illegal actions in strictly defined environments (such as board games). In the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash failures stemmed from illegal moves.

Traditional solutions require manually writing constraint code (harness) for each game, which is laborious and error-prone. AutoHarness proposes letting the LLM automatically generate and optimize these code constraints.

2.2 Method: Tree Search + Thompson Sampling for Code Synthesis

AutoHarness models Harness generation as a program search problem, using Thompson-sampling-guided tree search to balance exploration (trying different logical structures) and exploitation (refining partially effective Harnesses).
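
The selection rule can be sketched like this (a hypothetical illustration of Thompson sampling over tree nodes, with invented names, not the paper's implementation): each candidate harness keeps success/failure counts from evaluation, and the node to refine next is chosen by sampling a plausible legal-action rate from each node's Beta posterior and taking the highest draw.

```python
import random

class HarnessNode:
    """A candidate harness program in the search tree.

    successes/failures count legal vs. illegal actions observed
    when this variant was evaluated in the environment.
    """
    def __init__(self, code: str):
        self.code = code
        self.successes = 1  # Beta(1, 1) uniform prior
        self.failures = 1

def thompson_select(nodes, rng=random):
    # Sample a legal-action rate from each node's Beta posterior,
    # then expand the node with the highest draw.
    return max(nodes, key=lambda n: rng.betavariate(n.successes, n.failures))
```

Because the draw is random rather than greedy, under-explored nodes with wide posteriors still get selected occasionally, which is how the search avoids locking onto one partially working Harness too early.
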

Figure 1: Code-as-harness learning framework, showing nodes (code variants) in a tree structure selected via Thompson sampling, with new code generated by a Refiner based on evaluator feedback

It supports three Harness modes:

  1. harness-as-action-filter: Generates a set of legal action candidates for the LLM to rank and select.
  2. harness-as-action-verifier (main experiment): LLM generates action → Code verifies legality → Retry if illegal.
  3. harness-as-policy: Implements the policy entirely in Python code, requiring no LLM calls during testing.
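
The main-experiment mode (harness-as-action-verifier) amounts to a generate-verify-retry loop, which can be sketched as follows. This is a hypothetical rendering, with `propose_action` standing in for the LLM and `is_legal` for the synthesized harness code:

```python
def verifier_loop(propose_action, is_legal, max_retries=3):
    """LLM proposes an action; harness code verifies legality; retry if illegal.

    propose_action(feedback) -> action  (placeholder for the LLM call)
    is_legal(action) -> bool            (placeholder for the generated harness)
    """
    feedback = None
    for _ in range(max_retries):
        action = propose_action(feedback)
        if is_legal(action):
            return action
        # Feed the rejection back so the next proposal can correct course.
        feedback = f"Illegal action: {action!r}. Try again."
    raise RuntimeError("No legal action found within the retry budget")
```

Note that in this mode the harness only needs to recognize legal actions, not produce them, which is a much easier program-synthesis target than the full harness-as-policy mode.
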

Key mechanisms:

  • Feedback-Driven: Environment returns whether an action is legal and reward signals.
  • Iterative Optimization: Based on error cases and traces, the LLM generates code patches (V4A format).
  • Compile-Repair Loop: Automatically handles syntax errors and runtime constraint violations.

2.3 Experimental Results

Comprehensive testing was conducted on 145 games in TextArena (excluding free-text conversation types):

Training Efficiency: On average, it takes only 14.5 tree search iterations to reach a 100% legal action rate, with 19 out of 32 games converging within 10 iterations.

Figure 2: Curves showing the heuristic value (legal action rate) changing with synthesis iterations for 6 representative games

Battle Performance (2P Games):

  • Gemini-2.5-Flash + Harness defeated Gemini-2.5-Pro in 9 of 16 games (overall win rate 56.3% vs. 38.2%).
  • This shows that a smaller model equipped with a specialized Harness can defeat a larger model.

Figure 3: Bar chart of win/draw/loss rates against Gemini-2.5-Pro in 16 two-player games (green for win, red for loss)

Single-Player Games (1P): Average reward of 0.745, surpassing Gemini-2.5-Pro (0.707) and GPT-5.2 (0.635).

Extreme Mode: Harness-as-Policy: When allowing the model to generate complete policy code (rather than just verifiers), it achieved an average reward of 0.870 on 16 single-player games, surpassing GPT-5.2-High (0.844), with near-zero inference cost (no LLM calls required).

Figure 5: Comparison of average rewards for different Agents in 16 TextArena 1P games, with Harness-as-Policy (orange) performing best

Final Thoughts

Stepping back from these two papers, a common trend emerges: research on Large Model Agents is shifting from "how to make models smarter" to "how to equip Agents with a more suitable Harness."

These two papers are far from the only reading in this direction: we also share a collection of 120 high-quality papers (with source code) in the field of Large Model Agents.

Designing AI Agents: Orchestration, Memory, Plugins, Workflow, Collaboration

Sharing Two Latest Claude Skills Papers with 3 Core Conclusions

A Learning Lobster is a Good Lobster: OpenClaw-RL

2026: Two Must-Read Annual Reviews for Agentic AI


AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.