In today's rapidly evolving landscape of LLM Agents, designing an appropriate Harness (the constraint and guidance system around the model) has become a critical challenge. Today, we share two cutting-edge papers that propose automated Harness evolution along two distinct dimensions: memory systems and action constraints.
One paper from Microsoft, titled M⋆, focuses on equipping each task with its own exclusive memory Harness structure. The other, from Google, named AutoHarness, is dedicated to automatically generating code-level constraints to prevent illegal actions.
To be honest, my immediate reaction after reading these two papers was: the winds of AI research have genuinely shifted toward self-evolving Agents.
For those who want to delve deeper into this direction, we have compiled a collection of frontier papers and code covering Self-evolving Skills, Agent Systems, World Models, Context, Harness, and more.
I. M⋆: Every Task Deserves Its Own Exclusive Memory Harness
1.1 Core Problem: Limitations of Fixed Memory Structures
Current LLM Agent memory systems often adopt a "one-size-fits-all" design—whether it's semantic retrieval for conversational agents, skill systems for coding agents, or structured databases for professional domains. The problem is: memory designs optimized for one domain often fail to transfer to others.
As shown in Figure 1, conversational tasks (LoCoMo) require entity relationship graphs to track character relations, legal queries (PRBench) need relational databases to store precedents, while embodied intelligence (ALFWorld) requires trajectory lookup tables. These structural differences are vast and cannot be solved by a single generic solution.
1.2 Method: Executable Program Evolution
M⋆ represents the memory Harness as a Python Memory Program, containing three core components:
- Schema: defines the data format for storage and retrieval (using Python dataclasses).
- Logic: defines backend operations (write/read logic, which may call vector databases, SQL, or LLMs).
- Instruction: defines prompt constants for how the Agent interacts with memory.
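The three components above can be sketched in miniature. This is an illustrative toy, not the paper's actual API: the names `MemoryProgram`, `Note`, `write`, and `read` are assumptions, and keyword matching stands in for a real vector-database backend.

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    # Schema: the data format for stored entries
    speaker: str
    content: str

@dataclass
class MemoryProgram:
    # Logic: backend state; a real evolved program might use ChromaDB or SQL
    notes: list = field(default_factory=list)

    # Instruction: prompt constant telling the Agent how to use memory
    INSTRUCTION = "Before answering, call read(query) to recall relevant notes."

    def write(self, speaker: str, content: str) -> None:
        self.notes.append(Note(speaker, content))

    def read(self, query: str, k: int = 3) -> list:
        # Naive keyword retrieval standing in for vector search
        hits = [n for n in self.notes if query.lower() in n.content.lower()]
        return hits[:k]

mem = MemoryProgram()
mem.write("Alice", "Alice adopted a cat named Miso.")
mem.write("Bob", "Bob moved to Berlin last spring.")
print([n.speaker for n in mem.read("cat")])  # -> ['Alice']
```

Because the whole program is ordinary Python, the coding agent can mutate any of the three parts independently: swap the retrieval logic, extend the schema, or rewrite the instruction string.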
The system employs Reflective Code Evolution:
- Validation loop sampling: evaluates current programs on static and rotating validation sets.
- Coding-agent iteration: based on execution traces and failure cases, the LLM analyzes root causes and generates code patches.
- Constraint checking and auto-repair: compilation checks, smoke tests, and runtime constraints (e.g., returning no more than 3,000 characters).
Simultaneously, it adopts a Population-based Search Strategy to balance exploration and exploitation, selecting high-scoring programs for mutation via softmax temperature sampling.
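The selection step of the population search can be sketched as follows. The scores, temperature value, and function name `softmax_sample` are illustrative assumptions, not the paper's actual hyperparameters.

```python
import math
import random

def softmax_sample(population, scores, temperature=0.5, rng=random):
    """Pick one program from the population, biased toward high validation scores."""
    logits = [s / temperature for s in scores]
    m = max(logits)                                  # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(population, weights=probs, k=1)[0]

programs = ["prog_a", "prog_b", "prog_c"]
scores = [0.62, 0.71, 0.40]          # validation scores of the current population
parent = softmax_sample(programs, scores)  # high scorers are chosen more often
```

A lower temperature sharpens the distribution toward exploitation of the best programs; a higher one flattens it toward uniform exploration.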
1.3 Experimental Results
Across four distinct benchmarks (LoCoMo conversation, ALFWorld embodied, HealthBench medical, PRBench legal/financial), M⋆ achieved the best performance in 7 out of 8 configurations:
Table 1: Main experimental results comparison (partial data), showing M⋆ significantly surpasses fixed memory baselines on most tasks
Key findings:
- Structural diversity: different tasks evolve distinctly different memory structures (see the t-SNE visualization in Figure 4). For instance, the best program for ALFWorld uses a simple list plus LLM summarization, while LoCoMo's uses a hybrid SQL + ChromaDB design.
- Task specificity: cross-task transfer experiments show that a memory program evolved for task A, when applied to task B, performs even worse than a generic baseline, confirming that memory structures must be co-optimized with their tasks.
II. AutoHarness: Automatically Generating Code Harness to Prevent Illegal Actions
2.1 Core Problem: The "Illegal Action" Dilemma of LLMs
Although LLMs perform excellently in code generation and mathematical reasoning, they frequently propose illegal actions in strictly defined environments (such as board games). In the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash failures stemmed from illegal moves.
Traditional solutions require manually writing constraint code (harness) for each game, which is laborious and error-prone. AutoHarness proposes letting the LLM automatically generate and optimize these code constraints.
2.2 Method: Tree Search + Thompson Sampling for Code Synthesis
AutoHarness models Harness generation as a program-search problem and uses Thompson-sampling-guided tree search to balance exploration (trying different logical structures) and exploitation (refining partially effective Harnesses).
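The Thompson-sampling idea can be illustrated as a simple Beta-Bernoulli bandit over harness candidates, where each candidate's posterior tracks its legal-action success rate. This is a minimal sketch under assumed names (`ThompsonSelector`, `true_rates`); the paper's actual search operates over a tree of code edits, not a flat set of arms.

```python
import random

class ThompsonSelector:
    def __init__(self, n_arms, rng=None):
        self.successes = [1] * n_arms   # Beta(1, 1) uniform prior per candidate
        self.failures = [1] * n_arms
        self.rng = rng or random.Random()

    def select(self):
        # Sample a plausible success rate from each posterior; pick the best draw
        samples = [self.rng.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, legal: bool):
        if legal:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

selector = ThompsonSelector(n_arms=3, rng=random.Random(42))
true_rates = [0.2, 0.9, 0.5]   # hidden legal-action rates of three candidates
for _ in range(500):
    arm = selector.select()
    selector.update(arm, selector.rng.random() < true_rates[arm])
```

Early on, all candidates are tried; as evidence accumulates, draws concentrate on the candidate with the highest observed legality rate, which is exactly the exploration/exploitation trade-off the search needs.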
It supports three Harness modes:
- harness-as-action-filter: generates a set of legal-action candidates for the LLM to rank and select from.
- harness-as-action-verifier (used in the main experiments): the LLM generates an action, code verifies its legality, and illegal actions trigger a retry.
- harness-as-policy: implements the policy entirely in Python code, requiring no LLM calls at test time.
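The verifier mode, the one used in the main experiments, reduces to a short retry loop. This is a toy sketch: `legal_moves`, `is_legal`, and the hard-coded move set are stand-ins for the synthesized verifier, and the proposal iterator stands in for repeated LLM calls.

```python
def legal_moves(board):
    # Stand-in for game logic; a real verifier would compute this from the board
    return {"e2e4", "d2d4", "g1f3"}

def is_legal(action, board):
    # Stand-in for the auto-generated verifier code
    return action in legal_moves(board)

def verified_action(proposals, board, max_retries=3):
    """Accept the first legal proposal; give up after max_retries attempts."""
    for _attempt, action in zip(range(max_retries), proposals):
        if is_legal(action, board):
            return action
    return None  # caller would fall back, e.g., to a random legal move

# The "LLM" first proposes an illegal move, then a legal one on retry
chosen = verified_action(iter(["e9e9", "e2e4"]), board=None)
print(chosen)  # -> e2e4
```

The appeal of this mode is that the LLM stays in charge of strategy while the generated code only gates legality, so a correct verifier drives the illegal-action rate to zero without constraining playing strength.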
Key mechanisms:
- Feedback-driven: the environment returns whether an action is legal, along with reward signals.
- Iterative optimization: based on error cases and traces, the LLM generates code patches (V4A format).
- Compile-repair loop: automatically handles syntax errors and runtime constraint violations.
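The compile-check half of the compile-repair loop can be sketched with Python's built-in `compile`. The function name `compile_check` is an assumption; in the real loop the error message would be fed back to the LLM to produce a patch, which is elided here.

```python
def compile_check(source: str):
    """Return (ok, error_message) for a candidate harness program."""
    try:
        compile(source, "<harness>", "exec")
        return True, ""
    except SyntaxError as e:
        return False, f"SyntaxError line {e.lineno}: {e.msg}"

broken = "def verify(action)\n    return True"   # missing colon
ok, err = compile_check(broken)
# In the real loop, `err` becomes feedback for the next LLM-generated patch
fixed = "def verify(action):\n    return True"
ok2, _ = compile_check(fixed)
```

Catching syntax errors before any environment rollout keeps the expensive game-playing evaluations reserved for programs that at least parse.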
2.3 Experimental Results
Comprehensive testing was conducted on 145 games in TextArena (excluding free-text conversation types):
Training Efficiency: On average, it takes only 14.5 tree search iterations to reach a 100% legal action rate, with 19 out of 32 games converging within 10 iterations.
Battle Performance (2P Games):
Gemini-2.5-Flash + Harness won 9 of 16 games against Gemini-2.5-Pro (overall win rate 56.3% vs 38.2%), showing that a smaller model equipped with a specialized Harness can defeat a larger one.
Single-Player Games (1P): Average reward of 0.745, surpassing Gemini-2.5-Pro (0.707) and GPT-5.2 (0.635).
Extreme Mode: Harness-as-Policy: When allowing the model to generate complete policy code (rather than just verifiers), it achieved an average reward of 0.870 on 16 single-player games, surpassing GPT-5.2-High (0.844), with near-zero inference cost (no LLM calls required).
Final Thoughts
Looking back at these two papers, a common trend emerges: research on large-model Agents is shifting from "how to make models smarter" to "how to equip Agents with a more suitable Harness."
To go deeper in this direction, there are far more than just these two papers to read. We also share a collection of 120 high-quality papers (with source code) in the field of Large Model Agents.