A quick note: This is another survey about Harnesses. You may have recently read several articles and papers on this topic, possibly including my interpretation from last week, "Agent Harness Engineering: A Survey on the Chassis Engineering of Agents | CMU, Yale, Amazon".
Last week's "Agent Harness Survey" was more about answering a system architecture question: What should be wrapped around a truly usable Agent's exterior?
This latest survey from UIUC, Meta, and Stanford is concerned with a different question: When an Agent is placed into a long-term task environment, what is the operational object that truly strings together reasoning, action, feedback, verification, and collaboration?
Their answer: Code-based execution processes.
"Code" here does not refer to the Agent framework itself being written in code—that is common sense. It refers to a series of intermediate artifacts that the Agent continuously generates, runs, modifies, saves, and shares while performing tasks. In concrete engineering scenarios, this manifests as the Plan.md generated by Claude Code, or the output Skills.md, or Python files used for verification, and so on.
The original survey is 102 pages long with 478 references. This article will distill it for you, helping you quickly understand how these three top institutions have connected the underlying operational logic from Claude Code to robotic Agents.
How Does This Paper Define a Harness?
Before diving into technical details, we need to clarify a few core concepts.
Why do we need a "Scaffold (Harness)"?
A pure large language model is stateless; at its core, it only predicts the next token. To transform it into an "intelligent agent" capable of executing long-term tasks, we need to wrap a layer of software infrastructure around the model. This infrastructure includes:
- Tools and API interfaces
- Secure sandboxed execution environments
- Memory and context management systems
- Validators and permission boundaries
- Control loops for execution and feedback
This entire peripheral system is what researchers call an Agent Harness.
Why is "Code" the Best Scaffolding Medium?
The researchers point out that code possesses three core characteristics that natural language lacks:
Executable: Code can be run directly on a computer, producing clear, objective results.
Inspectable: The execution process generates stacks, logs, and errors, which can be precisely tracked and analyzed.
Stateful: Code environments (like file systems, databases) can persistently save task progress.
Based on these three characteristics, the researchers constructed a three-layer architecture to systematically deconstruct the role of code in agents: the Harness Interface Layer, the Harness Mechanism Layer, and the Multi-Agent Extension Layer.
The paper breaks down code as an agent harness into three layers: the interface layer allows code to carry reasoning, action, and environment modeling; the mechanism layer handles planning, memory, tools, control, and optimization; the multi-agent layer turns code repositories, tests, trajectories, and execution states into a collaborative substrate.
Layer One: The Harness Interface
At this layer, code acts as the fundamental interface for the agent to communicate with the real world. It manifests in three ways: for reasoning, for acting, and for environment modeling.
The core of the interface layer is connecting model outputs to executable programs, tool calls, state tracking, and feedback trajectories, making reasoning verifiable, actions implementable, and environmental changes observable.
The paper organizes representative works based on the different roles of code in reasoning, action, and environment modeling, showing how this layer has expanded from program-aided reasoning to robotic control, GUI/OS operations, and software engineering evaluation environments.
Code for Reasoning
Early agents typically relied on pure text "Chain of Thought (CoT)" for reasoning, but this often led to logical errors or computational inaccuracies. Transforming the reasoning process into code allows external interpreters or solvers to verify the logic.
Programmatic Delegated Reasoning: The model no longer directly outputs computation results but generates a Python script to be run by a Python interpreter. This method completely separates high-level logical decomposition from low-level precise computation.
Formal Verification and Symbolic Reasoning: Combining formal proof languages like Lean, each step of the agent's reasoning can be automatically verified by a machine verifier. This is crucial in mathematical theorem proving and high-security code verification.
Iterative Code-Based Reasoning: The agent operates a closed loop of "generate code -> run code -> obtain error feedback -> fix code," using real execution trajectories to guide the next direction of reasoning.
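To make the "generate code -> run code -> obtain error feedback -> fix code" loop concrete, here is a minimal sketch. It is not the paper's implementation: `fake_model` is a hypothetical stand-in for an LLM call (it returns a buggy draft first and a corrected one once it sees a traceback), and `run_python` and `solve` are illustrative names.

```python
from __future__ import annotations

import subprocess
import sys


def run_python(source: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run a snippet in a fresh interpreter; return (succeeded, stdout or stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=timeout,
    )
    ok = proc.returncode == 0
    return ok, (proc.stdout if ok else proc.stderr).strip()


def fake_model(task: str, feedback: str | None) -> str:
    """Hypothetical stand-in for an LLM: buggy first draft, fixed revision."""
    if feedback is None:
        return "print(sum(range(1, 101)) / 0)"  # draft containing a bug
    return "print(sum(range(1, 101)))"          # revision after reading the error


def solve(task: str, max_rounds: int = 3) -> tuple[str | None, list[str]]:
    """Generate, run, read the error, regenerate, until a run succeeds."""
    feedback, trace = None, []
    for _ in range(max_rounds):
        code = fake_model(task, feedback)
        ok, output = run_python(code)
        trace.append(output)
        if ok:
            return output, trace
        feedback = output  # the traceback becomes evidence for the next attempt
    return None, trace


answer, trace = solve("Sum the integers from 1 to 100")
print(answer)  # 5050
```

The point of the sketch is that the interpreter, not the model, decides whether a step succeeded, and the recorded `trace` is the execution trajectory that guides the next round.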
Code for Acting
When an agent needs to interact with the physical world (robots) or the digital world (software GUIs), code becomes its execution vehicle.
Environment-Constrained Skill Selection: The agent does not directly generate low-level physical control commands. Instead, it selects from a library of pre-built skills whose feasibility in the current environment is checked first (e.g., the SayCan system), which keeps the chosen actions executable.
Programmatic Policy Generation: The agent directly writes control scripts containing conditional branches and loops. For example, it can generate Python code for a complete behavior tree to finely control the movement of a robotic arm.
Lifelong Code Agents: Over long-term operation, the agent continuously encapsulates successful problem-solving operations into new code functions, storing them in a long-term "skill library" (like the famous Voyager system), achieving continuous capability evolution.
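The skill-library idea can be sketched in a few lines. This is an illustrative design, not Voyager's actual code: `SkillLibrary`, its methods, and the `clamp` skill are all hypothetical names, and a real harness would run the `exec` step inside a sandbox.

```python
from __future__ import annotations

from typing import Callable


class SkillLibrary:
    """Stores verified code snippets as named, reusable skills (hypothetical design)."""

    def __init__(self) -> None:
        self._skills: dict[str, Callable] = {}
        self._sources: dict[str, str] = {}

    def add(self, name: str, source: str, check: Callable[[Callable], bool]) -> bool:
        """Compile the source and keep the skill only if it passes its check."""
        namespace: dict = {}
        exec(source, namespace)  # NOTE: unsandboxed here; an assumption for brevity
        fn = namespace[name]
        if not check(fn):
            return False
        self._skills[name] = fn
        self._sources[name] = source
        return True

    def call(self, name: str, *args):
        return self._skills[name](*args)

    def names(self) -> list[str]:
        return sorted(self._skills)


library = SkillLibrary()
accepted = library.add(
    "clamp",
    "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
    check=lambda f: f(5, 0, 3) == 3 and f(-1, 0, 3) == 0,
)
rejected = library.add(
    "broken",
    "def broken(x):\n    return x + 1\n",
    check=lambda f: f(1) == 1,  # fails, so the skill is not stored
)
print(accepted, rejected, library.names())  # True False ['clamp']
```

Only code that has passed verification enters the library, which is what lets later tasks reuse a skill without re-deriving it.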
Code for Environment Modeling
The state of an environment is often complex and dynamic, and it is difficult to describe precisely using pure text. Code can concretize the environment into operable objects.
Structured World Representation: Using classes, object relationships, or tree structures in code (like a webpage's DOM tree) to accurately depict the spatial and logical structure of the current environment.
Execution Trace-Based World Modeling: The agent reads run logs and test results of the code to infer what changes have occurred in the environment state, thereby building a predictive model of the environment's dynamics.
Verifiable Environment Construction: Using code engineering methods like unit tests and mocks to build a miniature world with objective right/wrong judgment criteria for the agent.
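As a small illustration of a verifiable environment, the sketch below uses Python's standard `unittest` module as the objective judge. The function under test (`slugify`) and the contract class are hypothetical examples, not taken from the paper.

```python
import unittest


def slugify(title: str) -> str:
    """Hypothetical agent-written function under evaluation."""
    return "-".join(title.lower().split())


class SlugifyContract(unittest.TestCase):
    """The rules of the miniature world: each test is a pass/fail criterion."""

    def test_lowercases_and_joins(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_whitespace(self):
        self.assertEqual(slugify("  A   B "), "a-b")


def judge(contract) -> tuple:
    """Run the contract and return (tests_run, tests_failed)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(contract)
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun, len(result.failures) + len(result.errors)


verdict = judge(SlugifyContract)
print(verdict)  # (2, 0)
```

The agent never has to argue that its output is correct; the environment returns a verdict that any other component can re-run and check.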
Layer Two: Harness Mechanisms
With the foundational interface in place, the agent also needs a complex set of mechanisms to ensure it doesn't crash during tasks lasting hours or even days. The researchers categorize these mechanisms into five modules.
The mechanism layer covers five types of problems: planning, memory, tool use, control loops, and harness optimization. It emphasizes that agent reliability comes from the joint action of model judgment, variable task states, and governed runtime infrastructure.
Agent Planning Mechanisms
To handle complex software engineering tasks, an agent must have a clear execution path. Planning mechanisms can be single-path step decomposition, or they can use explicit structures, multi-path search, or system-level workflow orchestration to control the execution trajectory of long-duration tasks.
Linear Decomposition Planning: Breaking down large tasks into a linear list of steps (like generating a PLAN.md file), with the agent strictly following the steps to generate code.
Structure-Based Planning: Using the dependency graph of a code repository (AST, class relationship diagrams) to guide the sequence of operations. The agent can know which other files will be affected by modifying a function, thus formulating a safer modification plan.
Search-Based Planning: Introducing algorithms like Monte Carlo Tree Search (MCTS). When generating code, it explores multiple possible branches, and when hitting a dead end, it can backtrack using error information.
Orchestration-Based Planning: Dividing the task into different pipeline stages like understanding, retrieval, coding, and testing, controlling the agent's next step through system-level process scheduling.
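Structure-based planning is the easiest of these to show in code. The sketch below assumes a hand-written import graph (the file names are invented) and uses the standard-library `graphlib` module to answer two questions: which files are affected by a change, and in what order should they be revisited?

```python
from graphlib import TopologicalSorter

# Hypothetical import graph: each file maps to the files it imports.
imports = {
    "api.py": {"service.py"},
    "service.py": {"models.py", "utils.py"},
    "report.py": {"models.py"},
    "models.py": set(),
    "utils.py": set(),
}


def impacted_by(changed, graph):
    """Return the changed file plus every file that transitively imports it."""
    dependents = {}
    for file, deps in graph.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(file)
    seen, stack = {changed}, [changed]
    while stack:
        for parent in dependents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def edit_order(changed, graph):
    """Order the affected files so dependencies are edited before their users."""
    affected = impacted_by(changed, graph)
    subgraph = {f: graph[f] & affected for f in affected}
    return list(TopologicalSorter(subgraph).static_order())


order = edit_order("models.py", imports)
print(order)  # starts with models.py; api.py always comes after service.py
```

Changing `models.py` touches four of the five files, while `utils.py` is left out of the plan entirely, which is the kind of scoping a safer modification plan relies on.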
Memory and Context Engineering
When dealing with codebases at the scale of millions of lines, large models are extremely prone to being constrained by context length limitations, thus requiring robust memory management solutions. The memory layer unifies working memory, semantic memory, experiential memory, long-term memory, multi-agent memory, and context compression into the same state governance problem. The goal is to retain evidence that truly affects task success within a limited context window.
Working Memory: Strictly managing the local state currently being edited (such as the line numbers of current files, recent error logs) to prevent the context from being flooded with irrelevant information.
Semantic Memory: Using Retrieval-Augmented Generation (RAG) techniques to accurately pull relevant class definitions, API interfaces, and historical documents from vast code repositories.
Experiential and Long-Term Memory: The agent transforms past bug-fixing experiences and successfully validated patches into structured experience bases, enabling cross-task knowledge reuse.
Context Compression and Offloading: When logs become too long, the system automatically performs coarse and fine-grained compression, or saves the full logs to external files, keeping only key summaries in the prompt.
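Context offloading in particular is simple to sketch. The example below is an assumption-laden toy: the function name `offload_log`, the file name `full_run.log`, and the "count errors and keep the tail" summary rule are all illustrative choices, not something the paper prescribes.

```python
import tempfile
from pathlib import Path


def offload_log(log: str, directory: Path, keep_tail: int = 3) -> str:
    """Write the full log to disk and return a short summary for the prompt."""
    lines = log.splitlines()
    path = directory / "full_run.log"
    path.write_text(log, encoding="utf-8")
    errors = [line for line in lines if "ERROR" in line]
    return "\n".join([
        f"[{len(lines)} log lines offloaded to {path.name}]",
        f"errors: {len(errors)}",
        *lines[-keep_tail:],  # the most recent lines usually carry the failure
    ])


workdir = Path(tempfile.mkdtemp())
log = "\n".join(
    [f"INFO step {i} ok" for i in range(200)]
    + ["ERROR test_login failed: expected 200, got 500"]
)
summary = offload_log(log, workdir)
print(summary)
```

The prompt now carries five lines instead of 201, and the full evidence remains on disk where the agent can reopen it if the summary turns out to be insufficient.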
Tool Use
Tools are the means by which agents change the external world, but within a code harness, tool use must be strictly governed.
The tool layer includes not just function calls and external APIs, but also terminals, repositories, sandboxes, validators, and multi-step workflows. The key issue is making tools discoverable, callable, auditable, and capable of recovery upon failure.
Function-Oriented Tools: Used to supplement knowledge the model lacks, such as calling external APIs to search documentation or query specific library function usage.
Environment Interaction Tools: Allow the agent to operate directly in a real environment, such as executing terminal commands (Shell), reading/writing files, and navigating code repositories.
Verification-Driven Tools: Using code linters, type checkers, or unit testing frameworks to provide deterministic, objective feedback on the agent's output.
Workflow Orchestration Tools: Responsible for scheduling the calling sequence of multiple sub-tools and handling exception recovery when tool calls fail.
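The four properties named above (discoverable, callable, auditable, recoverable) can be sketched with a tiny registry. Everything here is hypothetical: `ToolRegistry`, the `search_docs` tool, and the simple retry policy are illustrative assumptions rather than any framework's API.

```python
from __future__ import annotations

from typing import Callable


class ToolRegistry:
    """Hypothetical registry that makes tools listable, callable, and audited."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, Callable]] = {}
        self.audit_log: list[tuple[str, int, str]] = []

    def register(self, name: str, description: str, fn: Callable) -> None:
        self._tools[name] = (description, fn)

    def describe(self) -> dict[str, str]:
        """Discoverability: what the model is shown when choosing a tool."""
        return {name: desc for name, (desc, _) in self._tools.items()}

    def call(self, name: str, *args, retries: int = 1):
        """Call a tool, logging every attempt and retrying on failure."""
        _, fn = self._tools[name]
        for attempt in range(1, retries + 2):
            try:
                result = fn(*args)
                self.audit_log.append((name, attempt, "ok"))
                return result
            except Exception as exc:
                self.audit_log.append((name, attempt, f"error: {exc}"))
        raise RuntimeError(f"tool {name!r} failed after {retries + 1} attempts")


calls = {"n": 0}


def flaky_search(query: str) -> str:
    """Simulated tool that times out on its first call."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("upstream timed out")
    return f"docs for {query}"


registry = ToolRegistry()
registry.register("search_docs", "Look up library documentation", flaky_search)
result = registry.call("search_docs", "pathlib")
print(result)              # docs for pathlib
print(registry.audit_log)  # one failed attempt, then one successful attempt
```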
The Plan-Execute-Verify Loop (PEV Loop)
The researchers point out that an Agent's debugging process is essentially a cybernetic problem and should be framed as the PEV Loop (Plan-Execute-Verify). The PEV loop organizes planning, sandboxed execution, static/dynamic verification, and permission control into a repeatable state transition process, allowing every modification by the agent to be observed, judged, and, if necessary, rolled back or escalated to a human.
Plan: Translating user requirements into explicit operational contracts, determining the scope of modification.
Execute: Must run in a Sandboxed Execution environment. Through isolated file systems and tiered permission controls, ensure the agent's destructive operations do not affect the host machine's security.
Verify: Utilizing static analysis and dynamic testing as "deterministic sensors." If tests fail, the agent must fix the issue based on logs; if high-risk operations are involved, human approval (Human-in-the-loop) must be mandatorily engaged.
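A single PEV transition can be sketched as a pure function over workspace state. This is a simplified illustration under stated assumptions: the workspace is an in-memory dict of file contents, the "sandbox" is only a deep copy (not real isolation), and `pev_step`, the plan format, and the file names are all hypothetical.

```python
import copy


def pev_step(workspace, plan, verify, high_risk_paths=frozenset()):
    """One Plan-Execute-Verify transition; returns (new_state, status)."""
    # Plan: the contract names exactly which files may change.
    if set(plan["edits"]) & set(high_risk_paths):
        return workspace, "escalate_to_human"
    # Execute: apply edits to a copy, so the original is the rollback point.
    sandbox = copy.deepcopy(workspace)
    sandbox.update(plan["edits"])
    # Verify: a deterministic check decides whether the state advances.
    if verify(sandbox):
        return sandbox, "committed"
    return workspace, "rolled_back"


def verify(files):
    """Deterministic sensor: load calc.py and test add()."""
    namespace = {}
    exec(files["calc.py"], namespace)
    return namespace["add"](2, 3) == 5


workspace = {"calc.py": "def add(a, b):\n    return a - b\n"}
bad_plan = {"edits": {"calc.py": "def add(a, b):\n    return a * b\n"}}
good_plan = {"edits": {"calc.py": "def add(a, b):\n    return a + b\n"}}
risky_plan = {"edits": {"deploy.yaml": "replicas: 0"}}

state, status_bad = pev_step(workspace, bad_plan, verify)
state, status_good = pev_step(state, good_plan, verify)
_, status_risky = pev_step(state, risky_plan, verify, high_risk_paths={"deploy.yaml"})
print(status_bad, status_good, status_risky)
# rolled_back committed escalate_to_human
```

Each call either advances the state, restores the previous one, or hands control to a human, which is the "repeatable state transition" framing described above.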
Agentic Harness Engineering
This is an extremely avant-garde concept proposed by the paper. The system should not only stop at fixing the code itself but should also be able to automatically optimize the "scaffolding" surrounding the model. Agentic harness engineering treats prompts, retrieval strategies, tool descriptions, validators, permission rules, and workflows themselves as optimizable objects; but these modifications must be constrained by trajectory replay, held-out task evaluation, and governance rules.
Deep Telemetry: Comprehensively recording the agent's Token consumption, latency, tool call success rates, and complete execution trajectories.
Evolution Agent: A meta-level agent is specifically set up. It does not write business code but analyzes telemetry data to automatically modify retrieval strategies, update prompt templates, or refactor sandbox rules, thereby making the entire system increasingly stable.
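The held-out evaluation constraint can be illustrated with a toy gate. Nothing below comes from the paper: the harness is reduced to a single hypothetical setting (`retrieval_k`), and `run_task` is a simulated success rule in which a task succeeds only when enough, but not too much, context is retrieved.

```python
def run_task(harness, task):
    """Simulated outcome: success needs enough context, but not too much."""
    return task["needs_k"] <= harness["retrieval_k"] <= task["max_k"]


def success_rate(harness, tasks):
    return sum(run_task(harness, task) for task in tasks) / len(tasks)


def accept_change(current, candidate, dev_tasks, held_out_tasks):
    """Accept a harness edit only if it helps on dev tasks and does not
    regress on tasks the evolution agent never saw."""
    improves = success_rate(candidate, dev_tasks) > success_rate(current, dev_tasks)
    no_regression = (
        success_rate(candidate, held_out_tasks) >= success_rate(current, held_out_tasks)
    )
    return improves and no_regression


dev_tasks = [{"needs_k": 5, "max_k": 50}, {"needs_k": 8, "max_k": 50}]
held_out_tasks = [{"needs_k": 2, "max_k": 10}, {"needs_k": 3, "max_k": 6}]

current = {"retrieval_k": 4}
aggressive = {"retrieval_k": 8}  # solves both dev tasks, breaks a held-out one
modest = {"retrieval_k": 5}      # solves one dev task, keeps held-out intact

print(accept_change(current, aggressive, dev_tasks, held_out_tasks))  # False
print(accept_change(current, modest, dev_tasks, held_out_tasks))      # True
```

The aggressive edit looks best on the tasks the evolution agent optimized against, and that is exactly why it is rejected: the gate is what keeps self-modification from overfitting the harness.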
Layer Three: Extending the Harness
Facing real-world, extremely complex enterprise-level demands, the context and capabilities of a single agent easily hit a bottleneck. Introducing Multi-Agent Systems (MAS) is an inevitable trend. At this stage, code formally becomes the "shared substrate" for communication, collaboration, and consensus-building among agents.
The multi-agent extension layer alleviates the bottlenecks of individual agents in context, specialized capabilities, and self-correction through role specialization, shared code substrates, execution feedback, and adaptive collaboration topologies.
Role Specialization
The system mimics a human software development team, splitting into highly specialized roles:
Coder: Responsible for specific code writing.
Tester: Specializes in writing tricky test cases, deliberately seeking vulnerabilities in the Coder's code.
Reviewer: Reviews code at the architecture and specification level.
Executor: Responsible for running code in a sandbox and collecting objective error logs.
Manager: Responsible for global task decomposition and process scheduling.
Interaction Modes
Pair Programming (Collaborative Synthesis): Two agents build code together, one responsible for navigation and planning, the other for specific implementation.
Critique and Repair: The most common mode, where a verifying agent proposes criticism and the programming agent modifies accordingly.
Adversarial Validation: Using techniques like Fuzzing to generate extreme inputs that deliberately trigger crashes, feeding the crash trajectory back to the coder.
Reasoning Debate: When multiple agents have disagreements on requirement understanding or code standards, they reach a consensus through multiple rounds of dialogue.
The Core Battleground: Shared Program State
The researchers sharply point out that many current multi-agent systems rely solely on "chat logs" to transmit information, which leads to severe "state divergence," where different agents have cognitive misalignments about the current state of the code.
Future multi-agent systems must establish an objective, code-based global shared state, whether through a real Git repository, an in-memory Blackboard architecture, or the complete execution context.
"Consensus" should not just be agents telling each other "it looks good," but must be the objective passing of all tests, clean static checks, and performance benchmarks being met.
The paper further breaks down multi-agent collaboration into four categories: workflow collaboration, shared repository state, execution verification, and adaptive coordination. It emphasizes that collaboration must land on inspectable program states, not just stay within chat logs.
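To show the difference between "chat consensus" and consensus on inspectable state, here is a minimal blackboard sketch. The `SharedState` class, the check names, and the commit IDs are hypothetical; the only idea borrowed from the text is that agreement means required checks passing against the same version of the code.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical blackboard: agents post check results tied to a commit."""
    commit: str
    checks: dict = field(default_factory=dict)

    def report(self, agent: str, check: str, passed: bool, commit: str) -> None:
        """Reject reports about a version of the code that is no longer current."""
        if commit != self.commit:
            raise ValueError(f"{agent} reported on stale commit {commit}")
        self.checks[check] = passed

    def consensus(self, required) -> bool:
        """Consensus means every required check has passed, not that agents agree."""
        return all(self.checks.get(name) is True for name in required)


REQUIRED = ("tests", "static_checks", "benchmark")

state = SharedState(commit="abc123")
state.report("tester", "tests", True, commit="abc123")
state.report("reviewer", "static_checks", True, commit="abc123")
first = state.consensus(REQUIRED)   # benchmark has not been reported yet

state.report("executor", "benchmark", True, commit="abc123")
second = state.consensus(REQUIRED)
print(first, second)  # False True
```

Tying each report to a commit is one simple guard against the state divergence described above: an agent working from an outdated view of the code cannot contribute to consensus.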
Five Frontier Application Areas
The concept of "Code as Agent Harness" has already blossomed in the following five real-world application scenarios:
The paper summarizes the application scenarios as Code Assistants, GUI/OS Agents, Scientific Discovery, Personalization, and Embodied Agents, indicating that the code harness is expanding from software engineering to digital interfaces, research pipelines, and physical world control.
AI Code Assistants
These have evolved from early simple code completion (like the early Copilot) into "automated R&D employees" capable of handling GitHub Issues (like SWE-agent, OpenHands). They can autonomously pull code, read errors, iterate by trial and error in a local sandbox, and ultimately submit a Pull Request. In this process, the sandbox, testing framework, and Git version control are their harness.
GUI/Operating System Agents
Click operations on desktop or mobile screens are being transformed into executable code scripts (like Playwright scripts or DOM tree operations). Agents perceive the environment by reading the screen's HTML structure or Accessibility Tree and output Python code to execute clicks and swipes. The UI becomes a world manipulated by code.
Scientific Discovery
In automated laboratories, the research process is integrated into a seamless "code pipeline." From literature search and hypothesis proposal, to writing Python simulation programs, controlling real liquid handling robots for chemical synthesis, and processing experimental data—code runs through every link of scientific research (like the AI Scientist system).
Personalization Engines
Agents can automatically write and modify the policy code of recommendation systems based on real-time user feedback, and distill user preferences into persistent, program-readable state objects.
Embodied Agents
In the field of robotics, abstract action intentions are transformed into executable control code with kinematic parameters. Code acts as a safety boundary, constraining the robot's actions (like robotic arm grasping) to what is physically feasible, and the control code is debugged in a simulator before it enters the real physical world.
Pressing Challenges and Open Questions
Despite the vast prospects, the researchers soberly point out several core challenges currently facing the field:
The Bottleneck of Evaluation Metrics (Evaluation Beyond Final Success): Current evaluations mostly only look at whether "test cases pass." But this cannot distinguish whether the agent wrote elegant, high-quality code, or merely "patched over" the tests with a pile of fixes that destroyed the original system architecture. We need deeper semantic and architectural-level evaluations.
Incomplete Execution Feedback (Verification Under Incomplete Feedback): Sometimes code runs, but may contain security vulnerabilities or performance pitfalls. Current validators are still very weak when dealing with such non-functional requirements.
Regression-Free Self-Evolution: When a system attempts to automatically modify the harness or refactor code, it is extremely prone to regressions, a software analogue of "catastrophic forgetting": fixing one old bug while introducing ten new ones.
Semantic Conflict Resolution in Multi-Agent Concurrency: When multiple agents modify different parts of the same codebase simultaneously, how should implicit conflicts in the underlying business logic be resolved? Current text merging tools (like Git merge) detect textual conflicts but cannot resolve deep logical fractures.
Safety Accountability and Human Oversight (Human-in-the-Loop Safety): When code agents are granted permission to directly operate production environments or even physical devices, we must establish an unbreakable interception mechanism to ensure humans have absolute veto power over high-risk operations.
Conclusion
This paper provides the AI field with an exceptionally clear and historically significant blueprint for "Agent Systems Engineering."
To truly bring AI into the complex real world, we must not rely solely on improving the computing power of large models. We must use "Code" as the system's skeleton, nerves, and muscles. Large language models provide a powerful "brain," while the code-based Agent Harness gives this brain a stable sandbox, real physical feedback, reliable memory mechanisms, and organizational principles for multi-role collaboration. Only by being deeply rooted in this "executable, inspectable, and stateful" code substrate can AI agents truly transform from demo-level toys into industrial-grade reliable productivity.
The future is here; let's journey together!