Swapping to the latest base models often fails to yield a qualitative leap in Agent performance. Conversely, equipping the same model with persistent memory, reusable skill documentation, and standardized tool interfaces delivers immediate, tangible results. Anyone involved in Agent engineering is likely familiar with this sensation: what lies outside the model often matters more than the model itself. But is there a unified framework to explain this phenomenon? A 54-page review paper from a team at Shanghai Jiao Tong University provides the answer: Externalization.
Recently, researchers from Shanghai Jiao Tong University, Sun Yat-sen University, Shanghai Institute for Advanced Study, Carnegie Mellon University, and OPPO submitted a comprehensive review to arXiv on April 9, 2026. For the first time, it systematically organizes the four pillars of LLM Agents—Memory, Skills, Protocols, and Harness Engineering—through the unified lens of "Externalization." The core thesis is clear: The actual progress of Agents increasingly depends on external cognitive infrastructure rather than improvements in the model's inherent capabilities.
Paper Title: Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Affiliations: Shanghai Jiao Tong University, Sun Yat-sen University, Shanghai Institute for Advanced Study, Carnegie Mellon University, OPPO
Paper Link: https://arxiv.org/abs/2604.08224 (Submitted April 9, 2026)
Authors: The lead author is Zhou Chenyu, a PhD student at Shanghai Jiao Tong University. Corresponding authors include Dr. Wang Jun from OPPO Research Institute, and Professors Liu Weiwen, Lin Jianghao, and Zhang Weinan from Shanghai Jiao Tong University.
Figure 1: Externalization as the Organizing Principle for LLM Agent Design
Models Are Strong, But Agents Remain Unreliable: Where Is the Contradiction?
Over the past two years, the parameter scale and reasoning capabilities of large models have continued to climb. However, engineers familiar with Agent deployment share a common experience: upgrading to a stronger base model often yields less significant improvements than enhancing external infrastructure. Persistent memory, reusable skills, standardized tool interfaces, sandbox constraints, and execution logs—these elements "outside the model" are increasingly determining whether an Agent is truly usable.
The paper attributes this phenomenon to three structural mismatches:
Continuity Mismatch: Context windows are limited and ephemeral; models cannot stably maintain state across sessions. Every session starts anew, requiring previously accumulated context to be rebuilt from scratch.
Consistency Mismatch: Complex multi-step processes are often re-derived rather than executed stably. For the same task, execution paths and quality vary depending on when they are invoked.
Coordination Mismatch: Interactions with tools, services, and other Agents rely on temporary agreements that are fragile and non-portable. Once an interface changes, the entire call chain may fail simultaneously.
The paper draws on cognitive scientist Don Norman's theory of "Cognitive Artifacts" to explain this. For instance, a shopping list does not expand human memory capacity but transforms the problem of "recall" into one of "recognition." A map does not improve a person's navigational ability as such; it makes spatial relationships visible rather than implicit. The power of external artifacts lies in Representational Transformation: they reorganize the form of the problem, allowing the subject to solve it more reliably with existing capabilities.
The same logic is unfolding in LLM Agents. The paper's core argument is that externalization is the unifying logic for understanding recent architectural evolution in Agents, not merely a pile of engineering tricks.
From Weights to Harness: Three Shifts in the Carrier of Capability
Weights Layer (2022–2023): Capability was nearly synonymous with model parameters, dominated by scaling laws. This laid the foundation, but knowledge was hard to update selectively, behavior was difficult to audit, and personalization was nearly impossible.
Context Layer (2023–2024): Prompt engineering, Chain-of-Thought (CoT), and Retrieval-Augmented Generation (RAG) rose to prominence. Models remained frozen while prompt templates iterated rapidly. The difficult problem of "recall" was partially converted to "recognition," but state remained ephemeral, and cross-step coordination remained fragile.
Harness Layer (2024–Present): Reliability now depends on external memory, tool registration, protocols, sandboxes, and orchestration. "Agent engineering is increasingly becoming Harness engineering"—a pattern followed by OpenHands, SWE-agent, Deep Research, and others.
All Roads Lead to Externalization: Memory, Skills, Protocols, and Harness
Looking back at recent technical advances in the Agent field, memory systems, skill systems, protocol standardization, and Harness engineering itself appear to be four independent research lines solving different problems. However, the paper points out that they are essentially doing the same thing: migrating specific layers of cognitive burden from inside the model to external structures. This is not a coincidence but an inevitable convergence for reliable Agent deployment. The intersection of these four routes is Externalization.
Memory externalizes state, turning "recall" into "retrieval" to solve continuity mismatches. Skills externalize professional expertise, turning "improvisation" into "composition and reuse" to solve consistency mismatches. Protocols externalize interaction structures, turning "temporary agreements" into "structured contracts" to solve coordination mismatches. Finally, Harness externalizes the Agent's cognitive environment itself: execution flows, sandboxes, observations, and permissions, which were previously implicit in every model call, are now explicitly extracted to become inspectable, configurable, and governable infrastructure.
Memory: Externalized State
The paper organizes Agent memory into four layers: Working Context (current task state, open files, partially completed plans), Situational Experience (past run records and failure trajectories), Semantic Knowledge (domain facts, user preferences, general heuristics), and Personalized Memory (specific user habits and constraints).
Memory architectures have evolved with demand: from monolithic systems stuffing all history into prompts, to retrieval-based systems with active state and external storage, to hierarchical architectures orchestrating by semantics or time, and finally to adaptive memory systems that dynamically adjust retrieval strategies based on feedback. The core effect remains the same: the model no longer needs to "recall" from weights but "retrieves" from persistent storage.
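The shift from "recall" to "retrieval" can be illustrated with a minimal sketch of a retrieval-based memory store. This is not the paper's implementation: the names MemoryStore and MemoryRecord, the layer labels, and the word-overlap scoring are all assumptions made for illustration, and a real system would likely use embeddings or a database instead.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    layer: str  # hypothetical labels: "working", "experience", "semantic", "personal"
    text: str


@dataclass
class MemoryStore:
    """Toy external store: state lives outside the model and is fetched on demand."""
    records: list[MemoryRecord] = field(default_factory=list)

    def write(self, layer: str, text: str) -> None:
        self.records.append(MemoryRecord(layer, text))

    def retrieve(self, query: str, k: int = 2) -> list[MemoryRecord]:
        # Score each record by the number of lowercase words it shares with the query.
        # This is a stand-in for real retrieval, chosen only to keep the sketch small.
        q = set(query.lower().split())
        scored = [
            (len(q & set(r.text.lower().split())), i, r)
            for i, r in enumerate(self.records)
        ]
        hits = [t for t in scored if t[0] > 0]
        hits.sort(key=lambda t: (-t[0], t[1]))  # best score first, then insertion order
        return [r for _, _, r in hits[:k]]


store = MemoryStore()
store.write("semantic", "project uses pytest for unit tests")
store.write("experience", "last run failed because tests were run before install")
store.write("personal", "user prefers small pull requests")

results = store.retrieve("how do we run tests", k=2)
for rec in results:
    print(rec.layer, "|", rec.text)
```

Only the records relevant to the current query enter the prompt; everything else stays in storage, which is the point of treating memory as externalized state.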
Skills: Externalized Professional Expertise
Skill systems package reusable procedural expertise into explicit artifacts. A complete skill comprises three components: Operational Procedures (task skeletons and decomposition steps), Decision Heuristics (local strategies for branching decisions), and Normative Constraints (compliance, safety, and operational boundaries).
There are four generation paths for skills: Manual Writing (experts hand-crafting instruction files like SKILL.md), Trajectory Distillation (extracting reusable programs from historical run records), Autonomous Discovery (Agents exploring and inducing in the environment, e.g., Voyager), and Compositional Construction (assembling high-level capabilities from existing low-level skills). Skills move from "discovery" to "execution" through stages of registration, progressive disclosure (expanding from summary to full detail on demand), and composition, finally binding to specific tools, APIs, and protocols at runtime.
The core effect: The model no longer needs to "improvise" workflows from scratch each time but "composes" them from pre-validated components.
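As a rough sketch of registration, progressive disclosure, and composition, the following toy registry exposes only summaries until a skill is loaded in full. The Skill and SkillRegistry classes and the example skills are hypothetical; the paper does not prescribe this structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    summary: str                       # short description shown in the catalog
    procedure: tuple[str, ...]         # operational steps
    constraints: tuple[str, ...] = ()  # normative constraints


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def catalog(self) -> list[str]:
        # Progressive disclosure, stage 1: expose only name and summary.
        return [f"{s.name}: {s.summary}" for s in self._skills.values()]

    def load(self, name: str) -> Skill:
        # Stage 2: expand one skill to full detail on demand.
        return self._skills[name]

    def compose(self, name: str, summary: str, parts: list[str]) -> Skill:
        # Compositional construction: chain the steps of existing skills
        # and merge their constraints without duplicates.
        steps: list[str] = []
        rules: list[str] = []
        for part in parts:
            skill = self._skills[part]
            steps.extend(skill.procedure)
            rules.extend(c for c in skill.constraints if c not in rules)
        combined = Skill(name, summary, tuple(steps), tuple(rules))
        self.register(combined)
        return combined


registry = SkillRegistry()
registry.register(Skill("run-tests", "Run the test suite",
                        ("install deps", "run tests"),
                        ("never skip failing tests",)))
registry.register(Skill("open-pr", "Open a pull request",
                        ("push branch", "create PR"),
                        ("never skip failing tests", "link the issue")))
ship = registry.compose("ship-feature", "Test, then open a PR", ["run-tests", "open-pr"])

print(registry.catalog())
print(ship.procedure)
```

The catalog stays small regardless of how long each procedure is, so the full text of a skill costs context only when the Agent actually selects it.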
Protocols: Externalized Interaction Structures
Protocols fix interaction structures into machine-readable contracts, externalizing four types of burdens: calling syntax (parameter formats and types), lifecycle semantics (state transitions and completion conditions), permission and trust boundaries (authorization rules), and discovery metadata (declarations of available capabilities).
The paper outlines three major protocol families:
Agent-Tool Protocols (e.g., MCP): Standardize tool discovery and invocation via JSON-RPC, enabling dynamic registration and modular expansion of tools.
Agent-Agent Protocols (e.g., A2A): Define structured semantics for task delegation, progress exchange, and capability discovery, supporting interoperability in an open Agent ecosystem.
Agent-User Protocols (e.g., AG-UI): Make runtime observable and portable through typed execution events and state flows, allowing user interfaces to track Agent behavior in real-time.
The core effect: Temporary agreements become structured contracts, making cross-system coordination governable rather than fragile.
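To show what a "structured contract" might look like in practice, here is a small sketch that checks a tool call against a declared contract and then emits a JSON-RPC 2.0 request. The TOOL_CONTRACTS table, the run_tests tool, and the build_call helper are hypothetical. The request layout is modeled loosely on MCP-style tool calls and should be checked against the protocol specification before being relied on.

```python
import json

# Hypothetical tool declaration: discovery metadata plus calling syntax.
TOOL_CONTRACTS = {
    "run_tests": {
        "description": "Run the project's test suite",
        "params": {"path": str, "verbose": bool},
    }
}


def build_call(request_id: int, tool: str, arguments: dict) -> str:
    """Validate arguments against the contract, then emit a JSON-RPC 2.0 request."""
    contract = TOOL_CONTRACTS.get(tool)
    if contract is None:
        raise ValueError(f"unknown tool: {tool}")
    expected = contract["params"]
    for name, value in arguments.items():
        if name not in expected:
            raise ValueError(f"unexpected parameter: {name}")
        if not isinstance(value, expected[name]):
            raise TypeError(f"{name} must be {expected[name].__name__}")
    missing = set(expected) - set(arguments)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # assumed method name, MCP-style
        "params": {"name": tool, "arguments": arguments},
    })


request = build_call(1, "run_tests", {"path": "tests/", "verbose": True})
print(request)
```

A malformed call, such as passing a number for path, is rejected before it reaches the tool, so format errors surface at the contract boundary instead of deep inside a call chain.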
Harness: The Unified Cognitive Environment
Harness externalizes the cognitive environment upon which the previous three dimensions rely. Execution flows, sandboxes, observations, and permissions, previously implicit in every model call, are explicitly extracted to become inspectable, configurable, and governable infrastructure. The Harness is both the runtime that hosts memory, skills, and protocols and the key to transforming the entire system from a "black box" into a "white box." The paper analyzes its composition across six design dimensions:
Agent Loop and Control Flow: The complete cycle of Perception-Retrieval-Planning-Execution-Observation, managing termination conditions, recursion boundaries, and resource consumption.
Sandboxing and Execution Isolation: Filesystem isolation, network restrictions, and cloud sandboxes serve as both security and cognitive boundaries.
Human Supervision and Approval Gating: Pre-execution approval, post-execution review, and escalation triggers, treating autonomy as a configurable parameter.
Observability and Structured Feedback: Structured logs of tool calls, tracing actions back to their antecedents, supporting debugging, auditing, and internal feedback loops.
Configuration, Permissions, and Policy Encoding: Three-level hierarchical constraints (User, Project, Organization) enforced at runtime via declarative rules.
Context Budget Management: Balancing competition for the context window among the three dimensions through history summarization, priority-driven content eviction, and staged skill loading.
These three dimensions form a self-reinforcing loop within the Harness: experience stored in memory is distilled into skills, and skill execution trajectories settle back into memory; protocols standardize how skills are invoked and write structured results back to persistent state; richer memory leads to better skills, better skills generate richer execution trajectories, and the cycle continues.
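Three of the Harness design dimensions described above (a bounded control loop, approval gating, and structured logging) can be sketched together in a few lines. The Harness class, its fields, and the scripted policy below are assumptions for illustration only; the policy function stands in for a model's decisions, and no real Agent framework's API is implied.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]
    needs_approval: set[str]             # tools gated behind human approval
    approve: Callable[[str, str], bool]  # hypothetical approval callback
    max_steps: int = 5                   # termination bound on the loop
    trace: list[dict] = field(default_factory=list)  # structured log

    def run(self, policy: Callable[[list[dict]], Optional[tuple[str, str]]]) -> list[dict]:
        for step in range(self.max_steps):
            action = policy(self.trace)  # stands in for the model's decision
            if action is None:           # the policy signals completion
                break
            tool, arg = action
            if tool in self.needs_approval and not self.approve(tool, arg):
                self.trace.append({"step": step, "tool": tool, "arg": arg,
                                   "status": "denied"})
                continue
            result = self.tools[tool](arg)
            self.trace.append({"step": step, "tool": tool, "arg": arg,
                               "status": "ok", "result": result})
        return self.trace


def scripted_policy(trace: list[dict]) -> Optional[tuple[str, str]]:
    # A fixed plan indexed by how many actions have been logged so far.
    plan = [("read", "README.md"), ("delete", "build/"), ("read", "setup.py")]
    return plan[len(trace)] if len(trace) < len(plan) else None


harness = Harness(
    tools={"read": lambda p: f"contents of {p}", "delete": lambda p: f"deleted {p}"},
    needs_approval={"delete"},
    approve=lambda tool, arg: False,     # deny every gated action in this demo
)
trace = harness.run(scripted_policy)
for entry in trace:
    print(entry)
```

The step limit, the approval rule, and the log format all live in the Harness, not in the prompt, so they can be inspected and changed without touching the model.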
A Scenario: Changing Only the "External Environment" Without Swapping the Model
Consider a software engineering Agent tasked with implementing a new feature, running tests, and submitting a PR in a large code repository. The paper uses this example to directly illustrate the significance of externalization.
Without Externalization: The model must cram the repository structure, project conventions, workflow state, and tool interactions into a fragile prompt window. A single error requires restarting the entire process. As task complexity increases, the management cost of prompt templates rises super-linearly.
With Externalization: Persistent project memory provides cross-session context; reusable skill documents encode project conventions and workflows; protocolized tool interfaces ensure call formats remain correct; and the Harness handles step ordering, output validation, and failure recovery.
The base model can remain completely unchanged; what changes is the representation of the task it faces. This is the core argument of the entire paper: The improvement in Agent reliability comes less from stronger reasoners and more from better-organized cognitive systems. The question for evaluating an Agent system shifts from "How strong is the model?" to "Which burdens have been externalized so the model no longer needs to solve them from scratch every time?"
Future Directions
The paper concludes by pointing out six frontier directions:
Expansion of Externalization Boundaries: Planning goals, verification logic, and orchestration strategies themselves are becoming Harness objects, not just content executed by the Harness.
From Digital to Embodied: Embodied Agents are following the same externalization pattern. The separation between high-level planners and low-latency execution modules mirrors externalization logic in physical systems.
Self-Evolving Harness: Using reinforcement learning, program synthesis, or imitation learning to automatically update infrastructure offers broad prospects but simultaneously amplifies governance risks.
Safety and Governance: Novel attack surfaces such as memory poisoning, malicious skill injection, and protocol deception warrant specific attention. Mandatory review gating and provenance tracing are essential safeguards for mature systems.
Shared Infrastructure and Multi-Agent Ecosystems: When memory, skills, and protocols can be shared across Agents, collective learning and division of labor become possible, though this brings governance challenges like infrastructure drift.
Evaluation of Externalization: Existing benchmarks severely lack metrics for infrastructure contributions. New dimensions such as transferability, maintainability, and context efficiency need to be established.
From memory to skills, to protocols, and finally to Harness, the value of this review lies not in listing technical details but in providing a system-level explanatory framework. To summarize in one sentence: Better Agents are not just better reasoners; they are better-organized cognitive systems.