A Unified Review of Agents: Harness, Memory, Skills, and Protocols

Hello everyone, I am PaperAgent, not just an Agent!

Reliable Agent capabilities stem not only from internal model parameter weights but, more importantly, from externalizing cognitive burdens into structured infrastructure.

Recently, researchers from Shanghai Jiao Tong University, Sun Yat-sen University, Carnegie Mellon University, and other institutions published a comprehensive survey on externalization in LLM Agents: a unified review of Memory, Skills, Protocols, and Harness Engineering.


Borrowing from the theory of Cognitive Artifacts: the importance of Agent infrastructure lies not merely in adding auxiliary components, but in transforming difficult cognitive burdens into forms that models can handle more reliably.

Figure 1: Externalization as an organizing principle for LLM Agent design

2. From Weights to Context to Harness: The Three Migrations of Capability

This section illustrates how research focus has shifted from 2022 to 2026: from Weights (pre-training, Scaling Law) to Context (RAG, long context), and finally to Harness (MCP tool ecosystems, safety, multi-Agent collaboration).

Figure 2: Evolution of community themes across three capability levels

2.1 The Era of Weights: Limitations of Intrinsic Knowledge

Early modern LLM deployments relied almost entirely on model parameters. Pre-training compressed statistical patterns, world knowledge, and reasoning habits into weights. Scaling Laws revealed predictable relationships between parameter scale and performance.

Limitations: Difficulty in updating knowledge (requires retraining), difficulty in auditing (knowledge dispersed across billions of parameters), and lack of personalization (one set of weights serves millions of users without distinction).

2.2 The Era of Context: The Rise of Prompt Engineering

Capabilities began shifting from inside the model to input design. Techniques such as few-shot examples, Chain-of-Thought, and RAG (Retrieval-Augmented Generation) proved that model behavior can be significantly altered through carefully designed context without modifying weights.


Key Shift: Transforming the difficult problem of "recall" (where the model must recover knowledge from parameters) into the simpler problem of "recognition" (where the model only needs to use provided context).
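
The recall-to-recognition shift can be made concrete with a minimal retrieval sketch: rather than asking the model to remember a fact, candidate passages are fetched and placed in the prompt. The function names (`retrieve`, `build_prompt`) and the word-overlap scorer are illustrative stand-ins for a real embedding-based retriever, not any specific RAG library.

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query
    (a toy stand-in for an embedding-based retriever)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Turn a recall problem into a recognition problem: the model
    only reads the provided context instead of recovering it from weights."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

docs = [
    "MemGPT introduces hierarchical memory with paging between context tiers.",
    "Scaling laws relate parameter count to loss.",
    "Toolformer teaches models to invoke single tools.",
]
prompt = build_prompt("What does MemGPT introduce?", docs)
```

The model's task is now bounded: verify and use what is in `Context`, a strictly easier problem than open-ended parametric recall.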

2.3 The Era of Harness: Infrastructure as Capability

As context windows saturate and prompt templates become cumbersome, engineering attention is turning to: "In what environment should the model operate?"

Figure 3: Externalized architecture of a Harnessed LLM Agent

The Harness layer includes: persistent memory storage, tool registries, protocol definitions, sandboxes, sub-Agent orchestration, evaluators, etc. Reliability is increasingly solved by changing the environment rather than prompting the model.

3. Externalizing State: Memory Systems

Memory externalization addresses the burden of an Agent's temporal continuity. Native LLMs are "stateless generators": every call is a fresh context, and continuity must be reconstructed in the prompt.

Figure 4: Memory as Externalized State. The figure illustrates the transition from raw context to memory content, showcasing four memory-system architectures: monolithic context, retrieval storage, hierarchical orchestration (extraction, consolidation, forgetting, cold/hot swapping), and adaptive memory systems (dynamic modules, feedback-based policy optimization).

Architectural Evolution:

  • Monolithic Context: All history is retained in the prompt (simple but capacity-limited).

  • Context + Retrieval Storage: Proximal state in context, long-term trajectory stored externally (RAG mode).

  • Hierarchical Memory and Orchestration: Introduces explicit extraction, consolidation, and forgetting operations (e.g., MemGPT, Memory OS).

  • Adaptive Memory Systems: Modules and retrieval strategies adapt based on accumulated experience (e.g., MemEvolve, MemRL).

Cognitive Artifact Perspective: Memory systems transform "unbounded recall" into "bounded, curated retrieval," altering the task structure the model faces at every decision point.

4. Externalizing Expertise: Skill Systems

Skill externalization addresses procedural burdens. A model might "know" how to complete a task, but reliable execution requires rebuilding workflows, defaults, and constraints on every run, which introduces variance: missed steps, unstable tool use, and inconsistent termination conditions.

4.1 Three Components of Skills

  1. Operational Procedure: Task skeleton (step decomposition, stages, dependencies, stop conditions).

  2. Decision Heuristics: Practical rules of thumb at branch points (what to try first, when to exit).

  3. Normative Constraints: Boundaries of acceptability (test requirements, scope limits, access control).
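
The three components above can be packaged as a single structured artifact. The following sketch uses illustrative field names, not any standard skill schema:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A skill as packaged expertise: procedure + heuristics + constraints."""
    name: str
    procedure: list[str]        # operational procedure: ordered steps, stop conditions
    heuristics: dict[str, str]  # decision heuristics keyed by branch point
    constraints: list[str] = field(default_factory=list)  # normative boundaries

deploy = Skill(
    name="deploy-service",
    procedure=["run tests", "build image", "push", "roll out", "verify health"],
    heuristics={"tests fail": "stop and report; do not retry the rollout"},
    constraints=["staging before production", "rollout requires approval"],
)
```

Because the procedure, heuristics, and constraints live in the artifact rather than in the model's improvisation, every run of `deploy-service` starts from the same skeleton.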

4.2 From Execution Primitives to Capability Packs

Skill systems have undergone three stages:

  • Stage 1: Atomic Execution Primitives (e.g., Toolformer) – Stable invocation of single tools.

  • Stage 2: Large-scale Primitive Selection (e.g., Gorilla, ToolLLM) – Retrieving and selecting from a vast array of tools.

  • Stage 3: Skills as Packaged Expertise – Packaging operational methods for task categories into reusable units.


Figure 5: Skills as Externalized Expertise. The figure illustrates the complete lifecycle of skills: from acquisition (expert writing, distillation from episodic memory, discovery via environmental exploration, combination of existing units) to skill artifacts (operational procedures, decision heuristics, normative constraints), then to activation pipelines (registry discovery, progressive disclosure, composition), and finally runtime execution.

Key Mechanisms:

  • Progressive Disclosure: Instead of loading full skill documentation at once, it is exposed in layers (name → summary → full guide).

  • Execution Binding: Skills must be bound to executable actions (tools, APIs, files, sub-Agents) via protocol interfaces.

  • Compositionality: Skills can participate in higher-order coordination (serial, parallel, conditional routing, recursive calls).
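
Progressive disclosure in particular is easy to sketch: a registry exposes each skill in layers (name → summary → full guide) so that only the needed level of detail enters the context window. The method names below are illustrative, not a specific framework's API.

```python
class SkillRegistry:
    """Toy registry implementing layered (progressive) disclosure."""

    def __init__(self):
        self._skills: dict[str, dict[str, str]] = {}

    def register(self, name: str, summary: str, guide: str) -> None:
        self._skills[name] = {"summary": summary, "guide": guide}

    def list_names(self) -> list[str]:
        """Layer 1: cheapest view, scanned to shortlist candidates."""
        return sorted(self._skills)

    def summary(self, name: str) -> str:
        """Layer 2: one-line summaries for the shortlisted skills."""
        return self._skills[name]["summary"]

    def guide(self, name: str) -> str:
        """Layer 3: the full guide, loaded only for the selected skill."""
        return self._skills[name]["guide"]

reg = SkillRegistry()
reg.register(
    "git-bisect",
    "Locate the commit that introduced a regression.",
    "1. git bisect start  2. mark known good/bad commits  3. run tests at each step",
)
```

An orchestrator would scan `list_names()`, pull summaries for a few candidates, and pay the token cost of `guide()` for exactly one skill.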

5. Externalizing Interaction: Protocol Systems

Protocol externalization addresses coordination burdens. A raw model might infer that it should call a tool or delegate to a sub-Agent, but without explicit contracts, it must improvise message formats, parameter structures, lifecycle semantics, and recovery behaviors.

5.1 Content Dimensions of Protocols

Protocols externalize the following four dimensions:

  1. Invocation Grammar: Parameter names, types, order, return structure (schema-fication).

  2. Lifecycle Semantics: Coordination rules for multi-step interactions (state machines, event flows).

  3. Permissions and Trust Boundaries: Authorization rules, data flow directions, audit requirements.

  4. Discovery Metadata: Capability registries, capability cards, schema endpoints.
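
The first dimension, invocation grammar, is the easiest to illustrate: a tool call is validated against an explicit schema instead of an improvised format. The schema shape below loosely mirrors JSON-Schema-style tool descriptors; it is a sketch, not any specific protocol's wire format.

```python
# Illustrative tool descriptor: parameter names, types, and required fields
# are declared up front (schema-fication of the invocation grammar).
TOOL_SCHEMA = {
    "name": "search_files",
    "params": {
        "pattern": str,
        "max_results": int,
    },
    "required": ["pattern"],
}

def validate_call(schema: dict, call: dict) -> list[str]:
    """Check a proposed call against the schema; empty list means it conforms."""
    errors = []
    for p in schema["required"]:
        if p not in call:
            errors.append(f"missing required param: {p}")
    for p, value in call.items():
        expected = schema["params"].get(p)
        if expected is None:
            errors.append(f"unknown param: {p}")
        elif not isinstance(value, expected):
            errors.append(f"param {p}: expected {expected.__name__}")
    return errors

ok = validate_call(TOOL_SCHEMA, {"pattern": "*.py", "max_results": 5})
bad = validate_call(TOOL_SCHEMA, {"max_results": "ten"})
```

The harness can reject or repair a malformed call before it ever reaches a tool, which is exactly what the model would otherwise have to get right by improvisation.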


Figure 6: Protocols as Externalized Interaction. Top: evolution from isolated model calls → hard-coded APIs → standardized protocols → the Agent Web. Bottom: the Harness achieves externalized interaction management through three functional interfaces: Interact (with external APIs/tools), Perceive (sensing environment, context, memory, feedback), and Collaborate (with other LLMs, Agents, and humans).

5.2 Overview of Protocol Families


6. Unified Externalization: Harness Engineering

The Harness is the engineering layer that carries the three externalization dimensions (Memory, Skills, Protocols), providing orchestration logic, constraints, observability, and feedback loops to make externalized cognition reliable in practice.

6.1 What is a Harness?

A Harness is not a fourth externalization dimension outside the model, but rather the runtime environment—within which the model operates, perceives, decides, and acts.


Figure 3: Externalized architecture of a Harnessed LLM Agent. The Harness is at the center; three externalization dimensions revolve around it: Memory (working context, semantic knowledge, episodic experience, personalized memory), Skills (operational procedures, decision heuristics, normative constraints), and Protocols (Agent-User, Agent-Agent, Agent-Tool). Operational elements (sandboxes, observability, compression, evaluation, approval loops, sub-Agent orchestration) regulate the interaction between the Harness core and the externalized modules.

6.2 Six Analytical Dimensions of Harness Design


Figure 7: Harness as a Cognitive Environment

The base model (Agent core) is at the center; six Harness dimensions form a coordination ring: Memory (state persistence), Skills (reusable routines), Protocols (deterministic interfaces), Permissions (sandboxes, file isolation), Control (recursive boundaries, cost caps), and Observability (structured logs, execution traces).
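
The Permissions and Control dimensions can be sketched as explicit harness configuration: recursion bounds, cost caps, and sandbox scopes live in the environment and are enforced before every action, independent of model output. All field and function names here are illustrative assumptions, not a real framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class HarnessConfig:
    """Illustrative harness-level limits (Control and Permissions dimensions)."""
    max_depth: int = 3                  # control: recursive sub-agent boundary
    max_cost_usd: float = 5.0           # control: hard budget cap
    writable_paths: list[str] = field(default_factory=lambda: ["/workspace"])
    require_approval: list[str] = field(default_factory=lambda: ["shell", "network"])

def allowed(cfg: HarnessConfig, action: str, depth: int = 0, cost: float = 0.0) -> bool:
    """Gate applied by the harness before every action, regardless of what
    the model proposed; gated actions fall back to an approval loop."""
    if depth > cfg.max_depth or cost > cfg.max_cost_usd:
        return False
    return action not in cfg.require_approval

cfg = HarnessConfig()
```

Reliability here comes from the environment rejecting out-of-bounds actions, not from prompting the model to behave.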

6.3 Harness as a Cognitive Environment

From the perspective of distributed cognition theory, the Harness is not just software infrastructure; it is the environment that shapes effective Agent cognition. It determines what enters the perceptual field, what is retained across sessions, which operations are callable, which actions require approval, and which intermediate states are revisable.

The Harness transforms unbounded tasks into structured environments, redistributing cognitive workload by externalizing memory, formalizing procedures, and introducing explicit control points and constraint enforcement.

7. Cross-Analysis: Inter-Module Coupling

The three externalization modules are not isolated within the Harness but form six key interaction flows:

Figure 8: Coupling between Memory, Skills, and Protocols
Paper: Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering (https://arxiv.org/pdf/2604.08224)
