Your Agent Isn't Really Learning—It's Just Flipping Through a Notebook

MemGPT, RAG, Reflexion, Voyager—virtually every mainstream agent framework today treats "retrieval" as the default memory mechanism. A Reflexion agent accumulates thousands of self-reflections, which sounds like "growth," but every time a new session starts, the underlying model's weights remain untouched. Its filing cabinet gets bigger, yet its capabilities see no improvement whatsoever.

This paper from the Chinese University of Hong Kong and Zhejiang University characterizes this phenomenon as a "category error": mistaking a memo for a brain. Drawing on cognitive science, formal proof, and security analysis, it argues one core point: all currently deployed agent memory is exemplar-based lookup, not true (weight-based) memory. This confusion imposes provable capability ceilings, a perpetual "frozen novice" dilemma, and escalating security risks.

Memos and Brains: Two Structurally Different Paths

The paper proposes a concise analytical framework: all techniques that alter the output of an LLM agent essentially follow only two paths.

(1) Changing θ: Modifying model weights via gradient updates (pretraining, fine-tuning, reinforcement learning, etc.) changes the prior distribution P(X|θ). Knowledge is compressed into weight space, allowing the model to recombine rules to handle unseen inputs. This path is "generative."

(2) Changing C: Injecting content into the context window through prompts, RAG, MCP tool calls, scratchpads, etc. leaves θ untouched and conditions generation on that content, giving P(X|θ, C). Knowledge is held as text, with capacity limited by the context length L, and the model can only use content explicitly present in the context. This path is "retrieval-based."

All currently deployed agent memory is C-engineering. MemGPT, RAG, Reflexion, Voyager—without exception. Before and after an agent's experiences, the model weights remain identical.

[Table 1: LLM Agent Memory Classification] The paper classifies memory into four types: Working (context window, current session only), Episodic (external storage, cross-session, exemplar-based generalization), Semantic (model weights, permanent, rule-based generalization), and Experiential (model weights, permanent, updated via fine-tuning/continual learning). All current agent systems occupy only the Episodic row; the Experiential row is systematically missing in deployed systems.

[Figure: Memory types classified as working, episodic, semantic, and experiential; current agents use only episodic memory.]

The paper offers an analogy: a person who records lessons in a diary can recall them only by flipping through it, and only when the current situation is sufficiently similar to an entry, whereas someone who has truly internalized those lessons can draw upon them anytime, anywhere. Current agent memory is the former.

A Provable Generalization Ceiling: Retrieval Can Never Match Weight Learning

The paper's core theoretical contribution is a "Compositional Sample Complexity Separation Theorem." Suppose a domain has k base concepts, and a composition operator ⊕ maps concept pairs to the correct output. The paper proves that:

For a retrieval system to achieve 1−δ accuracy on novel compositional tasks, it must store n_R = Ω(k²) compositional examples, which amounts to near-exhaustive coverage of all concept pairs. In contrast, a parametric system (with fine-tuned weights) needs only n_P = O(d/δ) examples, where d is the VC dimension of the composition operator class. For a fixed δ, the sample complexity ratio n_R/n_P is therefore Ω(k²/d). When d = O(k), the ratio is Ω(k); when d = O(1), it grows to Ω(k²).
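The separation can be illustrated with a toy experiment. What follows is a minimal sketch under stated assumptions, not the paper's construction: concepts are the integers 0 to k−1, the composition operator is plain addition (a hypothetical choice), retrieval is exact-match lookup (so its accuracy on unseen pairs is 0 here, whereas the theorem only assumes it is below 1), and the parametric learner is a two-weight linear model fit by least squares. All names are illustrative.

```python
import itertools

def compose(a, b):
    """Hypothetical composition operator (plain addition), chosen for illustration."""
    return a + b

k = 30
all_pairs = list(itertools.product(range(k), repeat=2))  # k**2 = 900 concept pairs

# A small, arbitrary set of 20 experienced pairs (about 2% of all pairs).
train_pairs = [(i, (7 * i + 3) % k) for i in range(20)]
seen = set(train_pairs)
unseen_pairs = [p for p in all_pairs if p not in seen]

# Retrieval: a memo of stored examples, answered by exact-match lookup.
memo = {p: compose(*p) for p in train_pairs}
retrieval_acc = sum(
    memo.get(p) == compose(*p) for p in unseen_pairs
) / len(unseen_pairs)

# Parametric: fit f(a, b) = w_a * a + w_b * b by least squares,
# solving the 2x2 normal equations by hand.
s_aa = sum(a * a for a, b in train_pairs)
s_bb = sum(b * b for a, b in train_pairs)
s_ab = sum(a * b for a, b in train_pairs)
s_ay = sum(a * compose(a, b) for a, b in train_pairs)
s_by = sum(b * compose(a, b) for a, b in train_pairs)
det = s_aa * s_bb - s_ab * s_ab
w_a = (s_ay * s_bb - s_by * s_ab) / det
w_b = (s_by * s_aa - s_ay * s_ab) / det
parametric_acc = sum(
    round(w_a * a + w_b * b) == compose(a, b) for a, b in unseen_pairs
) / len(unseen_pairs)

print(f"stored {len(train_pairs)} of {len(all_pairs)} pairs")
print(f"retrieval accuracy on unseen pairs:  {retrieval_acc:.2f}")
print(f"parametric accuracy on unseen pairs: {parametric_acc:.2f}")
```

Under these assumptions the memo answers none of the 880 unseen pairs, while the fitted model answers all of them from the same 20 examples. The lookup would need to store nearly all 900 pairs to match it, which is the k² coverage requirement in miniature.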

A key assumption of this theorem is that a frozen model, given K context examples, has an accuracy α̅ < 1 on unseen compositional pairs. The paper proves in its appendix via Fano's Inequality that as long as the complexity of the operator class satisfies log|H| > K·log|Y|, this assumption is itself a theorem, not merely an empirical guess.
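For readers who want the shape of the argument, here is one standard form of Fano's inequality applied to this setting. It is a sketch under assumptions that are not taken from the paper: the operator h is drawn uniformly from a finite class H, and the K context examples have inputs chosen independently of h with labels in Y. It bounds the probability of misidentifying the operator, which is a step toward the paper's accuracy statement rather than the statement itself, and the paper's exact formulation may differ.

```latex
% Sketch only. \mathcal{H} and \mathcal{Y} correspond to H and Y in the text;
% D = ((x_1, y_1), \dots, (x_K, y_K)) denotes the K context examples.
\begin{align*}
P(\hat{h} \neq h)
  &\ge 1 - \frac{I(h; D) + \log 2}{\log |\mathcal{H}|}
  && \text{(Fano's inequality, } h \text{ uniform on } \mathcal{H}\text{)} \\
I(h; D)
  &\le K \log |\mathcal{Y}|
  && \text{(} K \text{ labels, each taking at most } |\mathcal{Y}| \text{ values)} \\
\Longrightarrow \quad P(\hat{h} \neq h)
  &\ge 1 - \frac{K \log |\mathcal{Y}| + \log 2}{\log |\mathcal{H}|}
\end{align*}
```

In this form the bound is positive once log|H| exceeds K·log|Y| by more than the additive log 2 term, which is the same regime as the condition quoted above: the operator class is too rich for K examples to pin down.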

Increasing the context window cannot eliminate this gap. A longer context can only marginally improve α̅, but the Ω(k²) coverage requirement persists. Experiments by Paulsen [2026] show that even when a model supports a 128k token context, the effective utilization saturates around 20k tokens.

Empirical evidence aligns with these predictions. Ovadia et al. [2024] found that RAG excels at recalling rare entities but cannot improve compositional reasoning beyond the base model's capabilities. Yao et al. [2026] directly compared storing reflective experiences externally with encoding them into weights: parametric storage outperformed external storage across the board, and the gap widened precisely when transferring to unseen problem types, as the theorem predicts.

The Frozen Novice and Escalating Security Vulnerabilities

The generalization ceiling is a static limitation, while the "Frozen Novice Problem" describes the dynamic consequences: each session starts from the same frozen weights, so the agent forever executes .predict(C) and never executes .train(). One of the most robust findings in cognitive science is that the difference between experts and novices lies not in how many cases they've stored, but in the structural reorganization of their knowledge representation—physics novices categorize problems by surface features ("inclined plane problem"), while experts categorize them by deep principles ("energy conservation problem"). A pure retrieval agent can never complete this reorganization.

The security dimension is equally severe. Without persistent memory, a single prompt injection only affects the current session. With agent memory, injected content is written to storage and retrieved in every subsequent session, so a one-time attack becomes a permanent compromise. The MINJA attack achieved a 98.2% injection success rate, with injected instructions persisting across sessions while having minimal impact on normal functionality; PoisonedRAG demonstrated that just 5 adversarial texts per target query can achieve a 90% attack success rate on a knowledge base containing millions of entries. If each interaction is compromised independently with probability p₀ > 0, then as the number of interactions N(t) grows, P(compromised by t) = 1 − (1 − p₀)^(N(t)) approaches 1.
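To get a feel for how quickly this probability grows, the formula can be evaluated directly. The sketch below assumes interactions are independent and share the same per-interaction compromise probability; the value p0 = 0.001 and both function names are hypothetical and chosen only for illustration.

```python
import math

def p_compromised(p0: float, n_interactions: int) -> float:
    """P(compromised) = 1 - (1 - p0)^N, assuming independent interactions."""
    return 1.0 - (1.0 - p0) ** n_interactions

def interactions_until(p0: float, target: float) -> int:
    """Smallest N for which P(compromised) reaches the target probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p0))

p0 = 0.001  # hypothetical per-interaction compromise probability
for n in (100, 1_000, 10_000):
    print(f"N = {n:>6}: P(compromised) = {p_compromised(p0, n):.3f}")
print("interactions until P >= 0.5:", interactions_until(p0, 0.5))
```

Even with a one-in-a-thousand chance per interaction, the cumulative risk passes 50% after several hundred interactions, which is why persistence changes the threat model rather than merely extending it.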

The Way Out: Building a "Consolidation Pathway"

The paper's prescription isn't to discard retrieval, but to add the missing other half. Biological intelligence's solution is described by Complementary Learning Systems (CLS) theory: the hippocampus handles rapid episodic storage, while the neocortex consolidates those episodes into slow, distributed, rule-based representations during sleep. Current AI agents have implemented only the hippocampal half.

The paper issues clear demands to three communities.

For system builders: a consolidation pathway from episodic storage to model weights must be established. Specific mechanisms can include periodic fine-tuning, knowledge editing, test-time training, or self-distillation from traces; the key is that this pathway exists and operates asynchronously. The necessary building blocks (LoRA, SSR, MEMIT, TTT layers) already exist; this is a design choice, not a feasibility barrier.

For benchmark designers: they should measure CGT (Compositional Generalization over Time), that is, whether an agent's ability to handle new concept combinations improves with experience. A pure retrieval agent's performance on this metric should be flat.

For the continual learning community: the agent scenario happens to provide everything continual learning methods need, namely a natural stream of experiences with reward labels, clear generalization criteria, and immediate practical value.
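A consolidation pathway and a CGT-style measurement can be sketched together in a few lines. This is a toy under loud assumptions, not the paper's design: the agent's "weights" are two numbers, the composition rule 2a + b is hypothetical, replaying stored episodes with gradient steps stands in for mechanisms such as periodic fine-tuning, consolidation runs synchronously here rather than asynchronously, and computing CGT as held-out accuracy over time is one plausible reading of the metric. ToyAgent and cgt_curve are illustrative names.

```python
from dataclasses import dataclass, field
from itertools import permutations

def compose(a, b):
    """Hypothetical ground-truth composition rule the agent must learn."""
    return 2 * a + b

@dataclass
class ToyAgent:
    consolidate_every: int = 0  # 0 means retrieval only (weights never change)
    weights: list = field(default_factory=lambda: [0.0, 0.0])
    episodes: dict = field(default_factory=dict)  # episodic store: (a, b) -> y

    def act(self, a, b):
        if (a, b) in self.episodes:  # exact-match retrieval from the memo
            return self.episodes[(a, b)]
        return round(self.weights[0] * a + self.weights[1] * b)

    def observe(self, a, b, y):
        self.episodes[(a, b)] = y
        if self.consolidate_every and len(self.episodes) % self.consolidate_every == 0:
            self.consolidate()

    def consolidate(self, lr=0.01, epochs=500):
        # Replay stored episodes and take gradient steps on squared error.
        for _ in range(epochs):
            for (a, b), y in self.episodes.items():
                err = self.weights[0] * a + self.weights[1] * b - y
                self.weights[0] -= lr * err * a
                self.weights[1] -= lr * err * b

def cgt_curve(agent, stream, held_out, every=5):
    """Held-out compositional accuracy, measured before and during the stream."""
    def accuracy():
        hits = sum(agent.act(a, b) == compose(a, b) for a, b in held_out)
        return hits / len(held_out)

    curve = [accuracy()]
    for i, (a, b) in enumerate(stream, start=1):
        agent.observe(a, b, compose(a, b))
        if i % every == 0:
            curve.append(accuracy())
    return curve

concepts = range(1, 6)
stream = list(permutations(concepts, 2))  # experienced pairs, always a != b
held_out = [(c, c) for c in concepts]     # combinations never experienced

retrieval_curve = cgt_curve(ToyAgent(), stream, held_out)
consolidating_curve = cgt_curve(ToyAgent(consolidate_every=5), stream, held_out)
print("retrieval only:    ", retrieval_curve)
print("with consolidation:", consolidating_curve)
```

In this toy the retrieval-only curve stays flat at zero however many episodes are stored, while the consolidating agent's curve rises once experience has been folded into its weights. A real system would replace the gradient replay with one of the mechanisms named above and run it off the request path.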

Agents haven't gotten better memories; they've just gotten better filing systems. Better filing doesn't make an expert; weight consolidation does.

📄 Original Title

Contextual Agentic Memory is a Memo, Not True Memory

🔗 Original Link

https://arxiv.org/abs/2604.27707
