When agents handle complex software engineering tasks, they often run into an awkward dilemma: as the interaction history grows, computational cost explodes, response latency climbs, and the model's reasoning degrades because it is distracted by stale, irrelevant error output from earlier steps. This phenomenon is known as "Context Bloat."
The paper's key insight: most existing solutions rely on passive, external summarization mechanisms, where the agent itself cannot control when to compress or what to compress. Can we instead let the agent actively manage its own "memory," the way a biological organism does?
Inspiration from Slime Mold Foraging
The paper draws inspiration from a slime mold called Physarum polycephalum. When exploring its environment, this organism physically retracts from dead ends while leaving chemical markers to avoid repeated exploration. Biological systems do not retain a complete record of every muscle movement during maze navigation; they only retain the "learned map."
Similarly, an agent exploring a codebase does not need to remember the 50 lines of ls -R output from ten minutes ago; it only needs the conclusion that "the configuration file is not in the /src directory."
Based on this analogy, the paper proposes the Focus Agent architecture. This architecture introduces two core primitives: start_focus and complete_focus. The key point is that the agent has full autonomy over when to call these tools—no external timer or heuristic rule forces compression.
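The paper does not publish its tool schemas, but the two primitives might be declared as follows, in the JSON-schema style common to LLM tool-calling APIs. All field names and descriptions here are illustrative, not taken from the paper:

```python
# Hypothetical declarations of the two Focus primitives. The schema layout
# (name / description / input_schema) mirrors common tool-calling APIs;
# the exact fields the paper used are not published.
FOCUS_TOOLS = [
    {
        "name": "start_focus",
        "description": "Declare a new investigation and mark a checkpoint "
                       "in the conversation history at the current message.",
        "input_schema": {
            "type": "object",
            "properties": {
                "goal": {
                    "type": "string",
                    "description": "What is being investigated, e.g. "
                                   "'debugging database connection'",
                },
            },
            "required": ["goal"],
        },
    },
    {
        "name": "complete_focus",
        "description": "Summarize this focus stage (attempts, learned facts, "
                       "file paths, bugs, outcome) and retract all messages "
                       "since the matching start_focus checkpoint.",
        "input_schema": {
            "type": "object",
            "properties": {
                "summary": {
                    "type": "string",
                    "description": "Durable knowledge worth keeping",
                },
            },
            "required": ["summary"],
        },
    },
]
```

Because both primitives are ordinary tools, the model decides to call them the same way it decides to run bash or edit a file, which is exactly what gives it autonomy over the compression cycle.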
The Four Stages of the Focus Cycle
The workflow of the Focus architecture consists of four stages:
(1) Start Focus: The agent declares what it is investigating (e.g., "debugging database connection"), which marks a checkpoint in the conversation history.
(2) Explore: The agent uses standard tools (read, edit, run) to perform work.
(3) Integrate: When the agent naturally completes a subtask or encounters a dead end, it decides to call complete_focus, generating a summary that includes what was attempted, what was learned (facts, file paths, bugs), and the results.
(4) Retract: The system appends the summary to the persistent "knowledge" block at the top of the context and deletes all messages between the checkpoint and the current step.
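The checkpoint-and-retract mechanics of the four stages can be sketched in a few lines. This is an assumed implementation (class and method names are ours, not the paper's), but it captures the bookkeeping: start_focus records an index, complete_focus moves the summary into the persistent knowledge block and truncates everything after the checkpoint.

```python
class FocusContext:
    """Minimal sketch of the four-stage Focus cycle (assumed implementation;
    the paper does not publish its code)."""

    def __init__(self):
        self.knowledge = []      # persistent block, pinned at the top of context
        self.messages = []       # ordinary conversation history
        self._checkpoint = None  # index recorded by start_focus

    def start_focus(self, goal):
        # Stage 1: declare the investigation and mark a checkpoint.
        self.messages.append(f"[focus] {goal}")
        self._checkpoint = len(self.messages)

    def add_message(self, msg):
        # Stage 2: ordinary exploration (tool calls, outputs).
        self.messages.append(msg)

    def complete_focus(self, summary):
        # Stage 3 (integrate): keep the summary in the knowledge block.
        self.knowledge.append(summary)
        # Stage 4 (retract): drop everything from the focus marker onward.
        self.messages = self.messages[: self._checkpoint - 1]
        self._checkpoint = None

    def render(self):
        # What the model sees on the next call: knowledge first, then history.
        return ["KNOWLEDGE: " + k for k in self.knowledge] + self.messages
```

After a complete_focus, the next model call sees only the knowledge block plus whatever preceded the focus stage, which is what produces the sawtooth pattern described below.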
[Figure 2: Conceptual sawtooth pattern of context growth]
The paper shows that Focus (blue) exhibits periodic compression (decreases), while Baseline (red) grows monotonically. Through aggressive prompting, Focus compresses every 10-15 tool calls, preventing context bloat while retaining learned results in the persistent knowledge block.
This design transforms the context from a monotonically increasing log into a "sawtooth" pattern—growing during exploration and contracting during integration. The model controls this cycle based on task structure, rather than arbitrary step counts.
Experimental Setup and Optimized Scaffolding
The paper evaluates the Focus architecture on SWE-bench Lite, a benchmark in which software engineering agents must resolve real GitHub issues. It uses the claude-haiku-4-5-20251001 model and runs controlled A/B comparisons on N=5 context-intensive instances.
Following the SWE-bench best practices reported by Anthropic, the paper implements a minimal dual-tool scaffolding: Persistent Bash (a stateful shell session with working directory and environment persisting across calls) and String Replacement Editor (target file editing via precise string replacement, avoiding error-prone full-file rewrites).
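The dual-tool scaffolding could be sketched as below. The sentinel-based shell protocol and the "exactly once" replacement check are our assumptions about the design, not details from the paper:

```python
import subprocess

class PersistentBash:
    """Sketch of a stateful shell: one long-lived bash process; commands go
    to its stdin and output is read back up to a sentinel line. (Assumed
    design; the paper's scaffolding details may differ.)"""

    SENTINEL = "__CMD_DONE__"

    def __init__(self):
        self.proc = subprocess.Popen(
            ["bash"], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1)

    def run(self, cmd):
        # Working directory and environment persist because the same bash
        # process serves every call.
        self.proc.stdin.write(cmd + f"\necho {self.SENTINEL}\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line or line.strip() == self.SENTINEL:
                break
            lines.append(line)
        return "".join(lines)


def str_replace(path, old, new):
    """String-replacement editor: refuse to edit unless `old` occurs exactly
    once, so an edit can never silently land in the wrong place."""
    text = open(path).read()
    if text.count(old) != 1:
        raise ValueError(f"`old` must occur exactly once, found {text.count(old)}")
    open(path, "w").write(text.replace(old, new))
```

The exactly-once check is what makes string replacement safer than full-file rewrites: an ambiguous or stale target string fails loudly instead of corrupting the file.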
Initial experiments showed that passive Focus prompts produced only 1-2 compressions per task, saving a mere 6% of tokens. The paper therefore revised the Focus prompt to be more directive: it enforces the workflow requirement to "always call start_focus before any exploration... always call after 10-15 tool calls"; injects a reminder after 15 tool calls without a compression; and explicitly guides the agent toward 4-6 focus stages (explore → understand → implement → verify).
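The reminder mechanism might be as simple as a counter reset on every compression. The 15-call threshold is from the paper; the mechanism and wording below are our assumptions:

```python
# Hypothetical nudge injected by the scaffolding when the agent explores
# too long without compressing. The threshold (15) is from the paper; the
# reminder text and reset logic are illustrative.
REMINDER = ("You have made many tool calls without compressing. "
            "Call complete_focus to summarize what you learned and retract "
            "the exploration messages.")

class CompressionNudger:
    def __init__(self, threshold=15):
        self.threshold = threshold
        self.calls_since_compress = 0

    def on_tool_call(self, tool_name):
        """Return a reminder message to append to the context, or None."""
        if tool_name == "complete_focus":
            self.calls_since_compress = 0
            return None
        self.calls_since_compress += 1
        if self.calls_since_compress >= self.threshold:
            return REMINDER
        return None
```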
[Table I: A/B comparison on SWE-bench Lite (Haiku 4.5, N=5 hard instances)]
The paper compares Baseline and Focus on metrics such as task success rate, total token consumption, average tokens per task, average number of compressions, and average number of discarded messages.
Core Finding: 22.7% Savings with Same Accuracy
Experimental results show that Focus achieved a 22.7% reduction in total tokens (14.9M → 11.5M) while maintaining the same accuracy as Baseline (3/5 = 60%). This contrasts with the paper's earlier experiments, where passive prompts showed a decrease in accuracy. The key difference is that aggressive prompting enforces frequent, structured compression (6.0 per task vs. 2.0 previously), preventing the context from being contaminated by stale exploration logs.
[Table II: Per-instance results: Token savings vs. accuracy]
The paper presents detailed comparisons for five instances, including token changes and compression counts for matplotlib-26020 (-57%), seaborn-2848 (-52%), pylint-7080 (+110%), pytest-7490 (-18%), and sympy-21171 (-57%).
Focus reduced tokens in 4 out of 5 instances, with savings ranging from 18% to 57%. The strongest savings occurred in instances requiring extensive exploration: matplotlib-26020 (-57%, 4.0M → 1.7M), seaborn-2848 (-52%, 3.4M → 1.6M), and sympy-21171 (-57%, 1.6M → 0.7M).
[Figure 1: Token consumption (Haiku 4.5, N=5 hard instances)]
The paper shows the total token consumption comparison for 5 hard instances. Focus reduced usage by 22.7% through aggressive model-controlled compression while maintaining the same accuracy.
Case Study: Maximum Savings vs. Compression Overhead
On matplotlib-26020, both agents passed the test suite, but Focus achieved a 57% token saving (4.0M → 1.7M). Focus compressed 5 times in 71 LLM calls, while Baseline used 102 calls without compression. The savings came from Focus efficiently summarizing its exploration phase—once it located the relevant files and understood the bug, it compressed that context and proceeded directly to implementation.
However, on pylint-7080, Focus used 110% more tokens than Baseline (4.3M vs. 2.1M), even though both agents passed the test suite. Analysis shows Focus performed 136 LLM calls vs. Baseline's 63, with 8 compressions discarding 80 messages. This problem required extensive trial and error, and Focus's compression occasionally discarded useful context, forcing re-exploration. This indicates that compression is not universally beneficial: tasks requiring iterative refinement may suffer from aggressive context pruning.
The Cognitive Tax and Limitations of Compression
Active compression introduces a "cognitive tax"—the token cost of generating summaries and the overhead of managing focus stages. Despite this tax, Focus achieved a net 22.7% token saving in the experiments. This tax is amortized over the task lifecycle: each compression costs a few hundred tokens, but saves thousands by not reprocessing stale history.
The paper identifies several limitations: the sample size is only N=5 hard instances, requiring validation on the full SWE-bench Lite benchmark (N=300); task-dependent benefits—Focus shows 50-57% savings on exploration-intensive tasks but a 110% overhead on one iterative refinement task; only Claude Haiku 4.5 was evaluated, and performance on other models is unknown; results depend on the optimized dual-tool scaffolding.
Final Thoughts
The paper demonstrates that aggressive, model-controlled context compression can achieve significant token savings without sacrificing task accuracy. Current LLMs appear to lack intrinsic cost awareness: they do not naturally optimize for token efficiency, so scaffolding that makes compression a first-class part of the workflow is needed.
Future work directions include: validating on the full SWE-bench (N=300) to characterize task-type dependencies; fine-tuning or reinforcement learning methods to internalize compression heuristics without explicit prompting; structured compression to preserve specific artifacts (test outputs, diffs) rather than free-text summaries; cross-model evaluation (GPT-4, Gemini, open-source models) to assess generalizability.
As context windows grow and agent tasks become more complex, active compression will become increasingly valuable for managing the inherent quadratic cost growth in autoregressive reasoning.
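The quadratic-versus-sawtooth intuition can be checked with a back-of-the-envelope cost model. With per-step cost proportional to current context length, a monotonically growing context incurs quadratic total cost, while a sawtooth that periodically retracts to a small knowledge block stays roughly linear. All numbers below are illustrative, not taken from the paper:

```python
def total_cost(steps, tokens_per_step=500, period=None, knowledge=2000):
    """Total tokens reprocessed over a run. If `period` is set, the context
    retracts to a fixed-size knowledge block every `period` steps, modeling
    the Focus sawtooth; otherwise the context grows monotonically."""
    ctx, cost = 0, 0
    for step in range(1, steps + 1):
        ctx += tokens_per_step     # new tool output appended this step
        cost += ctx                # the whole context is reprocessed
        if period and step % period == 0:
            ctx = knowledge        # retract to the persistent knowledge block
    return cost

baseline = total_cost(100)              # monotonic growth: quadratic cost
focus = total_cost(100, period=12)      # compress every ~12 tool calls
```

Even with a generous 2,000-token knowledge block retained after each compression, the sawtooth run reprocesses a small fraction of the baseline's tokens, and the gap widens as the run gets longer.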
Paper Title: Active Context Compression: Autonomous Memory Management in LLM Agents
Paper Link: https://arxiv.org/pdf/2601.07190