Albert Einstein once offered a piece of widely known advice: "Never memorize something that you can look up."
In today's AI landscape, this quote feels strikingly relevant.
Over the past two years, Deep Research agents have become one of the hottest tracks in the tech industry. From OpenAI's Deep Research to Anthropic's Claude with Extended Thinking, AI is evolving from a simple "chatbot" into a "research assistant" capable of independently completing multi-hour investigations.
However, the "memory systems" underpinning these capabilities have taken a path that leads to a dead end: frantically storing past search records. It's like a person trying to get smarter by stuffing every book they've ever read into their living room, only to spend half an hour digging through piles of books every time they need information.
This is not the kind of intelligence Einstein envisioned.
The Three "Silent Killers" of Memory Systems
Let's be blunt: the memory systems of most current Deep Research Agents are essentially filing cabinets.
You ask the AI to research, use tools, and write reports; it saves every operation as a "trajectory." The next time it encounters a similar problem, it digs out a few "most relevant" trajectories from the cabinet to use as references in the prompt.
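To make the "filing cabinet" concrete, here is a minimal sketch of this trajectory-store-and-retrieve pattern. The class names and the word-overlap relevance heuristic are illustrative stand-ins, not any particular product's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    query: str
    steps: list  # tool calls and intermediate results, stored verbatim

@dataclass
class TrajectoryMemory:
    store: list = field(default_factory=list)

    def save(self, traj: Trajectory) -> None:
        # every run is appended verbatim -- the store only ever grows
        self.store.append(traj)

    def retrieve(self, query: str, k: int = 3) -> list:
        # crude relevance: count shared words between queries
        def overlap(t: Trajectory) -> int:
            return len(set(query.lower().split()) & set(t.query.lower().split()))
        return sorted(self.store, key=overlap, reverse=True)[:k]

mem = TrajectoryMemory()
mem.save(Trajectory("compare GPU memory of A100 vs H100", ["search", "read", "answer"]))
mem.save(Trajectory("summarize 2023 EV sales in Europe", ["search", "answer"]))
hits = mem.retrieve("A100 GPU memory size")
print(hits[0].query)  # the GPU trajectory ranks first
```

Note what never happens here: nothing is compressed, nothing updates the model, and the store grows with every call, which is exactly the failure mode described below.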
Does this sound reasonable? In reality, there are three fatal flaws:
First, cost becomes a money pit. As usage grows, the store of memories balloons without bound: retrieval slows down and maintenance costs skyrocket. This isn't just a storage issue; it's a scalability crisis.
Second, the AI doesn't truly "internalize" experience. It stores a mountain of memories, yet the model parameters remain unchanged. Mistakes made today are repeated tomorrow if the question is phrased differently. It's like a student copying their mistake notebook ten times without ever truly understanding where they went wrong.
Third, it relies heavily on human supervision. To teach the system that "this path is good, that path is bad," someone must provide the correct answer. In the real, open world, where are there so many ready-made standard answers?
When these three issues combine, they create an awkward situation: the more complex the memory system becomes, the worse it often performs. For the AI, that pile of "historical records" may just be noise.
A "Tri-Brain" Architecture
A team from East China Normal University has proposed a new prescription for this dilemma: split the memory system into three roles, letting each do what it does best.
This framework, called MIA (Memory Intelligence Agent), abandons the traditional "single memory bank" design in favor of a Manager-Planner-Executor tri-architecture.
Manager (Memory Administrator): Instead of storing raw records, it stores only "compressed workflow paradigms." Think of a teacher writing a lesson plan: they don't copy the entire textbook, but retain only the core teaching framework.
Planner: This is a parametric model dedicated to "thinking" about what to do. It is not a search assistant, but a decision-making hub that internalizes historical experience into strategic capability.
Executor: It follows the plan, does the actual work, and handles interactions with external tools.
There is none of the tug-of-war common in "search agents" here. The Manager provides experiential references, the Planner decides how to proceed, and the Executor handles execution. Roles are clearly defined and decoupled.
But the real breakthrough lies in a loop:
Non-parametric memory (Manager) and parametric memory (Planner) can undergo bidirectional conversion.
When the Planner successfully completes a new task, the successful experience is compressed into a workflow and stored back in the Manager; effective paradigms in the Manager can then train the Planner's parameters via reinforcement learning. This isn't simple storage shuffling; it is the continuous internalization and reconstruction of cognition.
Like a researcher: the Manager is their literature management system, the Planner is their brain, and the Executor is their lab assistant. Only when the three work together can research capabilities grow with each project.
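The division of labor, including the store-back loop, can be sketched as follows. All class and method names are illustrative, and the string-based "workflow paradigms" stand in for whatever compressed representation the paper actually uses:

```python
class Manager:
    """Stores compressed workflow paradigms, not raw trajectories."""
    def __init__(self):
        self.paradigms = []          # e.g. "decompose -> search -> cross-check -> synthesize"

    def recall(self, task: str) -> list:
        return self.paradigms[-3:]   # hand the Planner a few recent paradigms

    def absorb(self, workflow: str) -> None:
        self.paradigms.append(workflow)

class Planner:
    """Parametric decision hub: turns a task plus paradigms into a plan."""
    def plan(self, task: str, paradigms: list) -> list:
        template = paradigms[-1] if paradigms else "search -> answer"
        return [step.strip() for step in template.split("->")]

class Executor:
    """Carries out each planned step via external tools."""
    def run(self, steps: list) -> list:
        return [f"done: {s}" for s in steps]   # placeholder for real tool calls

# one task cycle, including the store-back loop
manager, planner, executor = Manager(), Planner(), Executor()
manager.absorb("decompose -> search -> cross-check -> synthesize")
steps = planner.plan("research topic X", manager.recall("research topic X"))
results = executor.run(steps)
manager.absorb(" -> ".join(steps))   # successful run compressed back into memory
print(len(manager.paradigms))        # 2
```

The key design choice mirrored here is that the Manager only ever sees compact workflows, never raw tool output, so its store grows per successful pattern rather than per interaction.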
In the Gaps of Reasoning, the Model Suddenly "Gets It"
But this isn't even the most counter-intuitive part of MIA.
Traditional AI works like this: train on a bunch of data → model freezes → deploy for inference. Once online, parameters are locked, leaving it helpless against new problems.
MIA does something "highly irregular": it updates its own parameters during the inference process.
They named this mechanism Test-Time Learning (TTL).
How does it work? When the model faces a new problem:
Generate multiple candidate plans at once (Plan 1, Plan 2, Plan 3...).
Run each one and observe the results.
Reward good plans; penalize bad ones.
Update the Planner's parameters, then continue solving the problem.
The entire process is completed within the flow of solving the immediate problem, requiring no extra offline training cycles and no service interruption.
In other words, MIA's Planner is learning while answering your questions. The more it answers, the smarter it gets.
This addresses the most genuine pain point of Deep Research Agents: continuous evolution after deployment. It's no longer a case of "peak performance at launch, followed by a forgetting-curve decline."
Unsupervised Self-Evolution? They Simulated "Academic Peer Review"
But a thornier issue remains: in an open world, who judges the quality of these solutions?
If every instance requires a human to label the correct answer, the system still cannot scale.
The MIA team used a design that is seemingly circuitous but actually clever: simulating the "peer review" mechanism of academic conferences.
They arranged three "AI reviewers" to check different dimensions: whether the logic chain is coherent, whether information sources are reliable, and whether the task was truly completed. Then, a "Domain Chair" synthesizes the three opinions to give a final A/B verdict.
The most interesting part of this system is: it doesn't need to know what the standard answer is. As long as the process possesses "strict logic + credible sources + minimal hallucination," it counts as a good learning signal.
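The review-then-aggregate structure can be sketched like this. The three scoring heuristics are illustrative stand-ins for LLM judges, and the 0.7 threshold is an assumed value, not from the paper:

```python
def peer_review_reward(trace: dict) -> str:
    """Answer-free reward: three reviewers score different dimensions of a
    solution trace; a chair aggregates them into an A/B verdict."""
    def logic_reviewer(t):
        # is the reasoning chain coherent?
        return 1.0 if t.get("coherent_chain") else 0.0

    def source_reviewer(t):
        # are information sources reliable? (toy proxy: count of citations)
        return min(1.0, len(t.get("sources", [])) / 3)

    def completion_reviewer(t):
        # was the task actually finished?
        return 1.0 if t.get("answered_all_subquestions") else 0.0

    scores = [logic_reviewer(trace), source_reviewer(trace), completion_reviewer(trace)]
    chair_score = sum(scores) / len(scores)       # the "Domain Chair" synthesizes
    return "A" if chair_score >= 0.7 else "B"     # final verdict, no gold answer needed

good = {"coherent_chain": True, "sources": ["s1", "s2", "s3"], "answered_all_subquestions": True}
bad = {"coherent_chain": False, "sources": ["s1"], "answered_all_subquestions": False}
print(peer_review_reward(good), peer_review_reward(bad))  # A B
```

Notice that nothing in the function ever consults a reference answer; only process qualities are scored, which is what lets the signal scale without annotation.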
Is this supervised or unsupervised? The line genuinely blurs. But one thing is certain: it let MIA complete self-training without any human annotation, with performance steadily improving: from the first round to the third, multiple metrics kept climbing.
7B Parameters Defeating 32B: What This Number Means
After discussing concepts, let's look at hard numbers.
MIA showed across-the-board gains on 11 benchmarks. The most eye-catching result is this:
Using Qwen2.5-VL-7B as the executor, it achieved an average improvement of 31% across 7 datasets and beat Qwen2.5-VL-32B (nearly 5x the parameter count) by a margin of 18%.
Why can a small model defeat a large model?
Because the intelligence of memory architecture is surpassing the sheer advantage of parameter scale. MIA gives a 7B "body" the "experience learning ability" that previously might have required 70B.
Another noteworthy number: on the LiveVQA task, MIA improved by 101% over GPT-5.4 (from 21.5 to 43.1). LiveVQA is a benchmark for real-time information Q&A, heavily testing a system's ability to acquire dynamic knowledge. This indicates that MIA's collaborative architecture surpasses pure pre-trained large models when it comes to actively digging up fresh information.
Conversely, traditional "long-context memory" methods (like RAG, Mem0) performed even worse than the "no-memory" baseline. This isn't because the technology was implemented incorrectly, but because the mindset was wrong—remembering more does not equal becoming smarter.
What is the Essence of AI Memory?
On the last page of the paper, the authors quote that line from Einstein again, as a footnote to the entire project.
This is no coincidence. MIA's core insight is precisely this: let AI remember the path of "how to learn," rather than the "textual content of every search."
Traditional methods pile up memories, like a student frantically copying notes. MIA's method establishes "metacognition"—the ability to learn how to learn.
However, this architecture comes with a price. Test-Time Learning requires generating and verifying multiple plans simultaneously, making inference costs at least 3-4 times higher than conventional methods. The Manager needs to reside in VRAM, and a 32B "brain" implies significant hardware overhead.
Therefore, MIA is not suitable for real-time scenarios requiring "instant replies." It is better suited for tasks requiring deep research where minute-level waiting is acceptable: writing a financial report analysis, completing a competitor study, or diagnosing complex system failures.
In these scenarios, the trade-off of inference "cost" for "quality" is worthwhile.
Final Thoughts
Deep Research Agents are at a critical stage, moving from "usable" to "excellent."
Pipeline-style Memory RAG is hitting a ceiling, while agents with self-evolution capabilities are opening up new possibilities.
The significance of MIA lies not in how many points it scored, but in the new paradigm it provides: don't memorize everything, but learn how to process information; don't rely on human supervision, but learn to self-evaluate; don't just learn during training, but evolve during inference.
Perhaps this is what Einstein truly wanted to tell us: True wisdom lies not in the stock of knowledge in your brain, but in the incremental ability to acquire, process, and internalize new knowledge.
To this extent, AI may finally begin to possess "wisdom" similar to that of humans.
Advanced Learning
If you want to systematically master frontier technologies and applications of multimodal large models, I recommend my premium course.
The course covers mainstream multimodal architectures, multimodal Agents, data construction, training workflows, evaluation, and hallucination analysis, accompanied by multiple practical projects: LLaVA, LLaVA-NeXT, Qwen3-VL, InternLM-XComposer (IXC), TimeSearch-R video understanding, etc. It includes algorithm explanations, model fine-tuning/inference, service deployment, and core source code analysis.
The course is currently being updated. You can participate in the learning via my personal website or Bilibili classroom:
Bilibili Classroom (Click "Read Original Text" in the bottom left corner to jump directly): https://www.bilibili.com/cheese/play/ss33184
Official Website Link (requires a VPN to access from mainland China): https://www.tgltommy.com/p/multimodal-season-1