Is the Transformer Dead? DeepMind Is Betting on Another AGI Path

"Catastrophic forgetting," a ghost that has plagued the AI community for decades, may have been completely solved this time.

In the past year, AI has advanced by leaps and bounds, and this is no exaggeration. Just the achievements of Google DeepMind in a single year are dazzling:


But if DeepMind were to choose its most important research or product of 2025, the currently trending Nested Learning would certainly have a place.

After reading the paper, one netizen posted that it is the "sequel" to "Attention Is All You Need."

If the Transformer opened the era of Scaling, then Nested Learning may be opening the true era of AGI.

DeepMind co-founder Shane Legg is even more direct: the path to AGI is clear, and the latest step along it is Nested Learning.

Some netizens even said that if they were to leave a paper for future aliens, it would inevitably be this "Nested Learning."


If achieving AGI requires 2-3 breakthroughs, continual learning might be one of them, and Google has published multiple related papers.


However, these papers have a common author—

Ali Behrouz, a second-year PhD student in the Department of Computer Science at Cornell University and a research intern at Google Research (New York).


Transformer's Memory Deficiency

In many ways, the Transformer performs excellently: it scales, it has driven AI across boundaries, and it generalizes across tasks and domains.

But Google realized early on that the Transformer is not perfect:

1. Low efficiency in long-context processing

2. Limited abstraction knowledge hierarchy

3. Weak adaptability

4. Lack of continual learning capability

Especially the fourth point, which Ali believes is the most critical problem.

When we talk about "Continual Learning," we mean:

There is no training period, no testing period;

During the model's use, it continuously shapes new memories and abstract structures.

Humans are born this way.

But for today's large language models, there is almost no "continual learning" at all.

To illustrate how fundamental the problem is, Ali used a medical analogy: anterograde amnesia.

Patients with this condition have a very strange characteristic:

  • Their short-term memory is normal

  • Their long-term memory is still there

But the problem is: 👉 short-term memory cannot be transferred to long-term memory.


Therefore, they live forever in the "present."

New experiences come in and disappear after a while; the world changes, but their brains are no longer updated.

Now, apply this disease to LLMs.

You will find that large models and these human patients are in exactly the same state.

Today's large language models, their knowledge mainly comes from two parts:

Long-term knowledge learned during the pre-training phase,

and short-term information in the current context.

But between these two, there is almost no channel at all.

AI models cannot naturally consolidate what they just learned into reusable knowledge for the future.

Want it to really learn?

You can only: spend more money, train again, fine-tune again.

This is essentially no different from the state of a patient with anterograde amnesia.

The real problem is not that the parameters are not enough, not that the data is not large, nor is it just that the computing power is insufficient.

The essence of the problem is that between "short-term memory" and "long-term memory," there is no natural knowledge-transfer channel.

If this channel does not exist, so-called "continual learning" will always be just a slogan.

This leads to a core question: how can we build a mechanism that allows AI models to consolidate the experiences of the "present" into knowledge for the "future," just as humans do?


All AI is "Associative Memory"

If you want AI to truly have the ability for continual learning, you cannot avoid the most fundamental question:

How does the model "remember things"?

Ali's answer is not the Transformer, not the number of parameters, but a more primitive and fundamental concept: Associative Memory.

The so-called "Associative Memory" is the cornerstone of human learning mechanisms.

Its essence is to correlate different events or information through experience.

For example, you see a face and immediately think of a name; you smell a certain smell and recall a memory.

This is not logical reasoning, but the establishment of associations.

Technically, associative memory is a key-value mapping:

  • Key: clue

  • Value: content associated with it

But the key point is that the mapping relationship of associative memory is not pre-written, but learned.

From a certain perspective, the attention mechanism is essentially an associative memory system: it learns how to extract keys from the current context and map them to the most appropriate values to produce output.
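As a toy illustration of this key-value reading of attention (my own sketch, not code from the paper), a query acts as a clue that softly recalls the values stored under the most similar keys:

```python
import numpy as np

def softmax_attention(q, K, V):
    """Attention as associative memory: the query is a 'clue',
    and the output is a soft lookup over stored (key, value) pairs."""
    scores = K @ q / np.sqrt(K.shape[1])  # similarity between clue and stored keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # soft address: a distribution over memories
    return w @ V                          # blended recall of the associated values

# Three stored associations; a query close to the first key mostly recalls its value.
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[10.0], [20.0], [30.0]])
out = softmax_attention(np.array([5.0, 0.0]), K, V)  # close to 10, softly blended
```

The point of the analogy: nothing in this lookup is hard-coded; a trained model learns how to produce the keys, values, and queries themselves.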

If we not only optimize this mapping itself but also let the system meta-learn the initial state of this mapping process, what would happen?

Based on the understanding of associative memory, they proposed a general framework called MIRAS, which is used to systematically design memory modules in AI models.

The core idea of this framework is:

Almost all attention mechanisms, local memory structures, and even the optimizer itself can be regarded as special cases of associative memory.

In order to design a "learnable, nested memory system," we need to make four design decisions for the memory structure in the model:

  1. Memory Architecture

  2. Attentional Bias/Objective

  3. Retention Gate

  4. Learning Rule
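To make the four decisions concrete, here is a deliberately minimal sketch (my illustration, not the paper's implementation) in which each decision maps to one line of code: a linear map as the architecture, squared recall error as the attentional bias, exponential decay as the retention gate, and one gradient step as the learning rule:

```python
import numpy as np

class LinearMemory:
    """Toy instance of the four MIRAS design decisions (illustrative only):
    1. Memory architecture: a single linear map M
    2. Attentional bias / objective: squared recall error ||M k - v||^2
    3. Retention gate: exponential decay alpha on the old memory
    4. Learning rule: one gradient-descent step per (key, value) pair"""

    def __init__(self, d_key, d_val, lr=0.5, alpha=0.95):
        self.M = np.zeros((d_val, d_key))
        self.lr, self.alpha = lr, alpha

    def write(self, k, v):
        err = self.M @ k - v  # gradient direction of the recall objective
        self.M = self.alpha * self.M - self.lr * np.outer(err, k)

    def read(self, k):
        return self.M @ k

mem = LinearMemory(d_key=2, d_val=1)
k, v = np.array([1.0, 0.0]), np.array([3.0])
for _ in range(50):   # repeated writes converge toward recalling v
    mem.write(k, v)
```

Note how the retention gate trades perfect recall for forgetting: the memory converges near the stored value (about 2.73 here), not exactly on it.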


This framework can be used to unify the explanation of many existing attention mechanisms and optimizers.

Simply put: MIRAS allows us to model, combine, and optimize "memory" as a learning process, rather than just a static module.

Furthermore, optimizers can also be uniformly regarded as associative processes that "map the current gradient to historical information," and they can be re-modeled and generalized.

The optimizer is a kind of "memory module," a key component that enables the model to understand its learning history and make better decisions.

The optimization process and the learning algorithm/architecture are essentially the same concept; they simply operate on different contexts (gradients versus data) at different levels of the system.

They are also two interconnected components: the learning algorithm/architecture generates the context (i.e., the gradients) for the optimizer. This supports the idea of designing a dedicated optimizer for a specific architecture.
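Classic momentum already fits this description. In the sketch below (my illustration), the momentum buffer is literally a decaying memory of past gradients that the update rule "reads" at every step:

```python
import numpy as np

# Momentum viewed as an associative memory over gradient history:
# the buffer m is a decaying summary that maps the current gradient
# to historical information before the weights are updated.
def sgd_momentum_step(w, grad, m, lr=0.1, beta=0.9):
    m = beta * m + grad   # write: fold the new gradient into the memory
    w = w - lr * m        # read: update using the remembered history
    return w, m

w, m = np.array([1.0]), np.array([0.0])
for _ in range(3):
    w, m = sgd_momentum_step(w, grad=2.0 * w, m=m)  # gradient of f(w) = w^2
```

Seen this way, replacing the hand-written write rule with a learned one is what turns an optimizer into a trainable memory module.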

From this, Google's team explored the ways knowledge is transferred between different levels and proposed Nested Learning.


For example:

Assume a recurrent neural network (RNN) with a context length of L. When processing a text of length L, the RNN's state is updated L times;

if the outer layer is a module that updates only once per document (like a pre-trained model), then its update frequency is 1.

Therefore, we can say: the RNN is a "fast module," and the pre-trained model is a "slow module."

By combining modules of different frequencies, we can build a system that can learn at different time scales.
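The fast/slow split can be sketched in a few lines (my illustration; the update rules are placeholders, not the paper's): the inner state changes once per token, the outer state once per document:

```python
# Two modules at different update frequencies: a "fast" per-token state
# and a "slow" per-document state, as in the RNN vs. pre-trained model example.
def process(documents):
    slow = 0.0                           # updated once per document (frequency 1)
    fast_updates, slow_updates = 0, 0
    for doc in documents:
        fast = slow                      # fast module starts from the slow context
        for token in doc:
            fast += token                # updated L times for a length-L document
            fast_updates += 1
        slow = 0.5 * slow + 0.5 * fast   # slow module absorbs a document summary
        slow_updates += 1
    return fast_updates, slow_updates

fu, su = process([[1, 2, 3], [4, 5]])    # 5 token-level vs. 2 document-level updates
```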

Next, we define what a "nested system" is.

It consists of multiple independent sub-modules, each with:

Its own parameters (Parameter);

Its own context (Context);

Its own optimization objective (Objective);

Its own gradient flow (Gradient Flow).

These modules form a nested relationship based on the speed of their "update frequency."


If we consider each sub-module as an associative memory system, then the entire model can be viewed as a Nested Associative Memory System.

Furthermore, each such associative system can itself be composed of smaller optimization sub-processes, forming a recursive nesting.

After building a nested system composed of multiple levels, the most critical question arises:

How should knowledge be transferred between different layers?

There are several ways of knowledge transfer, and these mechanisms form the basis of "information flow" in the Nested Learning architecture:

  • Direct Conditioning: the output of the slow-layer (outer) model is directly used as the input condition for the fast-layer (inner) model.

  • Non-parametric Conditioning: relies on no additional parameters; the model's output directly depends on the context itself. Although there is no explicit parameter connection, the output is still strongly influenced by the inner layer's state.

  • Transfer via Backpropagation (Gradient-Based Transfer): the gradient itself constitutes the path for knowledge transfer; the objective's high-level judgment propagates backward to guide how the low level adjusts its parameters.

  • Initial State Transfer (Meta-Learned Initialization): the slow-layer model generates the initial state of the fast-layer model. The outer layer learns an initialization point that lets the inner layer quickly adapt to new tasks with a small number of updates.

  • Weight Generation (Hypernetwork): the slow-layer model directly generates the parameters of the fast-layer model. This is the essence of the hypernetwork.
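The last mechanism is the easiest to show in code. A minimal hypernetwork sketch (my illustration, assuming simple linear layers throughout): the slow layer's only job is to emit the fast layer's weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d_task, d_in, d_out = 3, 4, 2

# Slow-layer parameters: a linear map from a task embedding
# to the flattened weights of the fast layer.
W_hyper = rng.normal(size=(d_in * d_out, d_task))

def generate_fast_weights(task_embedding):
    flat = W_hyper @ task_embedding       # slow layer emits parameters...
    return flat.reshape(d_out, d_in)      # ...reshaped into fast-layer weights

def fast_forward(x, task_embedding):
    W_fast = generate_fast_weights(task_embedding)
    return W_fast @ x                     # fast layer runs with generated weights

y = fast_forward(rng.normal(size=d_in), rng.normal(size=d_task))
```

In a real hypernetwork both maps would be trained end to end; here the weights are random purely to show the shape of the information flow.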


The theory is important, but ultimately, it depends on whether these designs can bring performance improvements in real tasks.

Combining self-modifying sequence models with continuum memory systems, Google proposed the HOPE architecture under the Nested Learning paradigm.


They applied Nested Learning and the HOPE architecture to multiple task scenarios, especially focusing on two dimensions: "long context" and "continual learning."

Overall, HOPE matches or significantly surpasses existing comparison models on multiple core tasks, with especially clear advantages in continual learning and long-context settings. This reflects the potential of Nested Learning and continuum memory systems.


What Does This Really Mean?

Nested Learning is not just an architectural framework, but a paradigm for re-understanding deep learning.

There are also rumors from within Google DeepMind that they have already achieved a breakthrough in continual learning but have not released it for safety reasons.


If Nested Learning solves the continual learning capability, it might be the most important thing in the future.


DeepMind's silence might be more deafening than their papers.

Continual learning gives AI a terrifying ability: it no longer just responds to our instructions, but begins to filter what it considers important based on past experience. That is to say, it begins to have "preferences."

If Nested Learning truly solves catastrophic forgetting, then what we have personally opened may not only be a door to AGI, but also an unknown Pandora's box.

What is inside the box: a smarter tool, or an opponent that has not only learned to think but has also learned to hold grudges and preferences?

This time, the key is in Google's hands, but the future is in whose hands?

References:

https://www.youtube.com/watch?v=3WqZIja7kdA

https://www.youtube.com/watch?v=uX12aCdni9Q

