Tencent Technology Paper Interpretation Column, finding AI certainty at the intersection of code and commerce.
Text by Bo Yang
Edited by Xu Qingyang
Recently, Google's Nested Learning triggered a memory earthquake in the model world.
Many people have come to realize that large models don't have to be read-only weights that are sealed after training. They can also continue to change during inference. In Nested Learning, when the model reads new context, it doesn't just stuff the text into the attention cache for temporary lookup; instead, it allows itself to change parameters during the inference process, making new information part of its internal memory.
But while people were still digesting this idea, Nvidia gave a more radical answer on December 28, 2025, with a paper titled "End-to-End Test-Time Training for Long Context." Google's memory-enhanced route is still working on solving the memory problem, preserving important past information more completely. But Nvidia's researchers believe that memory is essentially learning, and "remembering" is "continuing to train."
Just as people don't remember the exact words from elementary school texts, but the feelings evoked by articles like "The Monument" deeply shape our later values.
Nvidia and Stanford researchers believe AI should work the same way.
01
Using Learning to Replace Attention-Based Memory
If you look back along the timeline, you'll find that TTT (Test-Time Training) is not an invention that appeared out of thin air.
As early as 2013, Mikolov et al. tried dynamic evaluation in language models. At the time, they unfroze the model and performed small-step gradient updates using the cross-entropy loss (CE) for next-word prediction on the test text—essentially the parameter learning loss objective most commonly understood in large language models—to adapt the parameters to the current style, topic, and local statistical patterns. Krause et al. in 2018 refined this to be more systematic and feasible.
In other words, even in the early days of neural language models, people had discovered that adjusting parameters during inference does not violate the basic logic of language modeling and can even bring benefits.
When analyzing Nested Learning, everyone discussed the innovation in memory. But few noticed its potential to replace the attention layer in the context of long contexts. The emergence of TTT-E2E makes this possibility even more explicit.
The brilliance of the Transformer over the past decade has been largely built on the attention mechanism. It turns every sentence read into an index (KV Cache), and every time it answers a question, it has to go back and precisely flip through old books. This mechanism is precise but very memory-intensive. Therefore, various improved approaches like group attention and linear attention have emerged, attempting to compress memory usage and increase the model's context length.
The TTT scheme, on the other hand, solves the problem of context processing by "internalizing" knowledge through weight updates rather than caching it. Regardless of how long the context is, its inference state size and per-token computational cost remain constant.
Therefore, in the TTT family, no matter how the context grows, latency (generation delay) stays essentially flat.
This is the core capability that lets TTT replace attention at inference time: remembering nearly unbounded context without latency growing alongside it.
However, the dynamic evaluation line has never truly become a mainstream deployment paradigm. This is because it was still immature in engineering at the time and was difficult to use effectively. The main gap lies in the inability to align the training and inference stages.
The training stage optimizes for "out-of-the-box performance with frozen parameters" and never incorporates "performing several updates during inference" into the objective function as part of the model's behavior. This makes engineering practice unstable. Without constraints, continuous updates risk catastrophic forgetting (learning new things while erasing old ones), parameter drift (the parameter distribution wandering somewhere strange), and overfitting to anomalous segments (parroting odd phrases) by default.
The main mitigation methods in early approaches were "small learning rates, few steps, and frequent resets." They allowed the system to be barely usable but also locked TTT into the scale of "short-term adaptation," making it difficult to develop into true long-term memory.
What Nested Learning / Titans did was make this logic feasible at the architecture level. By separating layers with different update frequencies and allowing each layer to update independently, it stabilized parameter updates. This also allowed TTT to evolve from short fine-tuning to long-term internal memory. Therefore, we can say it brought a stable long-term memory update method.
However, this comes at a cost. In the paper, Nvidia classifies Nested Learning and Titans under TTT-KVB, because their update objectives differ somewhat from traditional TTT: they are more like teaching the model "how to store" rather than directly teaching it "how to predict."
We all know that the ultimate goal of large language models is "predicting the next token," which is the original learning objective. The update objective of Nested Learning is typically to make the model reconstruct the corresponding value from a compressed representation (like a key), or to make the hidden state evolve consistently within the layer, all to build an internal memory structure that can be quickly indexed. This can indeed indirectly help the language model complete tasks, because better internal associative memory may lead to better predictions. But there is always a layer of distance between it and the ultimate goal.
The TTT-E2E proposed by Nvidia is closer to the original dynamic evaluation. Its test-time update objective is the next-token prediction cross-entropy (CE) at the end of the entire network. The method is end-to-end rather than layered: from start to finish it optimizes this single CE objective. When the loss function is the final task itself, whatever the model learns from the context more directly optimizes subsequent predictions. It is completely aligned with the model's ultimate goal.
To illustrate this difference clearly, they designed a "toy model" in the paper, removing all self-attention layers from the Transformer and leaving only multi-layer perceptrons (MLPs). This essentially downgrades the model to a "bigram model" that can only remember the previous word. In this setup, any long-term memory capability cannot come from attention or cache; it can only come from "updating weights during testing and compressing the context into parameters."
During testing, they let the model continuously practice when reading x1, x2, x3,...: using xt-1 to predict xt, calculating CE, and performing one small-step gradient descent on this loss.
This is like an explorer who can only see one meter ahead, guessing the next step from the step just taken, while having to traverse a 10-kilometer cave (working through the entire context, updating along the way).
With each step, you first predict: "Based on my sense of direction, should I see rocks or puddles next?"
Then take a step and see if the prediction is correct.
If it's wrong, adjust your posture and stride (gradient update).
In the cycle of "predict-correct-adjust," you change your "muscle memory" (weights).
By the 1000th step, although you can't see the boulder at the first step, the information about that boulder has been encoded in your current gait, center of gravity, and sense of direction. It has been passed down through 999 "predict-correct-adjust" cycles and integrated into your body.
As a result, this model without any attention cache, relying solely on the objective of "training to predict the next word," saw its loss curve (blue) drop rapidly as the reading length increased, closely tracking the curve of the full-attention Transformer (orange).
This means it encoded the context purely by modifying its own network parameters (MLP weights), achieving an effect almost identical to storing every token (full attention).
In contrast, the design intention of TTT-KVB is to serve as a direct replacement for the self-attention layer. Its core idea is still "key-value binding." That is, although it does not use the traditional attention mechanism to store KV Cache, it attempts to use a neural network to learn the mapping relationship between Key and Value.
This is like trying to draw every stone in the cave on a map for easy retrieval. Even information irrelevant to exiting the cave, like the texture of a boulder, gets drawn in, which makes its training comparatively less efficient.
The controlled experiments in the paper bear this out. Researchers replaced the intra-layer key-value binding objective of TTT-KVB with the end-to-end next-token prediction objective, and the language-modeling evaluation loss decreased noticeably.
The numbers show the change brought a real improvement: on a 760M-parameter model with an 8K context, TTT-KVB's loss was 2.818, while the variant trained with the end-to-end next-token prediction loss (TTT-E2E, all layers, MH) dropped to 2.806.
This improvement of 0.012 is actually a significant gap in language model evaluation. This shows that after the end-to-end transformation, the model is indeed more confident and better at predicting the next token. And long-context capability can truly be obtained purely through test-time learning, without relying on attention cache.
Under this logic, memory is no longer designed as a set of storage structures but is redefined as a continuous learning process. The value of memory does not lie in how completely the past is preserved, but in whether it can change your next judgment.
However, the problem with past dynamic evaluation was the lack of a stable engineering model. Since we want to use the same idea, how does TTT-E2E overcome these issues?
This is exactly the second thing Nvidia needs to do next: use meta-learning and a complete set of engineering safeguards to make this end-to-end test-time learning a stable and scalable context memory system.
02
The Echo of Meta-Learning and Engineering Stability
Meta-learning as a concept and practice actually appeared quite early; one explicit meta-learning lineage runs all the way to DeepMind's DiscoRL, released last year.
That lineage begins with MAML, proposed by Finn et al. in 2017. It nests two loops: the inner loop handles adaptive learning (gradient descent), and the outer loop makes that adaptation more effective (learning the gradient of the gradient). The outer loop acts as a reflection on the inner loop's steps, and through it the system learns how to learn efficiently.
What TTT-E2E does is use exactly this meta-learning scheme to stabilize its end-to-end test-time training.
Nvidia's researchers believe that the main problem with past dynamic evaluation lies in the "training-test mismatch." If we only train a frozen language model in the traditional way, and then suddenly require it to update parameters while reading during testing, the overall system will definitely be unstable, and catastrophic drift and forgetting are common. Therefore, the training stage must include the learning process of the inference stage, so that the model is accustomed to continuing to learn during inference when it leaves the factory.
This is when meta-learning comes into play. It helps the model learn how to update itself during training so that it can better answer subsequent questions. Specifically, it uses meta-learning to let the model find the initial parameters W0 that are most suitable for updates during inference.
Writing it as a more intuitive process is two loops nested together.
Inner loop: When the model reads a context, it gives a guess for the next word. Then immediately compares it with the actual next word that appears and updates its parameters. This is consistent with the training of traditional next-token prediction models.
Outer loop: during the training stage, it repeatedly simulates the "on-the-job state" for the inner loop. It gives the inner-loop model many segments of text, lets it make several small corrections in the same manner it would at test time, and then checks whether the inner loop's subsequent predictions really became more accurate and stable after those corrections. Only when the inner loop's parameter updates truly bring benefits does the outer loop reward them; if the updates cause drift or forgetting, the outer loop penalizes them. Over time, the model learns a factory state better suited to this regime: starting from these initial parameters, the inner loop's small corrections (gradient updates) are unlikely to damage it.
The teacher in the outer loop learns which directions of gradient updates are stable during test-time updates (preventing gradient explosions), which updates can quickly absorb context rules without destroying general capabilities (preventing catastrophic forgetting), and which initializations allow the same learning rate and the same number of steps to produce more reliable benefits (improving training efficiency). Then, all of these are integrated into the model's initial parameters.
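The nested structure can be sketched with a deliberately tiny example. This is a hypothetical 1-D regression, far simpler than TTT-E2E's actual setup; the task family, learning rates, and function names are all assumptions made for illustration. The key point it shows is the outer loop differentiating *through* the inner loop's gradient step:

```python
import random

# Minimal MAML-style sketch (hypothetical 1-D setup): each "task" is
# fitting a scalar target a with loss L(w) = (w - a)^2. The inner loop
# adapts with one gradient step; the outer loop updates the shared
# initialization w0 by differentiating through that inner step.
INNER_LR = 0.1

def inner_adapt(w0, a):
    grad = 2 * (w0 - a)           # dL/dw at the initialization
    return w0 - INNER_LR * grad   # one inner-loop correction

def meta_grad(w0, a):
    w1 = inner_adapt(w0, a)
    # chain rule through the inner step: d(w1)/d(w0) = 1 - 2*INNER_LR,
    # the "gradient of the gradient" term that plain fine-tuning lacks
    return 2 * (w1 - a) * (1 - 2 * INNER_LR)

random.seed(0)
w0, meta_lr = 5.0, 0.05
for _ in range(200):
    a = random.choice([-1.0, 1.0])    # sample a task
    w0 -= meta_lr * meta_grad(w0, a)  # outer-loop update
# w0 settles near 0: the initialization from which a single
# inner-loop step adapts best on average across tasks
print(abs(w0) < 1.0)  # True
```

The outer loop never optimizes for performance at w0 itself; it optimizes for performance *after* the inner loop's correction, which is exactly the training/inference alignment TTT-E2E needs.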
Meta-learning thus directly resolves the core engineering dilemma, making the end-to-end approach possible.
But this is only a possibility, not yet stable. To further ensure engineering feasibility, TTT-E2E still made multiple compromises in engineering as safety valves.
The first safety valve is mini-batch processing plus sliding window attention. In theory, updating the parameters after every single token is the finest-grained, most ideal form of online learning, but its cost is prohibitive. Yet if the token batch handed to the model is too large while the model has no short-term memory at all, tokens within a batch cannot see one another before the update lands, and the gradients grow increasingly wrong.
Therefore, TTT-E2E makes the batch size relatively small on the one hand. On the other hand, it retains sliding window attention as a method of short-term memory. Window attention is like a flashlight, ensuring that you can at least see the recent context within an update block, so that predictions within the block do not collapse.
The paper explicitly proposes a specification for window size and batch size, i.e., the window size k should preferably not be less than the block size b for test-time updates; otherwise, you will revert to a "locally amnesic" model within the block.
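That constraint is easy to state in code. The following scheduling sketch is illustrative, not taken from the paper; the function name and block layout are assumptions:

```python
# Illustrative sketch: split the token stream into update blocks of
# size b; within a block, predictions rely on a sliding window of k
# tokens. The paper's rule of thumb: k should be no smaller than b,
# or tokens early in a block fall outside every window before any
# weight update can capture them.
def update_schedule(num_tokens, block_size, window_size):
    assert window_size >= block_size, "window must cover an update block"
    blocks = []
    for start in range(0, num_tokens, block_size):
        blocks.append((start, min(start + block_size, num_tokens)))
    return blocks  # weights are updated once after each block

print(update_schedule(10, 4, 4))  # [(0, 4), (4, 8), (8, 10)]
```

With k ≥ b, every token in a block can at least see the block's earlier tokens through the window while waiting for the next weight update.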
The second safety valve aims to prevent catastrophic forgetting. They did not aggressively change all layers to TTT layers. Instead, they froze the embedding, normalization, and attention layers, and only updated the MLP. Moreover, they did not update the entire neural network each time, but only the last 1/4 of the blocks.
This way, the underlying general language capabilities and the read/write channels of attention remain unchanged, and TTT only acts as a controllable learning module in the upper layers. To further prevent online updates from erasing pre-trained knowledge, they also added a static second MLP in the updatable blocks. One MLP is responsible for writing the current context, and the other is responsible for preserving the factory capabilities.
This structurally quarantines catastrophic forgetting: parameters can still drift and overwrite past memories, but only within a designated writable region.
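A structural sketch of this freezing scheme might look like the following; the parameter names and exact layer split are hypothetical, chosen only to mirror the description above:

```python
# Structural sketch with hypothetical parameter names: decide which
# parameters receive test-time gradients. Embeddings, norms, attention,
# and the static twin MLP stay frozen; only the writable MLPs in the
# top quarter of blocks learn the context.
def trainable_at_test_time(name, layer_idx, num_layers):
    frozen = ("embed", "norm", "attn", "static_mlp")
    if any(key in name for key in frozen):
        return False
    in_top_quarter = layer_idx >= num_layers * 3 // 4
    return in_top_quarter and "writable_mlp" in name

NUM_LAYERS = 24
examples = {
    "embed.weight": 0,
    "blocks.23.attn.q_proj": 23,
    "blocks.23.static_mlp.w1": 23,    # preserves factory capabilities
    "blocks.23.writable_mlp.w1": 23,  # writes the current context
    "blocks.5.writable_mlp.w1": 5,    # below the top quarter: frozen
}
for name, idx in examples.items():
    print(name, trainable_at_test_time(name, idx, NUM_LAYERS))
```

Only the last entry in the top quarter of blocks comes back trainable; everything that carries general language ability stays read-only during inference.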
When these components are assembled, TTT-E2E finally achieves the unfinished goal of the earliest version of TTT, bringing it a complete engineering body.
So, how are its results?
03
Proving Itself with Loss
When looking at model training effectiveness, the most important thing is to observe the model's loss changes. Loss refers to the average loss of the language model on the next-word prediction task, generally the size of the cross-entropy CE mentioned above. The smaller it is, the more accurate the model's predictions.
In terms of memory, we look at the change of loss in the context. If the loss continues to decrease in a longer context, it indicates that the model is indeed using earlier information and predicting better. Conversely, if the context becomes longer but the loss does not decrease or even increases, it means the information is remembered but not used, which is like learning without thinking.
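This diagnostic, average loss as a function of context position, takes only a few lines to compute. The numbers below are made-up illustration data, not results from the paper:

```python
# Illustrative diagnostic (made-up numbers): average the per-token CE
# loss across documents at each context position. A model that truly
# uses its memory should show the curve trending down with position.
def mean_loss_by_position(nll_matrix):
    """nll_matrix[d][t] = CE loss of document d at position t."""
    num_docs = len(nll_matrix)
    num_pos = len(nll_matrix[0])
    return [sum(doc[t] for doc in nll_matrix) / num_docs
            for t in range(num_pos)]

nlls = [[2.0, 1.5, 1.2, 1.0],   # document 1
        [2.2, 1.7, 1.3, 1.1]]   # document 2
curve = mean_loss_by_position(nlls)
print(curve[0] > curve[-1])  # True: longer context, better predictions
```

A flat or rising curve on real data would be the "remembered but not used" failure mode described above.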
In this regard, the advantage of TTT-E2E is very obvious. When the context is extended to 64K and 128K, other types of architectures, such as Mamba 2 and Gated DeltaNet (linear-time models), start to fall behind. Even TTT-KVB cannot pull the curve back in longer contexts.
Only the TTT-E2E curve stays almost pinned in place, with no sign of its advantage diluting from 8K to 128K. For the others, the longer the context, the harder it is to learn; TTT-E2E instead gets better at using the context the longer it runs.
Moreover, it inherits the biggest advantage of learning into parameters: cost compression. With full attention, the longer the context, the higher the prefill latency, because every generated step must scan an ever-longer history. In contrast, the latency of SWA, RNN/SSM, TTT-KVB, and TTT-E2E stays almost flat: they learn the context in rather than repeatedly rereading the old context to process the new. On an H100, when prefilling 128K, TTT-E2E is about 2.7 times faster than full attention.
Another aspect is the convergence speed of the loss. The faster the loss converges, the more efficiently the model learns. Under both 32K and 128K lengths, TTT-E2E is the only method that can outperform full attention throughout the context range, and a large part of its overall advantage comes from earlier positions in the sequence.
This is exactly where "learning rather than storing" excels. The model does not wait until the end to retrieve a certain detail from memory; instead, it pushes the model towards a parameter region more suitable for the next prediction from the very beginning with each segment of context. It is not just memorizing, but also forming reading habits more suitable for this book while reading.
Of course, this method is not all-encompassing. TTT-E2E is still crushed by full attention in tests that require precise retrieval, such as finding a needle in a haystack. Linear routes, including TTT-E2E, have not performed well in long-context retrieval.
This is not contradictory. When memory is defined as "the prediction gain brought by learning," it is more like compression and summarization rather than verbatim archiving. For tasks like writing coherence, long-text understanding, and style constraints, this compression is very cost-effective. Using learning compression to exchange for the scalability of long contexts allows the model to run efficiently and economically at a scale like 128K, while also becoming genuinely better at predicting.
This is one of the core meanings of TTT.
Another factor that may restrict the implementation of this architecture is training cost. Even with various optimizations, TTT-E2E's training latency is still 50-100% higher than that of standard Transformers. This is acceptable at the scale of academic research, but when scaled to industrial-level training of trillions of tokens, this additional cost is slightly high.
04
Returning to the Original Learning May Better Meet the Expectations of Continuous Learning
The significance of the Nested Learning revolution is to once again bring "inference-time updates" from past silence into current discussion, allowing continuous learning to find a new focus.
The significance of TTT-E2E is not just another long-context solution, but redefining memory itself. Memory is not about moving the past into the present, but about letting the past change the future.
Today, as attention mechanisms push against physical limits under their quadratic cost, context windows keep growing, and people grow ever less willing to pay that cost, this route of treating memory as learning and learning as compression may become one of the most realistic engineering answers for continuous learning for quite some time.
It may not be omnipotent, but it is closer to our essential expectations of intelligence than any current memory solution: "Not to remember everything, but to be able to learn from everything to become smarter."