In the post-Transformer era, from "dropping positional encoding" to "external memory," what did Sakana AI get right?
When long context windows of 128K or even 1M tokens become standard for large models, many seem to take it for granted that simply extending the window will naturally yield long-text understanding.
Addressing this status quo, the Sakana AI team, led by Transformer co-author Llion Jones, recently released three consecutive papers directly targeting the model architecture itself.
This set of work does not continue with simple incremental patches on existing architectures. Instead, from three dimensions—removal of positional encoding (DroPE), reconstruction of position-awareness (REPO), and introduction of dynamic memory mechanisms (FwPKM)—it systematically questions and proposes reconstruction schemes for how Transformers handle long sequences.
They collectively point to a core viewpoint: the bottleneck in a model's ability to process long texts lies not in the window size being insufficient, but in the fact that existing attention mechanisms and static parameters cannot effectively adapt to dynamic needs during inference.
The Generalization Dilemma of RoPE in Long Texts
Since Llama's popularization, Rotary Positional Encoding (RoPE) has become the standard for large models.
RoPE encodes absolute positional information as the rotation angle of vectors, giving the model a notion of relative position. Its core calculation is:

$$f_q(x_m, m) = R^d_{\Theta,m} W_q x_m, \qquad f_k(x_n, n) = R^d_{\Theta,n} W_k x_n$$

where $R^d_{\Theta,m}$ is a block-diagonal rotation matrix whose $i$-th $2\times 2$ block rotates by the angle $m\theta_i$, with $\theta_i = 10000^{-2(i-1)/d}$.

Here, the identity $(R^d_{\Theta,m})^\top R^d_{\Theta,n} = R^d_{\Theta,n-m}$ makes the attention score depend only on the relative distance $n - m$ between tokens.
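As a concrete sketch of this relative-position property (a toy NumPy implementation of ours, not the papers' code), rotating consecutive dimension pairs by position-dependent angles makes the dot product of two rotated vectors depend only on their position difference:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE: rotate consecutive dimension pairs of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: <R_m q, R_n k> == <q, R_{n-m} k>
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
lhs = rope_rotate(q, 5) @ rope_rotate(k, 9)
rhs = q @ rope_rotate(k, 9 - 5)
assert np.allclose(lhs, rhs)
```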
To let the model process texts longer than its pre-training window, the current mainstream approach in the industry (e.g., YaRN, PI) is to scale the rotation frequencies, schematically $\theta_i' = \theta_i / s$ (or, equivalently for plain interpolation, rescaling position indices as $m' = m / s$), where $s$ is the length-extension ratio.
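In its simplest PI-style form, this scaling just compresses position indices by the extension ratio before the rotation is applied (a minimal sketch with illustrative constants, not any paper's exact scheme):

```python
# PI-style position interpolation: compress indices by the extension ratio s
# so that positions beyond the training window map back inside it.
train_len, target_len = 4096, 8192
s = target_len / train_len              # extension ratio, here 2.0

def interpolated_pos(m: float) -> float:
    return m / s                        # e.g. raw position 8000 -> 4000.0

# Every interpolated position stays within the trained range [0, train_len).
assert all(interpolated_pos(m) < train_len for m in range(target_len))
```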
However, in Sakana AI's research [1], heatmap analysis revealed a key issue: this scaling strategy is not a lossless mathematical trick but genuinely lossy compression of the data.
Figure 1. Visualization shows that methods like YaRN, when processing long texts, actually force attention to be limited within the training length window (similar to a soft truncation), preventing the model from effectively retrieving distant information outside the window.
Beyond failing to see far, the more serious problem is seeing incorrectly.
Figure 2. This diagram intuitively shows the side effects of RoPE Scaling on semantic understanding.
In the NIAH test, when YaRN is used to extend the context, the attention heads responsible for capturing specific semantics (such as key-value pairs), the so-called semantic heads, exhibit a significant attention mass shift.
The forced scaling of positional encoding interferes with the model's semantic matching of content, causing it to retrieve the wrong information.
DroPE
Paper Title: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings
Paper Link: https://arxiv.org/pdf/2512.12167
Code Link: https://github.com/SakanaAI/DroPE
Addressing the limitations of RoPE in extrapolation, Sakana AI proposed a counterintuitive hypothesis: positional embeddings may only be necessary during the training phase, and when processing long texts during inference, they may actually be an obstacle.
This is known as the Scaffolding Theory. Just as scaffolding must be erected when constructing a building, positional embeddings provide the necessary inductive bias for the model in the early stages of pre-training, helping the model understand sequence order and accelerate convergence.
However, after the "building" (model capability) is completed, retaining the scaffolding (positional embeddings) continues to block the view (limiting extrapolation capabilities).
1. Why can't we just remove PE?
The paper first examines whether positional encoding can be dispensed with entirely (NoPE) by analyzing the Attention Positional Bias (APB).
Experimental data show that at initialization, without explicit positional encoding, the gradient norm of the attention matrix is extremely small; the model struggles to capture the causal structure of the sequence, and training barely converges.
Therefore, RoPE must be retained during the pre-training phase.
2. Removal and Recalibration During Inference
The DroPE (Dropping Positional Embeddings) recipe is concise:
1. Complete pre-training using RoPE normally.
2. After pre-training ends, completely remove all positional embeddings.
3. Use data from the original context window (e.g., 4K) for extremely short recalibration training to allow the model to adapt to the inference mode without positional embeddings.
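The three steps above amount to a single switch on the attention score at inference time, as this toy NumPy sketch illustrates (`rope` and `attn_score` are our own illustrative names, not the paper's code):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Standard RoPE rotation of dimension pairs by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    a = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(a) - x2 * np.sin(a)
    out[1::2] = x1 * np.sin(a) + x2 * np.cos(a)
    return out

def attn_score(q, k, pos_q, pos_k, use_rope=True):
    """DroPE at inference = the same attention with use_rope=False."""
    if use_rope:
        q, k = rope(q, pos_q), rope(k, pos_k)
    return float(q @ k)

rng = np.random.default_rng(1)
q, k = rng.normal(size=32), rng.normal(size=32)
# Without RoPE the score is position-independent: pure semantic matching,
# so a "needle" 700 tokens away scores the same as an adjacent one.
assert attn_score(q, k, 3, 700, use_rope=False) == attn_score(q, k, 0, 0, use_rope=False)
```

The short recalibration pass on original-window data is what lets the pretrained weights adapt to this position-free scoring.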
3. Experimental Results
Experimental results show that models processed with DroPE can generalize to ultra-long texts without fine-tuning on long-text data.
Figure 3. In the 2x context length (8K) Multi-Query NIAH (Needle in a Haystack) task, the accuracy of RoPE-Base collapsed to around 0%, while DroPE maintained retrieval accuracy of nearly 100%.
The data in the table below further quantifies this difference. In the more difficult Multi-Key retrieval task, the accuracy of RoPE+YaRN was only 0.5%, while DroPE reached 41.6%.
Table 1. Performance comparison of DroPE and RoPE variants in 2x long-text extrapolation.
This indicates that when the interference of positional embeddings is removed, the Transformer can rely more purely on semantic relevance for retrieval, thereby releasing the suppressed long-distance capture capability.
REPO
Paper Title: REPO: Language Models with Context Re-Positioning
Paper Link: https://arxiv.org/pdf/2512.14391
Code Link: https://github.com/SakanaAI/repo
DroPE chose to "do subtraction" to solve the extrapolation problem, while REPO attempts to "do addition"—reconstructing position-awareness.
In this paper, the Sakana AI team raises a core question: why must a token's position index be the fixed integers 0, 1, 2, 3, ...?
1. Introduction of Cognitive Load Theory
The paper borrows Cognitive Load Theory from cognitive science. The authors argue that in natural language, many function words and filler words carry no key information.
Forcing these uninformative tokens to take linearly growing position indices only adds irrelevant cognitive load for the model.
2. Content-Aware Position Generation Module
REPO introduces a lightweight differentiable module $g_\phi$ that no longer relies on a predefined integer sequence but dynamically generates position values from token hidden states. The calculation is:

$$p_t = g_\phi(h_t)$$

Here, $h_t$ is the representation of the current token, which a gating mechanism processes and maps to a scalar position $p_t$.

This dynamically generated $p_t$ is then substituted into the RoPE formula in place of the original integer index $t$, i.e., $R_{\Theta, t}$ becomes $R_{\Theta, p_t}$.
At this point, the relative distance in the attention mechanism becomes a dynamic variable based on semantic content.
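A toy version of the idea (our own minimal sketch with made-up weights, not the paper's module): a sigmoid gate decides how much positional weight each token carries, so filler tokens can receive near-zero or repeated position values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_g, W_p = rng.normal(size=d), rng.normal(size=d)   # toy gate / projection weights

def repo_position(h):
    """Content-aware position: a gate scales a scalar projection of the
    hidden state, so uninformative tokens can get near-zero positions."""
    gate = 1.0 / (1.0 + np.exp(-(h @ W_g)))          # sigmoid gating
    return gate * (h @ W_p)                          # scalar position value

H = rng.normal(size=(6, d))                          # 6 token hidden states
p = np.array([repo_position(h) for h in H])
# These dynamic scalars replace the integer indices 0..5 inside RoPE's rotation.
assert p.shape == (6,)
```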
3. Visualization: Nonlinear Position Distribution
The position distribution after REPO training exhibits very interesting characteristics.
Figure 4. The vertical axis is the dynamic position assigned by REPO; the horizontal axis is the original linear position. The assigned positions do not form a straight line but fluctuate visibly: the model learns to adjust tokens' logical positions based on content, and some punctuation marks are even assigned negative or identical values, effectively folding away uninformative spans.
In the evaluation of comprehensive capabilities, REPO shows clearly targeted strengths.
Figure 5. Radar chart compares the performance of REPO with RoPE, NoPE, and other benchmark models across different tasks.
It can be seen that REPO is comprehensively superior in three dimensions: noisy context, structured data, and long context, while maintaining performance comparable to RoPE in general short-text tasks.
FwPKM
Paper Title: Fast-weight Product Key Memory
Paper Link: https://arxiv.org/pdf/2601.00671
The first two papers mainly optimize position-awareness inside the attention mechanism, while the third, FwPKM, tackles a fundamental shortcoming at the architecture level: the Transformer lacks an external memory module that can be read and written in real time and whose capacity can be expanded.
1. From Static PKM to Dynamic Fast Weights
Traditional Product Key Memory (PKM) expands model capacity through large-scale retrieval of key-value pairs, but it consists of Slow Weights: updated only during the training phase and frozen at inference.
Sakana AI proposes FwPKM, which transforms it into a Fast Weights system. Its core innovation is: during the inference phase, the model updates the parameters of the memory module in real-time based on the current input data.
Figure 6. FwPKM architecture diagram.
2. Real-time Writing Based on Gradients
FwPKM uses local reconstruction error as a signal to perform one or more gradient descent steps during the forward propagation process.
The parameter update rule is, schematically:

$$V \leftarrow V - \eta \, \nabla_V \mathcal{L}_{\text{recon}}$$

where $\mathcal{L}_{\text{recon}}$ is the local reconstruction error on the current segment and $\eta$ is the inner-loop learning rate.
When the model reads a new text segment, it is not only calculating Attention but also directly writing this information into the Value matrix (Fast Weights) of FwPKM through gradient updates, while keeping the Key matrix as a stable addressing benchmark.
To prevent memory collapse (i.e., all queries pointing to the same key), FwPKM introduces an Addressing Loss that maximizes the entropy of the marginal addressing distribution $\bar{a}$ (the batch average of the softmax scores over keys):

$$\mathcal{L}_{\text{addr}} = -H(\bar{a}) = \sum_j \bar{a}_j \log \bar{a}_j$$
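A toy version of such a gradient-based write (our own minimal NumPy sketch under simplifying assumptions: a flat key table instead of product keys, and the addressing distribution held fixed during the inner gradient steps):

```python
import numpy as np

rng = np.random.default_rng(0)
n_keys, d = 64, 16
K = rng.normal(size=(n_keys, d))      # keys: the stable addressing benchmark
V = np.zeros((n_keys, d))             # values: fast weights, written at inference

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def read(K, V, x):
    a = softmax(K @ x)                # addressing distribution over keys
    return a, a @ V                   # retrieved memory content

def write(K, V, x, lr=0.5, steps=3):
    """Inference-time write: a few SGD steps on the local reconstruction
    error 0.5 * ||a @ V - x||^2, updating only V (keys stay frozen)."""
    for _ in range(steps):
        a, out = read(K, V, x)
        err = out - x                 # local reconstruction error
        V = V - lr * np.outer(a, err) # exact gradient of the loss w.r.t. V
    return V

x = rng.normal(size=d)
_, before = read(K, V, x)
V = write(K, V, x)
_, after = read(K, V, x)
# After the write, reading the same input reconstructs it much better.
assert np.linalg.norm(after - x) < np.linalg.norm(before - x)
```

Running `write` more than once on the same segment is exactly the "iterative reading" discussed next: each extra pass shrinks the residual error further.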
3. Iterative Reading: Reviewing to Improve Memory Quality
Since memory is dynamically written during inference, FwPKM verifies a phenomenon similar to human cognition: reviewing (Iterative Reading) can significantly improve memory effects.
Figure 7. In the 128K-length NIAH test, single-pass reading (1-iter) performed only moderately, but enabling 2-iter or 3-iter reading produced a qualitative leap in accuracy, reaching SOTA level. This confirms that with multiple rounds of test-time training, the model grasps long-context information more firmly.
Finally, the table below makes FwPKM's unique niche among memory mechanisms clear: it is the only architectural solution that combines large storage capacity with inference-time memory writing.
Table 2. Characteristic comparison of FwPKM with standard Attention and traditional PKM.
Summary and Outlook
These three works are not isolated optimizations but reflect a clear technical shift: from static fitting during pre-training to dynamic adaptation during inference.
DroPE proves that for long-text inference, removing artificially designed static position constraints can instead release the model's ability to capture deep semantics.
REPO proposes that positions themselves should not be fixed but should be generated in real-time based on content to reduce the model's cognitive load.
FwPKM further introduces Test-Time Training, enabling the memory module to update and expand in real-time during the inference process.
Such architectural exploration indicates that to solve long-text problems, besides relying on hardware to stack context lengths, a more essential solution may lie in giving the model the ability to adjust its own state in real-time during the inference phase.
This provides a more efficient evolutionary direction for the design of next-generation large models than simply expanding memory.
References
[1] Gelberg, Y., Eguchi, K., Akiba, T., & Cetin, E. (2025). Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings. arXiv preprint arXiv:2512.12167.
[2] Li, H., Zhao, T., & Sproat, R. (2025). REPO: Language Models with Context Re-Positioning. arXiv preprint arXiv:2512.14391.
[3] Zhao, T., & Jones, L. (2026). Fast-weight Product Key Memory. arXiv preprint arXiv:2601.00671.