From the Architectural Innovations of 'Titans+MIRAS' and 'Nested Learning' to the NeurIPS 2025 Best Paper 'Gated Attention'

Since the publication of "Attention Is All You Need" in 2017, the Transformer architecture, with its advantages in parallelization, powerful sequence-modeling capability, and a mechanism that captures global context without structural bias, has undeniably driven the rapid development of artificial intelligence in the connectionist, deep-learning tradition.

Although the Transformer was originally inspired by language sequence modeling, it now serves not only as the cornerstone of natural language processing but also as the backbone network for representing more modalities and cross-spatiotemporal structures of the real physical world, such as the Diffusion Transformer (DiT) behind Sora, which combines the advantages of diffusion mechanisms and the Transformer architecture. It has further given rise to the architectural prototype of today's generative large language models (AIGC LLMs), and has led people to seriously consider, and look forward to, the possibility of artificial general intelligence (AGI).

The core of the Transformer, Self-Attention, simulates the dynamic association and aggregation of information in an elegant, adaptive, structured, and data-driven way (the "structure" element was long overlooked, yet it plays a key, more abstract role in the model's learning and training mechanics and in the efficiency of forward and backward propagation). It can be said that its emergent in-context learning, instruction following, and even preliminary reasoning abilities mark a key turning point in AI's shift from simple pattern recognition toward general-purpose generation and cognitive reasoning systems.

However, as model parameters and capabilities have scaled up along scaling laws, the once "invincible" Self-Attention has inevitably run into increasingly prominent bottlenecks of load and complexity. I believe these problems stem from methodological and architectural limitations of its original design, including:

Quadratic computational and memory complexity, i.e., the well-known computational complexity O(L²), leading to insurmountable physical barriers when processing ultra-long sequences;

The normalization constraint of Softmax, which guarantees the weights sum to 1 but unexpectedly produces pathological distributions such as the "attention sink," causing the model's focus to dissipate over long-range information;

The deterministic forward pass formed by feedforward networks and attention layers leaves the model essentially static after pre-training, making true continual learning and dynamic self-optimization difficult;

More fundamentally, the traditional Transformer, as a "flattened" computational graph, updates all parameters at similar frequencies and lacks any analogue of the human brain's multi-timescale learning and memory consolidation, leading to knowledge ossification and difficulty adapting to ever-changing task streams.

Facing these fundamental challenges, academia and industry have not stopped exploring "post-Transformer era" architectures in recent years. These explorations unfold along several distinct yet intertwined axes, jointly sketching a blueprint that transcends the classical paradigm:

In terms of training and inference efficiency, state space models represented by "Mamba" and dynamic sparse attention represented by "DSA · DeepSeek Sparse Attention", starting respectively from selective state mechanisms and hardware-aware sparse routing, aim to reduce the computational complexity of core sequence modeling from O(L²) to linear O(L×k), thereby unlocking the ability to process millions of tokens or even longer contexts. I believe this is also a redesign of, and breakthrough in, the Transformer's physical architecture.

In terms of the dynamism and adaptiveness of training and inference, work represented by the "Titans" framework and its variants explores letting models modify their own weights at inference time (e.g., the Memory-as-Context variant, MAC) or introducing dynamic external memory filtering (such as the recent Titans+MIRAS). This line of research aims to endow models with the meta-learning, dynamically adaptive capability of "test-time memorization," breaking their static nature and enabling them to self-adjust and continuously adapt to the immediate context.


In terms of further model architecture systems and learning paradigms, "Nested Learning" proposes a more fundamental reflection — it attempts to deconstruct the entire model training process into a set of "nested optimization problems" with different update frequencies, treating optimizers, attention layers, etc. as "associative memories" at different levels. This paradigm attempts to provide a unified white-box theoretical framework for designing systems with multi-timescale learning and self-evolution capabilities, pointing us toward building more biologically plausible intelligent systems in the future. Last week, I wrote some analysis about "Nested Learning"; those interested can read it:

A Brief Discussion on Google Research's Latest Achievement 'Nested Learning': Reconstructing the Theoretical Paradigm of Deep Learning Architecture

Exploring Google Research's 'Nested Learning' Again: The Elegance and Brutal Beauty Different from Transformer


Against this background, I have decided to recommend a recent NeurIPS 2025 best paper, "Gated Attention for Large Language Models" (from Alibaba's Tongyi Qianwen team), and to start from a small but significant piece of work that I believe holds great value for the future.


"Gated Attention" does not directly pursue a disruptive paradigm or architectural innovation, nor does it adopt wholesale dynamic external memory filtering like Titans+MIRAS. Instead, it takes a technical path of internal optimization of a core component: systematically conducting empirical analysis and targeted enhancement of the most mature module in the Transformer, Softmax attention. Its core discovery is that applying a head-specific Sigmoid gate, generated from the query vector, after the standard attention output brings multiple significant benefits, including effectively alleviating the attention-sink phenomenon, improving performance on various benchmark tasks, and enhancing training stability.

At the same time, I believe the significance of this research lies in showing through rigorous large-scale experiments that in-depth analysis and fine-tuning of core computing units in existing mature model architectures is an efficient and practical way to unlock their potential performance and correct known shortcomings. The dynamic modulation introduced by the gating mechanism essentially adds a flexible nonlinear filtering link to the attention output, which enhances the module's expressive power while also improving the dynamic characteristics of information flow.

Therefore, deeply analyzing this work on gated attention not only helps understand a specific and effective technical improvement but also allows us to recognize in the macroscopic architectural evolution landscape that "continuous deepening understanding and lean optimization of basic components" is as important as "disruptive architectural innovations." In the process of exploring next-generation artificial intelligence infrastructure, such research combining theoretical inspiration with engineering practice is a key component in driving the entire field forward steadily.

Let's look at this "Gated Attention" paper together.

First, the gating mechanism should be familiar to everyone: it has a long history in neural networks, from LSTM to modern state space models and linear attention. However, its specific mechanism and contribution are often conflated with other architectural improvements. The paper "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free", published by Alibaba's research team and their collaborators, conducts what is so far the most comprehensive empirical study of gating variants in softmax attention, through large-scale, systematic experiments.

The study found that applying "Element-wise SDPA Gating", an extremely simple modification, after the standard scaled dot-product attention (SDPA) output brings significant performance improvements and enhanced training stability, and effectively eliminates the "Attention Sink" phenomenon. Next, I will attempt a simple interpretation of this research in terms of its systematic experimental design, key findings, mechanism attribution, and impact on model capabilities.

Core Method and Systematic Exploration

The research did not propose a completely new, complex architecture but adopted a scientific method of "deconstruction and attribution" to isolate the individual effects of the gating mechanism. The researchers introduced gating operations at five key positions in the Transformer attention layer: after the query, key, and value projections, after the SDPA output, and after the final output layer. For each position, they further explored:

① Gating granularity ("Element-wise" vs "Head-wise");
② Head sharing ("Head-Specific" vs "Head-Shared");
③ Combination method ("Multiplicative" vs "Additive");
④ Activation function (Sigmoid vs SiLU).

This multi-dimensional ablation design was trained and evaluated on up to 3.5 trillion tokens, on both a 15B MoE model and a 1.7B dense model, ensuring the robustness and scalability of the conclusions. The core finding is highly consistent and striking: applying "Multiplicative" + "Head-Specific" + "Element-wise" Sigmoid gating after the SDPA output, i.e., "G1," has the most significant effect. This configuration is called "Element-wise SDPA Gating," as shown in the figure below:

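To make the winning "G1" configuration concrete, here is a minimal NumPy sketch of element-wise, head-specific Sigmoid gating applied to the SDPA output. All function and weight names (`gated_sdpa`, `w_gate`, etc.) are my own illustration under the paper's description, not the authors' code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_sdpa(x, wq, wk, wv, w_gate, n_heads):
    """Causal multi-head SDPA with an element-wise, head-specific Sigmoid
    gate multiplied onto the SDPA output (the paper's "G1" position)."""
    t, d = x.shape
    hd = d // n_heads  # per-head dimension

    def split(z):  # (t, d) -> (heads, t, hd)
        return z.reshape(t, n_heads, hd).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    causal = np.triu(np.ones((t, t)), k=1).astype(bool)
    scores[:, causal] = -1e9                 # mask future positions
    attn = softmax(scores) @ v               # standard SDPA output

    # Gate: one sigmoid scalar per channel, per head, derived from the
    # same input the query comes from ("element-wise", "head-specific").
    gate = 1.0 / (1.0 + np.exp(-split(x @ w_gate)))
    attn = attn * gate                       # multiplicative modulation

    return attn.transpose(1, 0, 2).reshape(t, d)

# Toy usage on random data
rng = np.random.default_rng(0)
d, heads, t = 64, 4, 8
x = rng.standard_normal((t, d))
wq, wk, wv, wg = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
out = gated_sdpa(x, wq, wk, wv, wg, n_heads=heads)
assert out.shape == (t, d)
```

Note the only extra cost over standard attention is one projection and one element-wise multiply, which matches the paper's claim of minimal overhead.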

Key Findings: Compound Benefits Beyond Performance Improvement

1. Consistent Performance Gains: On both MoE and dense models, "G1·SDPA" stably reduces test perplexity (PPL) by more than 0.2 and brings significant accuracy improvements on multiple benchmarks such as MMLU and GSM8K (see table below). Its effect even surpasses parameter-expansion baselines such as simply increasing the number of KV heads or experts.


2. Enhanced Training Stability: The paper found that introducing gating significantly reduces loss spikes during training, allowing larger learning rates (e.g., from 4e-3 to 8e-3) and batch sizes (see table below). This stability has real practical value for safely scaling model size and improving training efficiency.


3. Elimination of Attention Sink: The paper verified that gating effectively alleviates two known problems:

Attention Sink: In the baseline model, on average 46.7% of attention scores flow to the first token of the sequence, while G1 SDPA gating reduces this proportion to 4.8%, making the attention distribution healthier, as shown in the figure and table below:



Reduction of Massive Activation: Gating also significantly reduces abnormally large activation values in hidden states, i.e., "Massive Activation," which may directly contribute to improved training stability.

4. Improved Long Context Extrapolation: In experiments extending the model context length from 4K to 128K, models with gating performed significantly better than baselines in ultra-long context (64K, 128K) evaluations (see table below). This shows that eliminating attention sink helps models better generalize to longer sequences unseen during training.


Mechanism Attribution: Dual Role of Non-linearity and Sparsity

The paper did not stop at describing phenomena but deeply analyzed the internal reasons for gating effectiveness, attributing it to two core mechanisms:

1. Introduction of Key Non-linearity: In multi-head attention, the consecutive value projection matrix W_V and output projection matrix W_O are equivalent to a single low-rank linear mapping. Introducing the G1 gate after the SDPA output, or the G2 gate after the value projection, essentially inserts a nonlinear function between these two linear layers, thereby enhancing the expressive power of this path (see the two formulas below). This also explains why adding gating after the final output (G5) is ineffective: it does not break the linearity between W_V and W_O.


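In symbols (my own notation, reconstructing the argument rather than quoting the paper's exact formulas): without gating, the per-head output reaches the residual stream through two back-to-back linear maps,

```latex
\mathbf{o}_i = A_i \, X W_V^{(i)}, \qquad
\mathrm{Out} = \big[\mathbf{o}_1; \dots; \mathbf{o}_h\big]\, W_O ,
```

so $W_V^{(i)}$ and the corresponding slice of $W_O$ compose into one low-rank linear map. G1 gating inserts a query-input-dependent non-linearity between them:

```latex
\mathbf{o}_i = \sigma\!\big(X W_{\theta}^{(i)}\big) \odot \big(A_i \, X W_V^{(i)}\big),
```

where $\sigma$ is the element-wise Sigmoid and $W_{\theta}^{(i)}$ is the learned head-specific gate projection. A gate applied after $W_O$ (G5) leaves the $W_V W_O$ composition linear, which is consistent with its observed ineffectiveness.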

2. Introduction of Query-Dependent Sparsity: Analysis found that the most effective G1 gate produces highly sparse gating scores (mean of approximately 0.116), and that this sparsity depends on the current query (see table and figure below). The sparsity acts as a dynamic filter that suppresses context information irrelevant to the current query, which is the direct reason Attention Sink is eliminated. Experiments show that if gating scores are forced to be non-sparse (e.g., using an NS-sigmoid with range [0.5, 1]) or shared across heads, the performance gains are significantly weakened.

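A tiny numerical illustration of why the gate's reachable range matters (my own toy, assuming one simple way to realize a [0.5, 1] "NS-sigmoid"): a standard Sigmoid can drive a gate arbitrarily close to 0 and fully suppress a channel, while a gate squeezed into [0.5, 1] can at best halve it and so can never be sparse.

```python
import numpy as np

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])   # pre-activation gate logits

sigmoid = 1.0 / (1.0 + np.exp(-z))          # range (0, 1): can suppress to ~0
ns_sigmoid = 0.5 + 0.5 * sigmoid            # squeezed into (0.5, 1): never sparse

print(sigmoid.round(3))      # strongly negative logits give near-zero gates
print(ns_sigmoid.round(3))   # the same logits never push a gate below 0.5

assert sigmoid.min() < 0.01          # true suppression is possible
assert ns_sigmoid.min() >= 0.5       # suppression is capped at 2x attenuation
```

This mirrors the paper's control experiment: removing the ability to reach near-zero gate values removes most of the benefit, supporting sparsity (not just non-linearity) as a load-bearing mechanism.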

Academic and Practical Value

1. Provides Clear Design Guidelines: Through rigorous ablation experiments, the paper gives the community a clear best practice: apply a head-specific, multiplicative Sigmoid gate after the SDPA output. Thanks to its simplicity and effectiveness, this recommendation has already been integrated into production models such as Qwen3-Next.


2. Deepens Understanding of Attention Mechanisms: The research reveals that Attention Sink and Massive Activation are not simply causally related (for example, the G2 gate can eliminate Massive Activation while Attention Sink persists). It emphasizes the importance of query-dependent, head-specific sparsity for forming healthy attention distributions.

3. Connects Commonalities Among Various Improvements: The paper points out that techniques such as adding RMSNorm and Sandwich Norm may stabilize models in ways partly similar to gating, by constraining or modulating activation values along the attention output path. This provides a unified perspective for understanding a family of training-stabilization techniques.

4. Opens New Directions for Long-Context Modeling: By demonstrating that eliminating Attention Sink benefits context-window extension, the work offers a new technical path for training and serving long-context models beyond merely adjusting positional encodings.

Simple Summary and Outlook on the Paper

It can be said that "Gated Attention for Large Language Models" is a model example of empirically driven research with clear mechanisms. Through an empirical, systematic re-examination of the classic "gate" component, it delivers both deep insight and practical value. Its contribution lies not only in an effective technique but also in using data as the yardstick to clarify the conditions under which, and the fundamental reasons why, an important mechanism works.

I believe this work also inspires us that while pursuing complex architectural innovations, fine-grained analysis and "minimally invasive surgical" enhancement of existing core components (such as attention) can also bring comprehensive breakthroughs in performance, stability, and scalability. Gated Attention, with its minimal computational overhead (latency increase <2%) and significant compound benefits, is expected to become one of the standard configurations for the next generation of large language model attention layers. Future research can further explore the relationship between gating and model scaling laws, and its role in more complex tasks such as multimodal and reasoning.

Additionally, we know that on Thanksgiving Day DeepSeek released V3.2, one of whose technical innovations, DSA (DeepSeek Sparse Attention), became a focus of discussion. Gated Attention, meanwhile, also centers part of its core optimization on "sparsity." So what are the differences, similarities, and respective advantages of Gated Attention and sparse attention in method and philosophy?


Therefore, regarding sparsity, I will additionally provide a comparison of "Gated Attention" and "DSA." Similarly, I regard "MoE" itself as a form of sparsification, just applied at a different structural scale and hierarchical dimension.

Before comparison, my core insight or thinking is: Although these two methods both focus on "attention," I believe their starting points and the "levels" at which they intervene are essentially different.

To better illustrate, here's an analogy:

Gated Attention is like a lean management consultant who acknowledges that the existing production line (standard Softmax attention) is effective but has some inherent defects (such as expressive power and sink phenomena). Therefore, instead of changing the backbone process of the production line, he adds an intelligent quality inspection and modulation process (Gate) at key nodes (such as after SDPA output), dynamically optimizing the quality, stability, and consistency of the final product by introducing non-linearity and sparsity.

DSA, by contrast, is like a technical architect. He believes the O(L²) full attention of the existing production line hits fundamental efficiency bottlenecks when handling ultra-long orders (long sequences). His goal is therefore to introduce an intelligent scheduling center (such as a fast indexer) that dynamically selects the most critical parts of the raw materials (historical context tokens) before they enter the core production line, completely restructuring the process and reducing complexity from O(L²) to linear or quasi-linear O(L×k), achieving order-of-magnitude efficiency gains.

Below, I will attempt to unfold the core differences between these two methods from several comparison dimensions:

Comparison from Underlying Method Principles

Gated Attention achieves dynamic network self-adaptation based on dense computation, with its core principle coming from re-processing the attention output after complete computation. It first executes standard Softmax attention that computes relationships between all token pairs, obtaining a dense context vector; then, it applies element-wise multiplicative modulation to this output using a Head-Specific Sigmoid Gate Vector generated by the current Query. This process can be vividly understood as "casting a wide net first, then precise filtering."

Its sparsity is reflected in soft sparsity at the feature/channel dimension: the gating coefficients are between 0 and 1, where features with values close to 0 are suppressed, while the attention computation itself remains dense and quadratic. The sparsity is reflected in the dynamic selective retention of output features.

DeepSeek Sparse Attention (DSA), on the other hand, is based on dynamic routing sparse computation methods, with its core principle being dynamic pruning before attention computation occurs. DSA uses a lightweight Lightning Indexer to quickly evaluate relevance scores between all historical tokens and the current query, then uses a Token Selector to select only the most relevant Top-K tokens (for example, selecting 2048 tokens from a 128K context). Subsequent complex attention computations (such as MLA, Multi-Head Latent Attention) are only performed on this small selected portion of tokens.

Its sparsity is reflected in hard sparsity at the token relationship dimension: by constructing a dynamic, binary attention mask, it directly avoids computing attention weights between most unimportant token pairs, fundamentally changing the entire computation graph.
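The "prune before computing" flow above can be sketched for a single query step as follows. This is my illustration of the general Top-K pattern, assuming a stand-in for the Lightning Indexer (here just precomputed `index_scores`), not DeepSeek's actual implementation.

```python
import numpy as np

def topk_sparse_attention(q, keys, values, index_scores, k):
    """DSA-style hard sparsity for one query step: a cheap indexer scores
    all history tokens, only the Top-K survive, and full softmax attention
    runs on that small subset (O(L*k) work instead of O(L^2))."""
    L, d = keys.shape
    k = min(k, L)
    keep = np.argsort(index_scores)[-k:]   # Top-K token selection (hard mask)
    ks, vs = keys[keep], values[keep]      # pruned KV set

    scores = ks @ q / np.sqrt(d)           # attention only over k tokens
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax over the kept subset
    return w @ vs

# Toy usage: 4096 history tokens, keep only 64
rng = np.random.default_rng(1)
L, d, k = 4096, 32, 64
q = rng.standard_normal(d)
keys, values = rng.standard_normal((L, d)), rng.standard_normal((L, d))
index_scores = keys @ q                    # hypothetical lightweight indexer
out = topk_sparse_attention(q, keys, values, index_scores, k)
assert out.shape == (d,)
```

The contrast with the gating sketch earlier is exactly the "level" difference discussed here: the hard mask changes which token pairs are computed at all, rather than reweighting a fully computed output.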

Comparison in Design Objectives and Achieved Effects

Although both can bring performance improvements, I believe their primary goals and main achievement areas are quite different.

Gated Attention aims to optimize and enhance the intrinsic expressive power and training dynamics of the standard Transformer architecture. Its primary goal is to improve the performance ceiling and training stability of the standard Transformer rather than directly reducing computational costs, i.e., solving two theoretical limitations in the standard attention mechanism:

1) The low-rank linear mapping formed by value projection and output projection;

2) The rigidity problem of attention score distribution caused by Softmax normalization (such as Attention Sink);

Key effects of Gated Attention:

<1> Performance Improvement: By introducing non-linearity between the W_V and W_O linear projections, it enhances the expressive power of the low-rank attention mapping, thereby broadly improving the model's performance across benchmarks.

<2> Elimination of Attention Sink: Query-dependent sparse gating effectively filters out context information irrelevant to the current token, significantly alleviating the Attention Sink phenomenon in which the first token of the sequence attracts excessive attention.

<3> Training Stability: By suppressing abnormally large activation values (Massive Activation), it permits larger learning rates and batch sizes during training and reduces loss spikes.

While DeepSeek Sparse Attention (DSA) focuses on breaking through the computational and memory bottlenecks of Transformer models when processing ultra-long sequences, its primary goal is to achieve efficient, low-cost training and inference for ultra-long contexts (such as 128K).

Key effects of DSA:

<1> Complexity Reduction: This is its most central contribution. By limiting computation to Top-K tokens, DSA reduces the core attention complexity from O(N²) to O(N·k), where k is a fixed small constant (such as 2048). The per-token computational cost thus stays almost constant even for extremely long sequences, achieving near-linear scalability.

<2> Cost-effectiveness: The efficiency improvement translates directly into substantial cost savings. The DSA-based DeepSeek-V3.2 model can deliver inference performance comparable to closed-source models such as GPT-5 at far lower cost, which has enabled significant API price reductions.

<3> Enabling longer chains of thought: DSA lets models consume massive "reasoning tokens" at an affordable cost. For example, DeepSeek's high-performance variant Speciale, when solving complex mathematical problems, actively generates extremely long chains of thought (about 77K output tokens per solution on average), trading more test-time compute for higher answer quality. I believe that for extremely complex mathematical proofs and scientific exploration tasks, DSA's hard sparsity, beyond filtering redundant historical context tokens, may also help models capture and learn more advanced, abstract global optima during training.
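The complexity claim in <1> can be sanity-checked with quick arithmetic. This back-of-envelope counts only pairwise attention-score computations, my simplification, and ignores the indexer's own (much smaller) cost:

```python
N, k = 128_000, 2048        # context length and Top-K budget from the DSA example
dense_scores = N * N        # O(N^2): every token scores against every token
sparse_scores = N * k       # O(N*k): each token scores against its Top-K subset

ratio = dense_scores / sparse_scores
print(ratio)                # 62.5, i.e. N / k
assert ratio == N / k == 62.5
```

So at 128K context with k = 2048, the core attention does 62.5× less score work, and the gap widens linearly as N grows while k stays fixed.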

To help everyone better grasp these two approaches intuitively, I attempt to organize their key comparison dimensions (Note: Some of these are my own immature subjective understandings and perceptual judgments, which may not be very logical. Please make your own clear judgments based on your understanding, don't be misled, and discussion is welcome):

Comparison of Core Ideas and Methods

Gated Attention: Internal enhancement and repair of an existing mature architecture; an incremental optimization approach that aims to uncover and fix under-utilized or defective parts of the standard attention mechanism.

DSA: Architectural reconstruction of the bottleneck; an innovation path outside the existing structure that redesigns the attention mechanism to address the fundamental challenge of quadratic complexity.

Comparison of the Nature of Sparsity

Gated Attention: Post-computation, soft, feature-level sparsity, occurring after the full attention computation as a refined, dynamic soft adjustment of its results.

DSA: Pre-computation, hard, token-level sparsity, occurring before the core attention computation as a fundamental simplification of the computation itself.

Comparison of Differentiated Design Intentions and Goals

Gated Attention: Pursues better model quality (lower perplexity, higher accuracy), more robust training, and healthier attention distributions.

DSA: Pursues higher computational efficiency (extremely low long-context costs), better scalability (supporting contexts of hundreds of thousands of tokens), and higher cost-effectiveness.

Comparison of Computational Complexity

Gated Attention: Maintains the standard O(N²) complexity; the added multiplicative gating introduces only a small computational overhead.

DSA: Reduces the core computation to O(N·k). Although the indexer itself has a computational cost, it is far lower than the dense attention computation it saves.

Comparison of Challenges and Innovation Dimensions Faced by Each

Gated Attention: The challenge lies in systematically verifying the effectiveness of a minimal modification to the Transformer's internal structure and attributing its effect to two interpretable mechanisms: the added "non-linearity" in the attention path and "query-dependent sparsity."

DSA/NSA: The challenge lies in designing a more thorough hard-sparsity solution that is usable throughout training and inference, well adapted in practice, and does not lose model capability. The innovation of its predecessor NSA lies precisely in "native trainability" and "full-stage acceleration."

How to Choose and Future Outlook

Through their comparison, I hope it can help everyone better judge their potential applicable scenarios and application methods in the future.

For example, when to consider applying Gated Attention?

When training or fine-tuning a standard Transformer architecture model, with main task sequence lengths within normal ranges (e.g., ≤32K), and your optimization goal is to further improve the model's performance, stability, and long-context generalization capabilities on various tasks, I think Gated Attention is a simple, low-risk, high-return enhancement plugin.

Of course, looking at moves by next-generation base models such as Qwen3-Next, large-scale application of Gated Attention in the pre-training phase may enable more thorough end-to-end optimization of fundamental, underlying generalization over language structure. And whether it is network reconstruction during pre-training or plugin-style application during post-training, the theoretical essence of sparsity in the attention layer, still largely a black box, remains a direction we need to keep exploring.

When to consider DSA class sparse attention?

When your core task is processing ultra-long documents (such as entire books, long codebases), conducting complex reasoning that requires extremely long chains of thought, or strictly controlling the cost of long-text interactions in large-scale services, I think DSA-type solutions are one of the basic architectures that existing base models should prioritize, i.e., positioned as key architectural improvement technology for scenarios where efficiency is the core requirement.

Possible directions for combining both?

First, intuitively, from specific technical implementation paths to final possible effects, I think the two are not mutually exclusive, and fusion solutions may appear in the future. For example: on an efficient backbone network based on sparse attention, introduce gating mechanisms in certain key layers to further optimize local expression and stability, achieving synergistic gains in efficiency and quality. Of course, for the sake of rigor, more ablation experiments and theoretical exploration of the underlying mechanisms of both types of sparsification would need to be considered.

In summary, Gated Attention and DSA represent two key directions in the current optimization of large model attention mechanisms: one focusing inward, dedicated to unlocking the "120%" potential of classic architectures; the other expanding outward, dedicated to breaking physical limits and pushing the model's field of view and thinking costs to new boundaries.

Although they do not attempt a grand, disruptive architecture or paradigm in the manner of Google or other research institutions (e.g., "Titans+MIRAS" & "Nested Learning"), this kind of step-by-step theoretical refinement and experimental validation also jointly drives the continuous evolution of large-model technology.

By Lu Ming

