Stop Clipping Aggressively! Qwen Proposes GatedNorm, Unifying the Perspective on Residual Flow Mysteries


MLNLP community is a well-known machine learning and natural language processing community in China and abroad, with an audience of NLP master's and doctoral students, university faculty, and industry researchers.

The community's vision is to promote exchange and progress among academia, industry, and enthusiasts in natural language processing and machine learning, especially for beginners.

Source | PaperWeekly

In the training process of Transformers, if you pay even a little attention to the distribution of weights or activation values, you will notice a strange phenomenon in the residual flow: regardless of the input token, the activations of certain fixed dimensions are always significantly larger than those of the rest.

Meanwhile, the first token in the Attention Map (usually <BOS>) often occupies an extremely high attention weight (Attention Sink).

In engineering practice, to address numerical stability or quantization overflow, common processing methods often attempt to clip or suppress them through regularization.

The latest paper released by the Alibaba Qwen team points out that these outliers are not a product of training instability, but a rescaling mechanism that the model spontaneously evolves under normalization constraints.

This work not only unifies the explanation of the sink phenomena commonly found in models such as DeepSeek-V3, Qwen, and GPT-OSS, but also proves from the mathematical foundation that forcibly removing these outliers is equivalent to destroying the model's feature adjustment capability.

Based on this, Qwen proposed a parameter-efficient architectural improvement, GatedNorm, which replaces unstable outliers with an explicit gating mechanism, effectively solving the precision problem of low-bit quantization at the architectural level.


Paper Title:

A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

Paper Link:

https://arxiv.org/pdf/2601.22966

Universal "Outliers"

The Qwen team conducted a cross-architectural comparative analysis of Qwen3 and GPT-OSS, and the results show that this anomaly is a common feature of Transformer.


Figure 1. Visualization of Attention Sink and Residual Sink for Qwen3 and GPT-OSS

As shown in the figure above:

Attention Sink: The first token absorbs the bulk of the attention mass, so the weights of all other tokens are relatively suppressed.

Residual Sink: In models such as Qwen3-235B, the activation values of specific dimensions (e.g., dimensions 1806, 1423) show continuous high values independent of input.

This phenomenon is particularly extreme in DeepSeek-V3.

As shown in the statistics below, the maximum activation value in its residual flow reaches an astonishing 264192.0, while the values of conventional dimensions are usually only in the magnitude of 10.


Figure 2. Statistics of Attention Sink and Residual Sink for DeepSeek-V3
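The residual-sink pattern is easy to probe numerically. Below is a toy sketch (not the paper's code; the data and the sink dimension indices 100 and 700 are invented for illustration) that flags dimensions whose peak magnitude across tokens dwarfs the typical scale:

```python
import numpy as np

# Synthetic hidden states: mostly N(0, 1), plus two simulated sink dimensions
rng = np.random.default_rng(0)
seq_len, d_model = 128, 1024
hidden = rng.normal(size=(seq_len, d_model))
hidden[:, 100] += 300.0   # simulated sink dimension
hidden[:, 700] -= 250.0   # another simulated sink dimension

peak = np.abs(hidden).max(axis=0)                     # max |activation| per dimension
sink_dims = np.where(peak > 20 * np.median(peak))[0]  # far above the typical scale
print(sink_dims)  # → [100 700]
```

The same per-dimension statistic, applied to real hidden states, is what makes plots like Figure 1 show fixed dark stripes independent of the input.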

In FP16/BF16 training, such values are still tolerable.

However, in INT4 or FP4 quantization scenarios, the huge dynamic range will force the quantization parameters to expand dramatically to accommodate the maximum value, causing the small values that carry core semantics to lose precision during quantization.
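To see why a single outlier wrecks low-bit quantization, consider a minimal symmetric INT4 simulation (a hypothetical setup, reusing the 264192.0 magnitude cited above as the outlier):

```python
import numpy as np

def quantize_int4(x):
    # Symmetric per-tensor INT4: the scale is set by the largest |value|
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(1)
normal = rng.normal(size=1000)              # "semantic" values, magnitude ~1
with_outlier = np.append(normal, 264192.0)  # the DeepSeek-V3 peak cited above

err_clean = np.abs(quantize_int4(normal) - normal).mean()
err_outlier = np.abs(quantize_int4(with_outlier)[:-1] - normal).mean()

# With the outlier present, the scale explodes and every normal value
# rounds to zero, destroying the information it carries
print(err_clean, err_outlier)
```

The outlier inflates the quantization scale by roughly five orders of magnitude, so the small values that carry the actual semantics are rounded away.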

Unified Perspective: Outlier-Driven Rescaling

Why does the model expend so much capacity to maintain these seemingly useless outliers? The Qwen team argues that this is adaptive behavior the model develops to counteract, or exploit, the characteristics of the normalization layer.

1. The Mathematical Essence of RMSNorm

Start from the definition of RMSNorm. In the paper's appendix, the authors give the formal expression of the normalization layer:

RMSNorm(x) = γ ⊙ x / RMS(x),  where RMS(x) = sqrt((1/d) · Σᵢ xᵢ² + ε)

When there is a huge outlier in the input vector, the norm on the denominator will be dominated by this value and increase significantly.

This actually constitutes a global scaling lever. The model only needs to push up the values of a few specific dimensions, and through the division property of RMSNorm, it can globally compress the amplitude of all other feature dimensions.

The paper further provides a rigorous proof: the upper bound on the feature norm of the LayerNorm output decreases monotonically as the outlier amplitude increases.

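This scaling lever is easy to verify numerically. A minimal sketch (unit-gain RMSNorm with the learned scale omitted, synthetic data) showing that pushing up a single dimension compresses all the others:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Unit-gain RMSNorm (learned scale γ omitted for clarity)
    return x / np.sqrt(np.mean(x**2) + eps)

rng = np.random.default_rng(2)
x = rng.normal(size=1024)

peaks = []
for outlier in [0.0, 100.0, 1000.0]:
    y = x.copy()
    y[0] = outlier                                # push up a single dimension
    peaks.append(np.abs(rmsnorm(y)[1:]).max())    # amplitude of all other dimensions

# The larger the outlier, the more the remaining features are compressed
print(peaks)  # strictly decreasing
```

One dimension thus acts as a global volume knob over the entire feature vector, which is exactly the monotone-decrease behavior the proof formalizes.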

2. Unified Perspective

Under this theoretical framework, Attention Sink and Residual Sink are essentially isomorphic:

Attention Sink: Utilizes the normalization characteristics of Softmax. By pushing up the logits of the first token (increasing the denominator), it suppresses the attention weights of other tokens, effectively filtering out uninformative content.

Residual Sink: Utilizes the normalization characteristics of RMSNorm. By pushing up the activation values of specific dimensions (denominator increases), it adjusts the contribution ratio of inter-layer residual connections.

The model is not making an error, but rather using the mathematical characteristics of the normalization layer to evolve an efficient global scaling strategy.
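The same lever can be demonstrated for Softmax in a few lines (toy logits, not taken from any real model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

logits = np.array([0.0, 1.0, 0.5, -0.3])  # toy attention logits
rest_mass = []
for sink in [0.0, 5.0, 10.0]:
    z = logits.copy()
    z[0] = sink                            # push up the first token's logit
    rest_mass.append(softmax(z)[1:].sum()) # total weight left for other tokens

# Raising one logit inflates the softmax denominator and
# suppresses every other token's attention weight
print(rest_mass)  # strictly decreasing
```

Structurally this is the same mechanism as the RMSNorm case: one privileged entry dominates a shared denominator and scales everything else down.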

Why Does the Clipping Strategy Fail?

Understanding this mechanism can explain why the common clipping strategy in engineering often leads to model collapse.

If we forcibly clip the outliers in the residual flow (e.g., clip to 1000), the denominator of RMSNorm will instantly become smaller, causing the originally compressed feature amplitude to expand abnormally.

This destroys the feature distribution that the model has already learned, leading to training divergence.
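A quick numerical sketch of this failure mode (synthetic data; the clipping threshold of 1000 mirrors the example above):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

rng = np.random.default_rng(3)
x = rng.normal(size=4096)
x[0] = 50000.0                          # a simulated residual-sink dimension

before = np.abs(rmsnorm(x)[1:]).max()
clipped = np.clip(x, -1000.0, 1000.0)   # "safe-looking" clipping of the outlier
after = np.abs(rmsnorm(clipped)[1:]).max()

# Shrinking the outlier shrinks the RMS denominator, so the previously
# compressed features expand far beyond the distribution the model learned
print(before, after)
```

In this toy setting the non-sink features grow by well over an order of magnitude after clipping, which is the distribution shift that sends training off the rails.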

The ablation experiments in the paper further confirm this: even removing the normalization layer significantly degrades model performance.

This shows that "rescaling" is not a side effect of the normalization layer, but a necessary condition for Transformer training stability.


Table 1. Data shows that removing Norm or aggressively clipping outliers (Row 12) both cause Loss to increase instead of decrease, proving that outliers are a necessary condition to maintain model performance.

This also sheds light on a long-standing debate in architecture design: why is SwiGLU usually better than GLU?


SwiGLU uses the Swish activation function, which has no upper bound on the positive half-axis, allowing the model to easily generate huge outliers to trigger rescaling. The standard GLU uses Sigmoid, with a value range limited to (0, 1), which restricts this adaptive scaling capability.
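The difference in the gate's range is easy to see numerically (a minimal comparison of the two activation functions only, not the full GLU/SwiGLU blocks):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)   # unbounded above: swish(x) ≈ x for large x

pre_act = np.array([1.0, 10.0, 100.0])
gate_sigmoid = sigmoid(pre_act)  # GLU's gate is capped at 1, so it can only attenuate
gate_swish = swish(pre_act)      # SwiGLU's gate grows with its input, enabling outliers
print(gate_sigmoid.max(), gate_swish.max())
```

Sigmoid saturates at 1 no matter how large the pre-activation gets, while Swish passes large values through almost unchanged, which is what lets SwiGLU manufacture the outliers needed for rescaling.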

Solution: GatedNorm

Since rescaling is a necessity, rather than letting the model rely on unstable outliers to achieve it, it is better to provide an explicit control path at the architectural level.

The Qwen team proposed GatedNorm. Its core idea is to introduce a learnable gating mechanism after RMSNorm.


In GatedNorm, the output of RMSNorm is modulated elementwise by a learned gate computed from two low-rank projection matrices that form a lightweight bottleneck structure (Rank = 16). The parameter increase is only about 2%, with extremely low computational overhead.
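A hypothetical sketch of this structure (the paper's exact formulation may differ; the class and variable names are invented, and the weights are random for illustration):

```python
import numpy as np

# Hypothetical sketch of the GatedNorm idea: RMSNorm output modulated by a
# sigmoid gate computed through a rank-16 bottleneck. Names and initialization
# are invented; the paper's exact formulation may differ.
class GatedNorm:
    def __init__(self, d_model, rank=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(scale=0.02, size=(d_model, rank))  # d_model -> rank
        self.w_up = rng.normal(scale=0.02, size=(rank, d_model))    # rank -> d_model

    def __call__(self, x, eps=1e-6):
        h = x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)  # RMSNorm
        gate = 1.0 / (1.0 + np.exp(-(x @ self.w_down @ self.w_up)))   # low-rank gate
        return h * gate  # explicit, learnable rescaling path

x = np.random.default_rng(4).normal(size=(8, 512))
y = GatedNorm(512)(x)
print(y.shape)  # → (8, 512)
```

The extra parameters amount to 2 · d_model · rank per layer, a small fraction of a full d_model × d_model matrix, which is where the roughly 2% overhead figure comes from.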

After introducing GatedNorm, the model has a legal scaling method and no longer needs to generate extreme outliers.

The heatmap comparison shows that in the GatedNorm model, the dark vertical stripes in the residual flow almost completely disappear, and the feature distribution returns to smoothness.


Figure 3. Comparison of residual flow heatmaps for Baseline, PreAffine, and GatedNorm

More notably, once GatedNorm supplies the scaling capability, GLU's performance surpasses that of SwiGLU.

As shown in the figure below, GLU + GA + GatedNorm achieves the lowest Loss and no longer produces severe fluctuations. This indicates that SwiGLU's previous advantage largely stems from its easier generation of outliers to assist scaling.


Figure 4. Loss and outlier comparison during training for SwiGLU and GLU

Key Application

For the industry, the greatest value of GatedNorm is clearing the obstacles for low-bit quantization.

Since GatedNorm eliminates massive activations at the root, the distribution of activation values becomes compact and free of long tails, greatly reducing the difficulty of quantization.

In aggressive FP4 (W4A4) testing:


Table 2. Performance comparison of 7B and 24B MoE models under FP4 quantization

PreAffine (the control group): on mathematical tasks such as MGSM, accuracy drops significantly (58.46 → 49.58), indicating that relying on parameters alone to absorb outliers cannot solve quantization loss.

GatedNorm: robust performance. MGSM drops by less than 2 points (55.47 → 53.70), and on code tasks it is even slightly above the pre-quantization Baseline.

This shows that models trained with GatedNorm are naturally suited to W4A4 inference and require no complex post-training quantization adjustments.

Conclusion

This study reveals a mechanism in the Transformer architecture that has long been overlooked: Attention Sink and Residual Sink are not design flaws, but functional features that emerge from the model under normalization constraints to achieve "feature rescaling".

The table below summarizes the core insights of the paper. Instead of trying to clip these outliers after training, it is better to provide an explicit scaling channel through GatedNorm during the design phase.


Table 3. Comparative summary of Attention Sink and Residual Sink under a unified perspective

For teams committed to training small-parameter models, optimizing MoE architectures, or having clear requirements for W4A4 inference efficiency, GatedNorm provides a theoretically complete and extremely easy-to-use architectural upgrade direction.

