Rotate Attention by 90 Degrees! Kimi's 'Attention Residuals' Takes Off

Readers familiar with deep learning will need no introduction to "residuals."

Since the birth of ResNet in 2015, this simple logic of "directly adding the input to the output" has dominated almost all neural network architectures.

But now, after ten years of service, the residual mechanism has been "upgraded." The replacement, fittingly, is the attention mechanism itself.


Even Jerry Tworek, known as the "Father of OpenAI's Reasoning Models," who led the o1/o3 series, the Codex programming model, and the STEM capability development of GPT-4, was deeply inspired by this paper. He believes we should rethink everything, as the era of "Deep Learning 2.0" is approaching.


This work, which subverts the traditional residual connection mechanism, comes from the Kimi team, which has released a major technical report: Attention Residuals. The method aims to replace standard deep recursion with an input-dependent attention mechanism over the outputs of preceding layers.


The Duality of Time and Depth

To understand what Attention Residuals is doing, we must first look at what went wrong with the traditional residual connection $y = x + f(x)$.

In the evolution of large models towards greater depth and strength, this additive mechanism of residuals has brought two side effects:

1. Information Dilution: Residual connections aggregate uniformly with fixed unit weights, so the relative contribution of shallow features decays linearly with depth as they are passed to deeper layers. This "information dilution" limits a deep network's ability to directly use low-level raw representations: by the time the first layer's signal reaches the hundredth layer, it has been diluted, layer by layer, by the ninety-nine layers in between.

2. Hidden State Explosion: To maintain signal strength in the constantly accumulating residual stream, deep modules often need to output activation values with larger magnitudes. This uncontrolled expansion of hidden states not only destroys numerical stability but also leads to uneven gradient distribution, increasing the difficulty of training convergence for ultra-large-scale models and directly causing training instability.
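Both side effects are easy to see numerically. The following toy sketch (NumPy; the sizes and the random "layer outputs" are illustrative assumptions, not the paper's setup) shows the residual stream's norm growing with depth and the fixed $1/(L+1)$ share left to the original embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 100                        # toy hidden size and depth (assumptions)

x = rng.standard_normal(d)            # layer-0 input (token embedding)
stream = x.copy()
norms = []
for _ in range(L):
    f_out = rng.standard_normal(d)    # stand-in for one layer's output f(.)
    stream = stream + f_out           # standard residual: y = x + f(x)
    norms.append(np.linalg.norm(stream))

# Hidden-state growth: with roughly uncorrelated layer outputs, the stream
# norm grows like sqrt(depth), so deep layers must "shout" to stay audible.
print(norms[0], norms[-1])

# Dilution: every source enters with fixed weight 1, so the embedding is
# one of L + 1 equally weighted terms; its relative share is 1 / (L + 1).
print(1 / (L + 1))                    # ~0.0099 at L = 100
```

The fixed unit weights are exactly what AttnRes makes learnable.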

The genius of this paper lies in discovering that a model's "depth" is actually another form of "time."


Yulun Du, one of the paper's authors, revealed the core idea: "Rotate Attention by 90 degrees."

Attention Residuals (AttnRes) was born from this: equipping every layer with an "intelligent filter." Each layer issues a Query to search for the most relevant features in all previous layers and aggregates them with allocated weights as needed.

Just as RNNs compress all prior information into a single state in the time dimension, residual connections also compress all prior information into a single state in the depth dimension. In the field of sequence modeling, Transformers surpassed RNNs by replacing recursion with an attention mechanism, allowing each position to selectively access all previous positions through data-dependent weights. The research team proposed the same method for "depth":

$$h_l = \sum_{k=1}^{l} \alpha_{l,k}\, y_k$$

Where $\alpha_{l,k}$ represents layer-specific attention weights, satisfying $\sum_{k=1}^{l} \alpha_{l,k} = 1$. Unlike sequence lengths that often reach millions of tokens, network depth is usually relatively shallow ($L < 1000$), making the $O(L^2)$ complexity attention mechanism in the depth direction computationally feasible.

Attention Residuals

Theoretical Reconstruction: Full Attention Residuals

Traditional residual connections (ResNet) are essentially deep recursion: like RNNs, they rigidly "compress" information from all past layers into a summation state.

Attention weights can be represented as an exponential kernel function with normalization, i.e., executing Softmax Attention in the depth dimension:

$$\alpha_{l,k} = \frac{\exp\!\left(q_l^{\top} k_k / \sqrt{d}\right)}{\sum_{j=1}^{l} \exp\!\left(q_l^{\top} k_j / \sqrt{d}\right)}$$
  • Core Innovation: Just as Transformers replaced RNN recurrence with attention and solved the forgetting problem in long sequences, AttnRes replaces residual accumulation along the depth dimension.

  • Mathematical Implementation: Instead of simply adding the previous layer, each layer now issues a learnable Query to match against the Keys produced by all preceding layers.

  • Softmax Weights: Through Softmax normalization, the model can "select" the few layers most useful to it. For example, the 50th layer can draw features directly from the 2nd layer with a weight as high as 0.8, without being diluted by the dozens of intervening layers.

Engineering Implementation: The Block AttnRes Strategy

While Full Attention Residuals (Full AttnRes) is the complete form of the idea, its $O(L^2)$ complexity causes memory and communication volume to explode in ultra-deep models. To keep the method practical, the research team designed a block structure.

Local Summation (Intra-Block): The model is divided into $N$ blocks. Within a block, layer outputs are still simply accumulated, collapsing them into a single "block representative":

$$r_n = \sum_{i=1}^{B} y_{n,i}$$

where $y_{n,i}$ is the output of the $i$-th layer in block $n$ and $B$ is the number of layers per block.

Global Scheduling (Inter-Block): When performing residual aggregation, each layer no longer looks at "every layer" but rather at "every block." For the $i$-th layer in the $n$-th block, its Value matrix is defined as:

$$V_{n,i} = \left[\, e;\; r_1;\; \dots;\; r_{n-1};\; s_{n,i} \,\right]$$

where $e$ is the token embedding, $r_1, \dots, r_{n-1}$ are the representatives of preceding blocks, and $s_{n,i}$ is the accumulated sum of layers so far within the current block.

Under this design, the network's first layer receives token embeddings; the first layer of each block receives all previous block representations plus token embeddings; subsequent layers within a block additionally focus on the accumulated results already produced within the current block. The final output layer aggregates all $N$ block representations.

  • Efficiency Miracle: Experiments found that even with models having hundreds of layers, dividing them into approximately $N \approx 8$ blocks can achieve the vast majority of performance gains.

  • Complexity Plummet: Memory overhead drops from growing with the number of layers $L$ to growing with the number of blocks $N$. This means you can obtain a "smarter" deep network at minimal cost (inference latency increase < 2%).
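Putting the two levels together, a compact sketch of the block strategy (shapes, names, and the omitted extra sources are illustrative assumptions, not the released code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, N, B = 64, 8, 6              # hidden size, number of blocks, layers per block

layer_outs = rng.standard_normal((N, B, d))      # toy per-layer outputs

# Intra-block: plain summation collapses each block to one representative.
block_reps = layer_outs.sum(axis=1)              # shape (N, d)

# Inter-block: a layer attends over the N block representatives instead of
# all L = N * B layers (the paper also includes the token embedding and the
# running intra-block sum as sources; omitted here for brevity).
Wq, Wk = rng.standard_normal((2, d, d)) * 0.1
q = layer_outs[-1, -1] @ Wq                      # query from the current layer
K = block_reps @ Wk
alpha = softmax(K @ q / np.sqrt(d))              # one weight per block
h = alpha @ block_reps                           # O(N) sources instead of O(L)
```

The attention itself is unchanged from the full variant; only the set of sources shrinks from $L$ layers to $N$ block representatives.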


Figure 1: Overview of Attention Residuals: (a) Standard Residuals: Traditional residual connection method using uniform additive accumulation. (b) Full Attention Residuals (Full AttnRes): Each layer selectively aggregates outputs from all previous layers through learned attention weights. (c) Block Attention Residuals (Block AttnRes): Layers are divided into several "blocks," reducing memory overhead from $O(Ld)$ to $O(Nd)$.

Results: A 1.25x "Computational Leverage"

According to the paper, the experimental architecture is completely consistent with Kimi Linear, which is a Mixture of Experts (MoE) Transformer following the Moonlight / DeepSeek-V3 design. The only modification is the addition of AttnRes in the residual connections; other components such as model depth, hidden dimensions, expert routing, and MLP structure remain unchanged.

The research team tested five model scales and trained three variants for each scale: a PreNorm baseline model, Full AttnRes, and Block AttnRes with approximately 8 blocks.


The figure below shows the fitted scaling curves.

[Figure: fitted scaling curves for the three variants]

The slopes of the three variants are similar, but AttnRes consistently achieves lower loss across the entire computational range. Based on the fitted curves, at a compute budget of 5.6 PFLOP/s-days, the loss for Block AttnRes is 1.692, while the baseline model is 1.714. This equates to a 1.25x Compute Advantage. As the model scale increases, the gap between the Full and Block variants narrows.
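As a back-of-the-envelope check of how a small loss gap becomes a compute multiplier (the power-law form and the exponent `b` below are illustrative assumptions, not values reported in the paper):

```python
loss_base, loss_attnres = 1.714, 1.692   # fitted losses at 5.6 PFLOP/s-days
b = 0.058                                # hypothetical shared scaling exponent

# Under L(C) = a * C**(-b) with a shared slope b, the baseline needs
# (loss_base / loss_attnres)**(1/b) times the compute to match AttnRes's loss.
multiplier = (loss_base / loss_attnres) ** (1 / b)
print(round(multiplier, 2))              # ~1.25 with this assumed exponent
```

The small exponent is why a loss gap of only 0.022 translates into a sizable compute advantage.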

The team's largest model is based on the Kimi Linear 48B configuration: 27 Transformer blocks (totaling 54 layers), activating 8 out of 256 routing experts plus 1 shared expert, with a total of 48B parameters and 3B activated parameters. This model uses Block AttnRes, with 6 layers per block, producing 9 blocks plus 1 token embedding, forming 10 sources in the depth direction.

[Figure: training dynamics over 1T tokens]

The figure above shows the dynamic changes during the model's training process on 1T tokens:

  • Validation Loss: AttnRes consistently maintains lower validation loss throughout the training process, with the gap widening further during the decay phase.

  • Output Magnitude: The baseline model suffers from the PreNorm dilution problem: hidden-state magnitudes grow monotonically with depth, so deep layers must learn ever-larger outputs from fixed-scale normalized inputs to retain influence. Block AttnRes instead confines this growth within each block, resetting the accumulation through selective aggregation at block boundaries and producing a bounded, periodic pattern.

  • Gradient Magnitude: In the baseline model where all residual weights are fixed at 1, the gradient flow distribution across depth is extremely uneven, causing excessively large gradients in early layers. The learnable Softmax weights of Block AttnRes introduce competition between sources, thereby achieving a significantly more uniform gradient distribution.

[Table: downstream benchmark results]

Downstream Performance: As shown in the table above, Block AttnRes meets or exceeds the baseline model in all evaluation tasks.

  • Significantly Improved Tasks: Improvements are particularly prominent in multi-step reasoning tasks, such as GPQA-Diamond (+7.5), Minerva Math (+3.6), and code generation HumanEval (+3.1).

  • Knowledge-based Tasks: MMLU (+1.1) and TriviaQA (+1.9) also demonstrated robust improvements.

Data provides the most powerful proof:

  • Computational Efficiency: To achieve the same performance, AttnRes saves approximately 20% of computational cost compared to traditional residuals (1.25x advantage).

  • Logical Reasoning: Significant improvements on hard tasks like mathematics and code. For example, in the extremely difficult GPQA-Diamond test, performance improved by 7.5 points.

  • Stability: Successfully suppressed the numerical explosion of hidden states, allowing deep networks to remain "calm" and "efficient."

Summary: Rethink & Imagine

Viewed from a higher-dimensional perspective, infrastructure research shows that time and space are interconnected.

The idea of "rotating attention by 90 degrees" in this paper seems to have brought some inspiration and reflection to Karpathy.


The residual stream of ResNet transmits information across depth; the weight trajectory of SGD (Stochastic Gradient Descent) transmits information across time.

The research team felt that ResNet's plain addition was too simplistic, so they proposed using attention to filter the output of every past layer. If an SGD trajectory is itself a residual stream over time, and "Attention is All You Need," why not add attention to the optimizer as well?
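The analogy can be written out side by side (a sketch of the intuition, not equations from the paper); unrolled, both mechanisms are plain sums of per-step contributions:

$$x_{l+1} = x_l + f_l(x_l) \;\Longrightarrow\; x_L = x_0 + \sum_{l=0}^{L-1} f_l(x_l) \qquad \text{(depth)}$$

$$w_{t+1} = w_t - \eta\, g_t \;\Longrightarrow\; w_T = w_0 - \eta \sum_{t=0}^{T-1} g_t \qquad \text{(time)}$$

In both cases every term enters with a fixed weight; attention over the terms is what AttnRes adds in depth, and what the closing question imagines for the optimizer.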

The vitality of architecture often comes from reflecting on inertia.

When we look back and examine those foundational infrastructures, perhaps we can discover more clever combinations leading to the future amidst the dust of the past.

For more information, please refer to the original paper.

© THE END

Please contact this public account for authorization to reprint.

Submission or media inquiries: liyazhou@jiqizhixin.com

