On the last day of 2025, DeepSeek released a heavyweight paper co-authored by Wenfeng Liang.
The paper proposes the mHC (Manifold-Constrained Hyper-Connections) architecture, which successfully solves the training instability caused by expanding residual width by projecting the residual space of hyper-connections onto a doubly stochastic matrix manifold.
Combined with engineering optimizations such as kernel fusion and communication overlap, it delivers simultaneous gains in model performance and scale at only 6.7% additional training overhead.
The mHC architecture is effective for large-scale training, providing tangible performance improvements and excellent scalability. This deepens the understanding of topological architecture design and points to promising directions for the development of foundation models.
Numerical Storms and System Bottlenecks Caused by Ultra-Wide Residual Flows
The rapid development of deep neural networks over the past decade is largely attributed to the concise and profound design of Residual Connections.
From ResNet to the Transformer architecture that now dominates large language models, Identity Mapping has always been the anchor maintaining the stability of signal propagation in deep networks.
It ensures that signals do not over-attenuate or over-amplify with increasing depth during forward propagation, while also guaranteeing the smooth flow of gradients during backward propagation.
Recently emerging Hyper-Connections (HC) technology attempts to break the limitations of traditional residual connections.
Traditional residual flow width is usually consistent with the dimension of the layer input, limiting information carrying capacity.
HC introduces an expansion factor n, expanding the width of the residual flow to n times the input, building a broader information highway.
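As a rough illustration only (the shapes and names below are hypothetical, not the paper's actual formulation), one hyper-connection step over an n-wide residual flow can be sketched as: the n streams are mixed into a single layer input, the layer's output is scattered back across the streams, and a learnable n×n matrix mixes the streams themselves.

```python
import numpy as np

def hc_residual_step(H, layer_fn, w_in, w_out, M_res):
    """One hyper-connection step over an n-wide residual flow (illustrative sketch).

    H       : (n, d) residual flow, n streams of a d-dim hidden state
    layer_fn: the layer body (e.g. attention or MLP), maps (d,) -> (d,)
    w_in    : (n,)   weights mixing the n streams into the layer input
    w_out   : (n,)   weights scattering the layer output back to the streams
    M_res   : (n, n) learnable matrix mixing the residual streams
    """
    x = w_in @ H                              # (d,) layer input: weighted mix of streams
    y = layer_fn(x)                           # (d,) layer output
    return M_res @ H + np.outer(w_out, y)     # (n, d) updated residual flow

rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.normal(size=(n, d))
out = hc_residual_step(H, np.tanh, rng.normal(size=n), rng.normal(size=n), np.eye(n))
print(out.shape)  # (4, 8): the residual flow stays n times wider than the hidden state
```

With M_res fixed to the identity and n=1 this collapses back to an ordinary residual connection; the extra capacity comes entirely from the widened flow and the learnable mixing.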
This design significantly improves model performance by increasing the complexity of the network topology, without a significant increase in compute (FLOPs).
However, this seemingly perfect expansion scheme encountered severe challenges in practical large-scale training.
As network layers are stacked, the identity-mapping property that originally served as a stabilizing anchor is completely destroyed.
In standard residuals, multi-layer transmission can be viewed as the accumulation of multiple transformations, whereas in HC, signal transmission between layers becomes the multiplication of multiple matrices. Since the original HC places no constraints on the multiplying matrices, the composite mapping resulting from these matrix multiplications rapidly deviates from the identity transformation.
Experimental data shows that in 27B parameter model training, the HC scheme experienced severe loss divergence around 12k steps, with gradient norms fluctuating violently.
A more intuitive metric is the Amax Gain Magnitude, which is the amplification factor of the signal after multi-layer transmission.
In HC, this value surges to over 3000 in both forward and backward propagation, indicating that the signal explodes severely in the deep layers of the network, completely destroying training stability.
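The explosion is easy to reproduce in miniature. The sketch below (illustrative numbers, not the paper's setup) composes 64 unconstrained mixing matrices that each deviate only slightly from the identity, then measures the largest entry of the composite map as a stand-in for the Amax gain:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 64

# Unconstrained residual-mixing matrices, as in the original HC: each layer
# multiplies the residual flow by a learnable n x n matrix. Small positive
# deviations from the identity compound multiplicatively with depth.
gain = np.eye(n)
for _ in range(depth):
    A = np.eye(n) + rng.uniform(0.0, 0.2, size=(n, n))
    gain = A @ gain

amax_gain = np.max(np.abs(gain))
print(f"Amax gain after {depth} layers: {amax_gain:.3g}")  # grows far beyond 1
```

Because signal transmission is a product of matrices rather than a sum of perturbations, even small unconstrained deviations per layer compound geometrically with depth.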
Besides numerical instability, HC also runs into a formidable memory wall.
The bottleneck of modern deep learning hardware often lies not in compute, but in memory bandwidth (IO). The n-wide residual flow introduced by HC multiplies the memory read/write volume per token per layer by roughly a factor of n.
This huge IO overhead leads to a severe drop in training throughput.
In addition, because the mapping matrices are produced by linear layers with learnable parameters, a large number of intermediate activations must be saved for backpropagation. This not only occupies precious GPU memory but also forces developers to resort to gradient checkpointing, further increasing the computational burden.
In pipeline parallelism involving cross-node communication, wider residual flows also directly lead to communication data volume multiplying by n, greatly increasing communication bubble time.
Reshaping the Identity Mapping Mechanism Using Doubly Stochastic Matrix Manifolds
Facing the stability challenges brought by HC, DeepSeek did not choose to retreat to simple identity mapping, but instead proposed a more sophisticated mathematical solution: mHC.
The core idea of mHC is to project the learnable mapping matrices in the residual flow onto a specific geometric manifold, allowing them to maintain signal propagation stability like an identity mapping while permitting information exchange between different residual flows like the original HC.
The specific manifold chosen by DeepSeek is the set of Doubly Stochastic Matrices, geometrically also known as the Birkhoff Polytope.
A matrix is called a doubly stochastic matrix if it satisfies three conditions: all elements are non-negative, the sum of each row is 1, and the sum of each column is 1. Constraining the matrix to be doubly stochastic brings a series of extremely superior mathematical properties.
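The three conditions are straightforward to check numerically; a minimal NumPy sketch:

```python
import numpy as np

def is_doubly_stochastic(M, tol=1e-6):
    """Check the three defining conditions of a doubly stochastic matrix."""
    return (
        np.all(M >= -tol)                               # all entries non-negative
        and np.allclose(M.sum(axis=1), 1.0, atol=tol)   # each row sums to 1
        and np.allclose(M.sum(axis=0), 1.0, atol=tol)   # each column sums to 1
    )

# A convex combination of permutation matrices is doubly stochastic.
P1 = np.eye(3)
P2 = np.roll(np.eye(3), 1, axis=0)   # a cyclic permutation matrix
M = 0.7 * P1 + 0.3 * P2
print(is_doubly_stochastic(M))          # True
print(np.linalg.norm(M, 2) <= 1 + 1e-9) # spectral norm never exceeds 1
```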
First is the norm-preserving property. The spectral norm (largest singular value) of a doubly stochastic matrix never exceeds 1, which makes the mapping non-expansive: after the signal passes through it, its energy cannot be amplified without bound, fundamentally eliminating the risk of gradient explosion.
Second is composite closure. The product of doubly stochastic matrices remains doubly stochastic. This ensures that no matter how many layers the network stacks, the composite mapping from shallow to deep layers remains within the manifold of doubly stochastic matrices, and stability continues throughout the full network depth.
From a geometric perspective, doubly stochastic matrices can be viewed as convex combinations of Permutation Matrices. The mean of features is strictly conserved. This is a very benign signal propagation mechanism. It allows information to visit and fuse between different residual flows, while also limiting the runaway of total signal strength like the law of energy conservation.
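Both properties, closure under products and mean conservation, are easy to verify numerically. The sketch below samples doubly stochastic matrices by convexly combining random permutations (via the Birkhoff characterization just mentioned; the sampling scheme itself is only illustrative):

```python
import numpy as np

def random_doubly_stochastic(n, rng, k=4):
    """Sample a doubly stochastic matrix as a convex combination of
    k random permutation matrices (Birkhoff polytope vertices)."""
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(k)]
    weights = rng.dirichlet(np.ones(k))          # non-negative, sums to 1
    return sum(w * P for w, P in zip(weights, perms))

rng = np.random.default_rng(0)
A = random_doubly_stochastic(4, rng)
B = random_doubly_stochastic(4, rng)

# Closure: the product of doubly stochastic matrices is doubly stochastic.
C = A @ B
print(np.allclose(C.sum(axis=1), 1.0), np.allclose(C.sum(axis=0), 1.0))  # True True

# Mean conservation: column sums of 1 preserve the mean across streams exactly.
x = rng.normal(size=(4, 8))   # 4 residual streams, 8 features each
print(np.allclose((A @ x).mean(axis=0), x.mean(axis=0)))  # True
```

The closure check is what guarantees, layer after layer, that the composite shallow-to-deep mapping never leaves the manifold.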
When the expansion factor n=1, the doubly stochastic condition degenerates to the scalar 1, and mHC naturally falls back to the classic identity mapping. In this sense, mHC is a strict generalization of residual connections.
To enforce this constraint in actual computation, mHC applies the Sinkhorn-Knopp algorithm, which alternately normalizes the rows and columns of a positive matrix until it becomes (approximately) doubly stochastic.
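A minimal Sinkhorn-Knopp sketch is below. The exponential parameterization is a common choice and an assumption here, not necessarily the paper's exact scheme: exponentiation makes all entries positive, after which alternating row and column normalization converges to a doubly stochastic matrix.

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=100):
    """Project an unconstrained matrix toward a doubly stochastic one.

    Exponentiation yields strictly positive entries; alternating row and
    column normalization then converges to the doubly stochastic manifold.
    """
    M = np.exp(logits - logits.max())           # positive, numerically stable
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)       # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)       # columns sum to 1
    return M

rng = np.random.default_rng(0)
M = sinkhorn_knopp(rng.normal(size=(4, 4)))
print(np.allclose(M.sum(axis=1), 1.0, atol=1e-4))  # True
print(np.allclose(M.sum(axis=0), 1.0, atol=1e-4))  # True
```

Because only a finite number of normalization sweeps run per training step, the projection is approximate, which is exactly the small residual error the paper's gain measurements reflect.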
Through mathematical reconstruction, mHC successfully tamed the wild hyper-connections.
Experimental results show that after adopting mHC, the signal gain magnitude, originally as high as 3000, was suppressed to around 1.6, a reduction of three orders of magnitude; the small residual deviation is due only to running Sinkhorn-Knopp for a finite number of iterations.
This difference in magnitude directly translates to a smooth training curve. The model no longer experiences sudden jumps in loss values, and gradient updates become stable and orderly.
Hardware-Software Co-optimization to Break Through Memory Bandwidth and Communication Limits
Mathematical elegance often requires strong engineering support to translate into actual performance.
The n-wide residual flow and Sinkhorn-Knopp iterative calculations introduced by mHC would bring unacceptable memory and time overhead if implemented directly in traditional frameworks.
For this reason, the DeepSeek team conducted a series of deep infrastructure optimizations, from kernel fusion to communication scheduling, turning the impossible into efficiency.
Addressing the memory wall problem, the core strategy is extreme kernel fusion.
In standard PyTorch implementations, operations like RMSNorm, matrix multiplication, and activation functions are executed step-by-step. Each step requires moving data from memory to the chip and back.
This mode is fatal for IO-intensive operations like mHC. DeepSeek utilized the TileLang programming model to develop customized mixed-precision kernels.
Optimizing memory occupancy relies on a fine-grained recomputation strategy. Because the n-wide residual flow produces huge intermediate activations, GPU memory would be exhausted immediately if all of them were saved for backpropagation.
mHC instead trades computation for memory, recomputing selected activations during the backward pass, striking a balance between memory occupancy and computation time. This keeps memory consumption under control during large-model training without sacrificing batch size.
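The compute-for-memory trade can be shown with a toy example (pure NumPy, entirely hypothetical, not the paper's kernels): the forward pass keeps only the block input, and the backward pass recomputes the block's intermediate activation on the fly instead of having stored it.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def block_forward(x):
    """A toy block: linear layer followed by ReLU. The intermediate z is
    deliberately NOT stored, saving activation memory."""
    z = W @ x
    return np.maximum(z, 0.0)

def checkpointed_backward(x, grad_out):
    """Backward pass that recomputes z from the saved input x, trading
    an extra matmul (FLOPs) for not keeping z resident in memory."""
    z = W @ x                      # recomputation
    grad_z = grad_out * (z > 0)    # ReLU gradient mask
    return W.T @ grad_z            # gradient w.r.t. the block input

x = rng.normal(size=8)
y = block_forward(x)                             # only x is kept; z is discarded
grad_x = checkpointed_backward(x, np.ones(8))    # z is rebuilt when needed
print(grad_x.shape)  # (8,)
```

The recomputed gradient is identical to what a stored-activation backward pass would produce; the only cost is repeating the forward matmul inside the backward pass.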
Under large-scale distributed training scenarios, Pipeline Parallelism is essential.
However, the n-fold cross-node communication volume caused by mHC is a huge bottleneck.
To solve this problem, DeepSeek extended the DualPipe scheduling strategy.
DualPipe was originally used to overlap computation and communication, but in the mHC scenario the traditional overlap strategy broke down because the communication time was too long.
The new scheduling scheme divides the computational flow into normal priority and high priority. To avoid blocking the communication flow, kernels responsible for processing MLP layers (feed-forward networks) are given high priority, and the use of persistent kernels with excessively long runtimes in attention layers is avoided.
This design allows computational tasks to be flexibly preempted, ensuring that communication and computation can be perfectly staggered on the timeline. Even at the boundaries of pipeline stages, efficient masking is achieved by decoupling the dependencies of recomputation and communication.
This series of hardware-software co-optimization has yielded significant results.
In the actual training of the 27B parameter model, compared to the baseline model, the mHC with expansion factor n=4 increased training time overhead by only 6.7%. Considering the performance improvement brought by mHC, this tiny additional cost is almost negligible.
This proves that through deep system-level optimization, complex mathematical structures can run efficiently on existing hardware.
Practical Verification from 3B to 27B and Scalability Analysis
DeepSeek conducted comprehensive verification of mHC on models of different scales. The model architecture is based on Mixture of Experts (MoE), covering parameter scales of 3B, 9B, and 27B, with the expansion factor n uniformly set to 4.
The experiment not only focused on final performance metrics but also deeply investigated training process stability and scaling laws that vary with Compute and Token volume.
In the core 27B model comparison experiment, mHC demonstrated overwhelming stability advantages. Whereas HC frequently suffered loss oscillations and violent gradient fluctuations during training, mHC's training curve remained smooth throughout: loss decreased steadily, and the final validation loss was 0.021 lower than the baseline's.
This is a very considerable gap in the pre-training field, usually indicating a significant improvement in the model's performance on downstream tasks.
Monitoring curves of gradient norms also confirmed this. The gradient behavior of mHC was almost consistent with the most stable standard residual network, completely eliminating the dramatic EKG-like fluctuations of HC.
Evaluation results on downstream tasks further confirmed the effectiveness of mHC.
Across 8 mainstream benchmarks including BBH, DROP, GSM8K, and MATH, mHC comprehensively surpassed the baseline model and outperformed the original HC on most tasks.
Especially on BBH and DROP tasks requiring complex reasoning capabilities, mHC achieved significant improvements of 2.1% and 2.3% respectively.
This indicates that mHC not only repairs training instability, but its feature mixing mechanism introduced through manifold constraints actually enhances the model's ability to process complex information flows and conduct deep reasoning.
Scalability experiments provided broader support for the application prospects of mHC.
In the Compute Scaling Curve, researchers plotted the performance improvement magnitude of mHC relative to the baseline under different computational budgets (corresponding to 3B, 9B, 27B models).
The results show that as model scale and compute grow, the performance dividend brought by mHC does not decay, but remains in a stable positive range.
This means that mHC is a technology with good scalability and will not fail as the model becomes larger.
Meanwhile, in the 3B model's token scaling curve, mHC maintains a consistent performance lead over the baseline as the volume of training data increases.
DeepSeek's research has opened a new perspective on macro architecture design.
By introducing geometric manifold constraints, neural networks can significantly increase the complexity of the topology structure while maintaining good mathematical properties.
mHC proves that with correct mathematical constraints and extreme engineering optimization, we can completely break through the performance ceiling of existing architectures without significantly increasing computation and time costs.
This provides a path that is both robust and efficient for the evolution of future trillion-parameter model architectures.
References:
https://arxiv.org/abs/2512.24880