Hello fellow rollers, I'm rumor.

In the early years of the Transformer's rise, an idea was proposed: why not spend more computation simply by cycling activations multiple times through the same set of layers, without adding any parameters?

This concept is known as the looped transformer (or Recurrent Depth Model, RDM). Theoretically, it's elegant: inference can elastically adjust the number of loops, memory usage remains fixed, and it natively supports test-time scaling.

In practice, however, training such models has been notoriously unstable—plagued by residual explosions, loss spikes, and extreme sensitivity to hyperparameter choices.

This paper from UCSD and Together AI, titled Parcae: Scaling Laws For Stable Looped Language Models^[1], finally clarifies the source of this instability from a control theory perspective and proposes a stable version called Parcae.

Where Does the Instability Come From?

The authors of Parcae adopt a particularly clever perspective: viewing each update step of the looped Transformer as a dynamical system within control theory.

Diagram illustrating the dynamical system view of looped transformers

In each loop, the model's hidden state h undergoes a Transformer layer calculation to become a new h. We can decompose this process into three parts:

The hidden state from the previous round, h_t, is transformed by a matrix and passed to the next round;
The initial input embedding e is continuously injected into every round of calculation to keep the model on track;
The remainder consists of non-linear calculations within the Transformer, such as attention and MLP.

Written as a formula, this process looks like the following:

h_{t+1} = A h_t + B e + f(h_t, e)

Where:

h_t is the hidden state at the t-th loop;
e is the input embedding output by the pre-module P;
A is the state transition matrix, controlling the propagation of the previous hidden state;
B is the input injection matrix, controlling the influence of input e on the current state;
f represents the non-linear part of the Transformer module (Attention + MLP).
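The three-part update described above can be sketched in a few lines of NumPy. This is only a toy illustration: the function name `looped_forward`, the choice A = 0.5·I and B = I, and the `tanh` stand-in for the attention + MLP block are all assumptions made for the example, not the paper's implementation.

```python
import numpy as np

def looped_forward(h0, e, A, B, f, num_loops):
    """Iterate h_{t+1} = A @ h_t + B @ e + f(h_t, e) and return every state."""
    states = [h0]
    h = h0
    for _ in range(num_loops):
        # carry-over of the old state + input injection + non-linear part
        h = A @ h + B @ e + f(h, e)
        states.append(h)
    return states

rng = np.random.default_rng(0)
d = 8
A = 0.5 * np.eye(d)                      # hypothetical state transition matrix
B = np.eye(d)                            # hypothetical input injection matrix
W = rng.normal(scale=0.1, size=(d, d))   # toy weights for the stand-in block

def toy_f(h, e):
    # Bounded stand-in for attention + MLP (an assumption for this sketch).
    return np.tanh(W @ (h + e))

e = rng.normal(size=d)                   # input embedding, re-injected every loop
states = looped_forward(np.zeros(d), e, A, B, toy_f, num_loops=8)
norms = [float(np.linalg.norm(s)) for s in states]
print(norms)
```

The same weights are reused in every iteration, so changing `num_loops` changes the amount of computation without changing the parameter count.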

The most critical component here is the matrix A that controls the propagation of the previous state. To use the most vivid analogy: this looping process is like rolling a snowball.

The hidden state h is the snowball in your hand;
Each loop is the snowball rolling once in the snow;
The matrix A is the amplification factor for each roll of the snowball.

What happens if the amplification factor of A is greater than 1? With every roll, the snowball gets slightly bigger. The first roll takes it from fist-sized to bowl-sized; the second to washbasin-sized. After a dozen rolls, it becomes as large as a small hill, and finally—it "explodes." Numerical overflow occurs, and the model diverges.

So, how do we make this snowball roll stably without exploding?

Classic control theory provided the answer long ago: For such a recurrent linear system to remain stable, the spectral radius of matrix A must satisfy ρ(A) < 1.

Here, the spectral radius is the largest absolute value among the matrix's eigenvalues. As long as this value is less than 1, repeated rolls shrink the snowball instead of growing it. It will never grow indefinitely, thus preventing an explosion.
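As a small illustration of the definition, the sketch below computes the spectral radius with NumPy for a made-up 2×2 matrix (a rotation scaled by 0.9). Its eigenvalues are complex, which is why the absolute value matters. The helper name `spectral_radius` and the example matrix are assumptions for this demo.

```python
import numpy as np

def spectral_radius(A):
    """Largest absolute value among the eigenvalues of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

# A rotation scaled by 0.9: eigenvalues are 0.9 * exp(+/- i * theta).
theta = 0.3
R = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
rho = spectral_radius(R)

# Repeated multiplication shrinks the state because rho < 1.
h = np.array([1.0, 0.0])
for _ in range(50):
    h = R @ h
final_norm = float(np.linalg.norm(h))
print(rho, final_norm)
```

Because a rotation preserves length, the norm after 50 steps here is 0.9 raised to the 50th power, which is already very small.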

Looking back at previous looped architectures, the problem immediately exposes itself:

Models using addition for input injection set A directly as the identity matrix I. The spectral radius equals 1, resulting in "marginal stability"—the slightest disturbance can cause an explosion.
Models using concatenation-projection for input injection leave A completely unconstrained. During training, the model easily learns a matrix with a spectral radius greater than 1, leading directly to divergence.
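To make the three regimes concrete, here is a minimal sketch that keeps only the linear part of the update (the non-linear term is dropped, which is a simplification) and uses A = scale·I with a fixed unit-norm input. The function name `final_norm` and the chosen scales are illustrative assumptions.

```python
import numpy as np

def final_norm(scale, num_loops, d=4):
    """Norm of h after iterating h <- (scale * I) @ h + e, starting from h = 0."""
    A = scale * np.eye(d)
    e = np.ones(d) / np.sqrt(d)   # fixed unit-norm input, injected every loop
    h = np.zeros(d)
    for _ in range(num_loops):
        h = A @ h + e
    return float(np.linalg.norm(h))

for scale in (0.9, 1.0, 1.1):
    print(scale, [round(final_norm(scale, T), 2) for T in (8, 32, 128)])
```

With scale 0.9 the norm levels off below 10. With scale 1.0 (the identity case) nothing is ever forgotten, so the norm grows in step with the loop count. With scale 1.1 it grows geometrically.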

The authors empirically verified this conclusion: All models that diverged during training learned a matrix A with a spectral radius ≥ 1; conversely, models that converged stably maintained a spectral radius strictly less than 1.

Thus, the stability puzzle that has plagued looped architectures for so long is finally solved.

Parcae's Targeted Solution

With the root cause identified, the solution follows logically: since instability arises from the uncontrolled spectral radius of A, we must strictly constrain A so that its spectral radius always remains less than 1.

This is the core design of the Parcae architecture proposed in the paper. There are no flashy tricks; every step hits the pain point directly. Moreover, it adds only a negligible amount of extra parameters, truly achieving "scaling without adding parameters."

1. Shackling the State Transition Matrix to Guarantee Stability from the Root

The authors designed a special parameterization for matrix A: first, construct a continuous-time matrix as a negative diagonal matrix, then convert it to the discrete loop's A using the standard Zero-Order Hold (ZOH) method from control theory. Writing A_c for the continuous-time matrix and Δ > 0 for the step size, the ZOH state matrix is:

A = exp(Δ · A_c)

The brilliance of this design lies in the fact that all eigenvalues of a negative diagonal matrix are negative. After ZOH discretization, each diagonal entry of A is the exponential of a negative number and therefore lies strictly between 0 and 1, so the resulting A inevitably has a spectral radius less than 1. This mathematically guarantees system stability, ensuring the snowball never grows uncontrollably.
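A minimal numerical sketch of this construction follows. It assumes one common way to force negativity, a_c = -exp(raw) for an unconstrained parameter vector `raw`, and a fixed step size `delta`. Both choices, and the function name `zoh_discretize_diag`, are assumptions for illustration and may differ from the paper's exact parameterization.

```python
import numpy as np

def zoh_discretize_diag(raw, delta):
    """ZOH state matrix for a negative diagonal continuous-time matrix.

    a_c = -exp(raw) is negative for any real raw value (assumed parameterization),
    and the discrete matrix is A = exp(delta * a_c) on the diagonal.
    """
    a_c = -np.exp(raw)
    return np.diag(np.exp(delta * a_c))

rng = np.random.default_rng(0)
raw = rng.normal(scale=3.0, size=16)   # unconstrained "learned" parameters
A = zoh_discretize_diag(raw, delta=0.1)

# For a diagonal matrix the eigenvalues are the diagonal entries.
rho = float(np.max(np.abs(np.diag(A))))
print(rho)
```

Whatever values the optimizer pushes `raw` to, the diagonal entries stay between 0 and 1, so the constraint is enforced by construction rather than by a penalty term.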

2. Normalizing the Input to Eliminate Late-Stage Loss Spikes

Constraining A alone isn't enough. The authors found that large models still occasionally exhibit loss spikes in later training stages. Investigation revealed the issue lies with the input injection e—if the values of input e are too large, they can also cause the hidden state to explode suddenly.

The solution is simple: add a normalization layer before the input e enters the loop to strictly control its value range. This single, small change completely flattens the loss spikes in late-stage training.
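The idea can be sketched as follows. The text does not say which normalization layer is used, so this example assumes a layer-norm-style normalization over the feature axis with no learned scale or shift. The function name `normalize_input` and the test data are made up.

```python
import numpy as np

def normalize_input(e, eps=1e-6):
    """Layer-norm-style normalization over the last axis (assumed variant)."""
    mean = e.mean(axis=-1, keepdims=True)
    var = e.var(axis=-1, keepdims=True)
    return (e - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
e = rng.normal(size=(4, 64))   # 4 positions, 64 features
e[2] *= 1e4                    # one position with abnormally large values

e_norm = normalize_input(e)
row_norms = np.linalg.norm(e_norm, axis=-1)
print(np.linalg.norm(e, axis=-1))  # one row is far larger than the others
print(row_norms)                   # all rows end up on the same scale
```

After normalization every row has the same magnitude, so an outlier in e can no longer inject an arbitrarily large term into each loop.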

3. Optimizing Training Sampling for Smoother Learning

Previous looped models sampled loop counts by batch during training—all sentences in the same batch used the same number of loops. This caused the model to inaccurately estimate the expectation of loop counts, leading to significant loss fluctuations during training.

Parcae switches to Per-Sequence Depth Sampling: within the same batch, each sentence independently samples its own loop count. Consequently, the model learns the distribution of different loop counts more accurately. Training becomes smoother, loss fluctuations vanish, and generalization across different loop counts improves significantly.
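The difference between the two sampling schemes can be sketched as below. The sampling distribution is not specified in the text, so a Poisson distribution clipped to at least 1 is assumed here purely for illustration. Function names and batch settings are also hypothetical.

```python
import numpy as np

def sample_loops_per_batch(rng, batch_size, mean_loops):
    """One draw shared by every sequence in the batch."""
    t = max(1, int(rng.poisson(mean_loops)))
    return np.full(batch_size, t)

def sample_loops_per_sequence(rng, batch_size, mean_loops):
    """An independent draw for each sequence in the batch."""
    return np.maximum(1, rng.poisson(mean_loops, size=batch_size))

rng = np.random.default_rng(0)
batch_size, mean_loops, num_batches = 32, 8, 2000

# Average loop count seen in each training step under both schemes.
per_batch_means = [sample_loops_per_batch(rng, batch_size, mean_loops).mean()
                   for _ in range(num_batches)]
per_seq_means = [sample_loops_per_sequence(rng, batch_size, mean_loops).mean()
                 for _ in range(num_batches)]

var_batch = float(np.var(per_batch_means))
var_seq = float(np.var(per_seq_means))
print(var_batch, var_seq)
```

With per-sequence sampling, each step averages over many independent draws, so the step-to-step variation in the average loop count is much smaller than when the whole batch shares a single draw.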

How dramatic is the effect of these measures?

Graph showing convergence stability across different learning rates

The authors ran an experiment across five learning rates ranging from 2e-4 to 1e-3:

The original RDM model only converged at the lowest learning rate of 2e-4;
RDM with residual normalization could only converge at learning rates of 4e-4 and below;
Parcae, however, converged stably at all five learning rates.

The longstanding issue of hyperparameter sensitivity in looped models is thus resolved.

Performance Results

With stability issues addressed, the true power of the looped architecture is finally unleashed. The authors conducted experiments at scales from 140M to 1.3B parameters. Comparing against standard Transformers with equivalent parameter counts, the results are striking: Parcae with 770M parameters achieved a Core score of 25.07, nearly tying the 1.3B standard Transformer (25.45). This means cutting the parameter count by roughly 40% while maintaining memory usage, with a performance gap of only 0.38 points.

Comparison chart between Parcae and standard Transformer performance

P.S. The paper does not compare the inference efficiency of Parcae versus Transformers. For instance, while the 770M Parcae nearly matches the score of a 1.3B Transformer, it needs 8 loops to do so. Differences in single-token inference latency, throughput, and memory bandwidth usage remain to be verified.

Compared to RDM, the previous best looped model, Parcae improved the average score on downstream tasks by up to 1.8 points.

The authors also performed ablation studies on the three optimization points mentioned above: Constrained A prevents divergence at T=4/8, Per-Seq. Sampling reduces the variance of loss spikes, and Prelude Norm improves global quality while resolving late-stage spikes.

Scaling Laws for Looped Models

If solving the stability issue clears the blockage that held the looped architecture back, the most valuable long-term contribution of this paper is that it systematically derives scaling laws for looped architectures for the first time.

The number of loops is the third independent and predictable scaling dimension, alongside parameter count and data volume.

How to Spend Compute Most Cost-Effectively During Training?

The authors conducted extensive isoFLOP (fixed total compute) experiments and ultimately discovered: under a fixed compute budget, the optimal training strategy is not to pour all compute into data, but to synchronously increase both the number of loops and the training data volume.

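Both scale according to a power law in total compute C. Working backward from the doubling figures quoted below, the exponents come out to roughly:

optimal loop count ∝ C^0.40, optimal data volume ∝ C^0.78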

In plain terms: Whenever your total training compute doubles, the optimal number of loops should increase by about 32%, and the optimal training data volume should increase by about 72%. This combination yields the best model performance.
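The doubling figures above pin down the power-law exponents, since a quantity that grows by a fixed fraction per doubling of compute scales as compute raised to log2(1 + fraction). The short sketch below does that arithmetic and extrapolates to other compute ratios. The function names are made up, and the exponents are derived from the rounded percentages in the text, so they are approximate.

```python
import math

def exponent_from_doubling(growth_per_doubling):
    """Exponent p such that x = k * C**p grows by the given fraction when C doubles."""
    return math.log2(1.0 + growth_per_doubling)

def scale_factor(compute_ratio, exponent):
    """How much the optimal quantity grows when compute grows by compute_ratio."""
    return compute_ratio ** exponent

loop_exp = exponent_from_doubling(0.32)   # loops: +32% per doubling
data_exp = exponent_from_doubling(0.72)   # data:  +72% per doubling
print(round(loop_exp, 2), round(data_exp, 2))

# Extrapolating the same rule to a 10x compute budget.
print(scale_factor(10, loop_exp), scale_factor(10, data_exp))
```

Under these derived exponents, a 10x compute budget would call for about 2.5x the loops and about 6x the data, assuming the power law continues to hold at that scale.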

Experiments confirmed that for the same compute budget, the strategy of "increasing loops + reducing data" achieves lower loss and better results than "low loops + dumping all data."

This opens a completely new path for large model training: If your compute is limited and you cannot stack larger models or more data, you can achieve better results at a lower cost simply by increasing the number of loops.

What is the Most Cost-Effective Number of Loops for Inference?

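Beyond training, the authors also found that the performance gain from increasing loop counts during inference follows a saturating exponential decay law. In general form, with T the number of inference loops and a, b > 0 fitted constants:

L(T) = L_∞ + a · exp(−b · T)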

This formula indicates that initially, adding a few loops yields significant performance improvements. However, as the loop count increases, marginal returns diminish rapidly, eventually converging to a minimum loss lower bound L_∞, beyond which no further improvement is possible.
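The shape of such a curve is easy to see numerically. The sketch below assumes the generic saturating-exponential form, loss = floor + gap · exp(−rate · loops). The function name and all three constants are made-up values chosen only to show the diminishing returns, not fitted numbers from the paper.

```python
import numpy as np

def inference_loss(num_loops, loss_floor, gap, rate):
    """Saturating exponential: loss_floor + gap * exp(-rate * num_loops)."""
    return loss_floor + gap * np.exp(-rate * np.asarray(num_loops, dtype=float))

loops = np.arange(1, 17)
# Made-up constants for illustration only.
losses = inference_loss(loops, loss_floor=2.0, gap=1.0, rate=0.5)

# Improvement gained by each additional loop.
gains = -np.diff(losses)
for t, g in zip(loops[1:], gains):
    print(int(t), float(g))
```

Each extra loop helps less than the previous one, and the loss approaches `loss_floor` without ever crossing it.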

Crucially, this lower bound is determined by the number of loops used during training—the more loops used in training, the lower this bound, and the higher the ceiling for inference performance. It is impossible to rely on infinite looping during inference to break through the performance ceiling established during training.

Most brilliantly, the authors integrated the scaling laws for training and inference into a unified formula. This formula can accurately predict model performance under different compute budgets and loop counts, with a prediction error of only 0.85%-1.31%. In the future, researchers training looped models won't need to search hyperparameters blindly by trial and error; they can simply use this formula to calculate the optimal configuration.

Unified scaling law formula and prediction accuracy

P.S. All experiments in the paper were capped at 1.3B parameters and 104B tokens, far smaller than the scale of mainstream industrial large models. Whether the stability and parameter efficiency advantages hold on larger models remains to be verified.

Summary

Overall, this paper is undoubtedly a milestone work in the direction of looped architectures. It not only solves the long-standing training instability problem of looped Transformers at its theoretical root but also provides complete scaling laws, opening up new room for imagination in the direction of "improving model performance without stacking parameters."

However, the paper's core stability conclusions are derived from a linear approximation, so they hold only within the limits of that premise. For the complete non-linear system with attention and activation functions, a spectral radius < 1 is a necessary but not sufficient condition for stability. Furthermore, in terms of performance, there is no comparison with Transformer models after post-training.

Although many gaps remain and large-scale industrial deployment is still some way off, the truly interesting question is where the optimum lies once all three scaling axes (parameters, data, and loop count) are unleashed at the same time, and Parcae has just opened the door.

References

[1] Parcae: Scaling Laws For Stable Looped Language Models: https://arxiv.org/abs/2604.12946

I am rumor, an AI algorithm researcher who is both punk and geek.

Large Model Algorithm Researcher, Google Developer Expert.
