Are LLM RL Training Trajectories Actually Linear? Miaow Lab's Latest Work: Directly 'Predict' Future Models Without Further Training!



Author: Tianle Wang, PhD student in Data Science at City University of Hong Kong, supervised by Prof. Ning Miao[1], research focus on large language model reasoning.

DeepSeek-R1's popularity has brought RLVR (Reinforcement Learning with Verifiable Rewards) back into the spotlight for large-model post-training. But anyone who has reproduced R1-Zero or a similar pipeline knows that RLVR is extremely expensive: it not only requires thousands of training steps, but as the model's chain of thought (CoT) grows longer, the computational cost of each subsequent step rises sharply.

Do we really need to run through the lengthy RL training step by step?

Today, we introduce a paper newly posted to arXiv: "Not All Steps are Informative: On the Linearity of LLMs' RLVR Training".

Paper: https://arxiv.org/abs/2601.04537
Code: https://github.com/Miaow-Lab/RLVR-Linearity

This work reveals a counterintuitive phenomenon: During RLVR, the LLM's weights and output probabilities exhibit surprisingly linear changes!

Based on this discovery, we propose a "weight extrapolation" method that directly "calculates" future models without training, achieving up to a 6.1x training speedup.

01. Counterintuitive Discovery: Is RLVR Training "Linear"?

The Transformer is itself a highly nonlinear system, so one would intuitively expect its parameter-update trajectory to wind around. However, analyzing the DeepSeek-R1-Distill series of models under several RL algorithms (GRPO, Reinforce++, GSPO), we found a surprising fact:

1. Linear Weight Changes

As RL training steps increase, the changes in model weights show a strong linear correlation with step number. In experiments, over 80% of parameters have R² (coefficient of determination) greater than 0.7, with most concentrated around 0.9.

This means the model's state at step 1000 can almost be obtained by extending the straight line drawn through its states at steps 100 and 200!
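The linearity measure behind this claim can be sketched in a few lines. This is an illustrative reimplementation, not the paper's code: fit a least-squares line to one scalar weight's value across checkpoints and report R²; the checkpoint steps and weight values below are toy numbers.

```python
# Toy sketch: how linearly does one weight move across RL checkpoints?
# In practice `values` would be the same scalar parameter read out of
# several saved state_dicts (steps and values here are illustrative).

def linearity_r2(steps, values):
    """R^2 of a least-squares line fit of one weight's value vs. step."""
    n = len(steps)
    mx, my = sum(steps) / n, sum(values) / n
    sxx = sum((x - mx) ** 2 for x in steps)
    sxy = sum((x - mx) * (y - my) for x, y in zip(steps, values))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(steps, values))
    ss_tot = sum((y - my) ** 2 for y in values)
    return 1.0 - ss_res / ss_tot

# one weight sampled at four checkpoints (toy, nearly linear numbers)
r2 = linearity_r2([100, 200, 300, 400], [0.012, 0.019, 0.027, 0.033])
```

Repeating this fit for every parameter and histogramming the R² values is what produces the "over 80% above 0.7" distribution described above.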

2. Linear Changes in Output Log-Prob

Even more striking, this linearity is not confined to parameter space; it shows up directly in the model's output behavior. For the same prompt, the log-probability the model assigns to specific tokens also changes linearly with the training step.

The tokens affected most are largely connectives (such as "wait" and "but"), and their probabilities change linearly too: an increase indicates the model is acquiring behaviors like reflection and transition, while a decrease indicates a path being suppressed as unproductive.

Image

(Figure caption: Left shows the distribution of weight R², right shows changes in token log probabilities, revealing a clear linear trend)

What does this mean?

This suggests that in its later stages, current RLVR training may not be "continuously exploring new strategies" at all; instead, the optimization direction is fixed early in training, and the remaining thousands of steps mostly just amplify that trend.

02. Why Does This Happen?

We provide a theoretical explanation in the paper. Simply put:

  • Low learning rate & large batch size: RLVR typically uses very small learning rates (< 1e-5) and large effective batch sizes (multiplied further by the number of rollouts), so each update moves the weights only slightly.

  • Adam optimizer characteristics: with relatively stable gradient directions, the Adam optimizer tends to produce near-constant update step sizes.

  • First-order dominance: although the Transformer is nonlinear, when parameter changes are small the output change is dominated by the first-order term of the Taylor expansion in the weight change, while the second-order (Hessian) term contributes little.

This "linearity" essentially indicates: Most of RLVR's computational load might be reinventing the wheel.

03. How to Utilize This Property? From "Extrapolation" to "Alternating Training"

Since we've verified that RL training trajectories exhibit strong linear characteristics, we can be bolder: Skip those redundant intermediate steps and directly "calculate" future models.

We propose three utilization strategies:

1. Logit Extrapolation

This is a trick that "predicts the future" without any additional training. Since training trajectories are linear, we only need the logits produced by two early checkpoints (W_t1 and W_t2) on the same input, then compute the output distribution at a future step t3 with a simple linear formula:

logits_{t3} = (1+α) * logits_{t2} - α * logits_{t1}

where α is the amplification coefficient.
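The formula above is a one-liner in code. A minimal sketch follows; in practice `logits_t1` and `logits_t2` would be tensors from forward passes of the two saved checkpoints on the same prompt, and plain Python lists stand in for them here.

```python
# Sketch of logit extrapolation: jump past checkpoint t2 along the
# straight line through the two checkpoints' logits.

def extrapolate_logits(logits_t1, logits_t2, alpha):
    """logits_t3 = (1 + alpha) * logits_t2 - alpha * logits_t1."""
    return [(1 + alpha) * l2 - alpha * l1
            for l1, l2 in zip(logits_t1, logits_t2)]

logits_t1 = [1.0, 0.5, -0.5]   # logits from the checkpoint at step t1
logits_t2 = [1.5, 0.3, -0.8]   # logits from the checkpoint at step t2
logits_t3 = extrapolate_logits(logits_t1, logits_t2, alpha=2.0)
# logits_t3 ≈ [2.5, -0.1, -1.4]; softmax over these drives sampling
```

Note that α = 0 recovers checkpoint t2 exactly, and larger α jumps proportionally further along the trajectory.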

Image

Experimental finding (surprise): This doesn't just simulate the future; it even surpasses it.

Experimental data shows that Logit Extrapolation achieves consistent performance improvements on AIME and LiveCodeBench. More importantly, it effectively suppresses the common "Entropy Collapse" and overfitting issues in later RL training stages.

Simply put, it helps the model "filter out" noise from later training, achieving performance about 3% higher than running through all the training steps faithfully.

2. Weight Extrapolation - Directly Predicting Parameters

If Logit Extrapolation predicts outcomes, Weight Extrapolation directly predicts the model itself:

W_{t3} = (1+α) * W_{t2} - α * W_{t1}
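Applied to a whole model, the same formula runs parameter-by-parameter over two saved checkpoints. The sketch below uses plain dicts of lists as stand-in state dicts; real code would iterate over torch `state_dict` tensors, and all names are illustrative.

```python
# Sketch of weight extrapolation: W_t3 = (1 + alpha) * W_t2 - alpha * W_t1,
# applied independently to every parameter of two checkpoints.

def extrapolate_weights(w_t1, w_t2, alpha):
    return {
        name: [(1 + alpha) * b - alpha * a
               for a, b in zip(w_t1[name], params_t2)]
        for name, params_t2 in w_t2.items()
    }

w_t1 = {"layer.weight": [0.10, -0.20]}   # checkpoint at step t1
w_t2 = {"layer.weight": [0.12, -0.25]}   # checkpoint at step t2
w_t3 = extrapolate_weights(w_t1, w_t2, alpha=4.0)  # jump ~4 intervals ahead
```

Unlike logit extrapolation, this produces an actual model that can be evaluated, fine-tuned, or used as the starting point for further RL, which is what RL-Extra below exploits.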

Image

Experimental finding (inverted U-curve):

Fixing two early checkpoints and extrapolating forward to targets of increasing distance reveals an interesting "inverted U-shape":

Within a certain range (several hundred steps), the directly calculated model performs fully on par with actual training; but as the extrapolation distance grows, performance first rises and then falls, and jumping too far (e.g., from step 300 directly to step 2000) degrades it.

This indicates that although the overall direction is linear, the model still needs small course corrections over long horizons; pure linear extrapolation has its limits.

3. RL-Extra (Alternating Training) - The Core Move

To address pure extrapolation errors over long distances, we propose RL-Extra: "Run a few RL steps to calibrate direction -> Extrapolate a large chunk forward -> Run a few more RL steps to calibrate -> Extrapolate again."

The core idea is: "Use a small amount of real RL training to calibrate direction (Grounding), use large amounts of linear extrapolation to accelerate progress."

This is a cyclical process (Cycle K):

1. Grounding stage: perform k steps of normal RL gradient updates (e.g., GRPO) so the model absorbs real reward signals and corrects its optimization trajectory.
2. Extrapolation stage: along the direction thus determined, linearly extrapolate N steps ahead directly in weight space.
3. Cycle: return to RL updates to re-calibrate the direction, then extrapolate again.
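The cycle above can be sketched as a short loop. This is an illustrative skeleton, not the paper's implementation: `rl_step` is a hypothetical stand-in for one real GRPO/Reinforce++ update, and weights are plain dicts of lists rather than tensors.

```python
# Sketch of the RL-Extra alternating schedule:
#   k real RL updates (grounding) -> one linear jump (extrapolation) -> repeat.

def extrapolate(w_old, w_new, alpha):
    """Jump alpha extra intervals along the direction w_new - w_old."""
    return {name: [(1 + alpha) * b - alpha * a
                   for a, b in zip(w_old[name], vals)]
            for name, vals in w_new.items()}

def rl_extra(weights, rl_step, cycles, k, alpha):
    for _ in range(cycles):
        grounded_from = {n: list(v) for n, v in weights.items()}
        for _ in range(k):           # 1) grounding: k real RL updates
            weights = rl_step(weights)
        # 2) extrapolation: amplify the direction just measured
        weights = extrapolate(grounded_from, weights, alpha)
    return weights
```

With k real steps per cycle and amplification α, each cycle advances roughly k·(1+α) effective steps while paying for only k real updates, which is where the wall-clock savings come from.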

04. Experimental Results: Free Computation, Same Effect

On standard benchmarks such as AIME24, MATH-500, and LiveCodeBench, RL-Extra demonstrates striking efficiency:

  • Speed Boost: To achieve the same AIME24 accuracy (e.g., 38%), standard RL requires 1100 steps, while RL-Extra (20 RL steps + 100 extrapolation steps) only needs 180 steps of real RL computation.

  • Overall Speedup Ratio: Achieved 6.1x Wall-clock speedup!

  • Performance Lossless: Under various compute budgets, RL-Extra's performance matches or exceeds standard RL training.

Image

(Figure caption: RL-Extra consistently outperforms standard RL under the same training budget)

This reconfirms: A large portion of computational steps in RL training is actually just "linear repetition" and can be completely replaced by mathematical extrapolation.

05. Summary and Reflection

This article not only provides a practical acceleration trick but, more importantly, makes us re-examine the training mechanism of RLVR.

1. Low Information Density: Most steps in existing RLVR training have minimal information increment; they merely mechanically execute predetermined routes.

2. Direction Matters Most: Early-stage exploration of the optimization direction may matter more than we thought. Once the direction is set, the rest is linear "execution."

3. Generality: This conclusion has been verified across multiple base models (Qwen, Llama, DeepSeek) and various algorithms (GRPO, Reinforce++).

For readers with limited compute who want to reproduce DeepSeek-R1 or train domain-specific reasoning models, RL-Extra is a cost-saving option well worth trying.

One More Thing:

If your GPUs are burning, try plotting your checkpoints first. Your model may also be walking down a straight, broad road, just waiting for you to "extrapolate" it!

References

[1] Ning Miao: https://www.ningmiao.space/


