SortedRL: Cutting Large-Model RL Training Bubbles by Over 50%, Boosting Performance by Up to 18%

Editor's Note: Reinforcement learning (RL) has become a core method for enhancing the reasoning capabilities of large models. However, the bottleneck of RL training efficiency has long been a stubborn problem—the rollout phase alone can consume up to 70% of the training time. SortedRL employs an elegant scheduling technique to cut this bottleneck in half.

Background: Why is RL Training So Slow?

Current mainstream large model RL training methods (such as GRPO and PPO) face a structural dilemma:

  • Heavy Rollout: Generating chain-of-thought sequences up to 16k tokens long; autoregressive decoding itself is very slow.
  • High Synchronization Overhead: Policy updates can only occur after rollouts are completed, leaving GPUs idle and waiting for long periods.
  • Batch Imbalance: Significant variations in output lengths across different samples cause short samples to wait for long ones, dragging down overall efficiency.

The result is that even with sufficient cluster computing power, actual GPU utilization remains extremely low, with training time largely wasted on "bubbles."
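The waiting cost of a synchronous batch is easy to see with a little arithmetic. The sketch below is purely illustrative (the sequence lengths are made-up numbers, not figures from the paper): it estimates what fraction of decode slots sit idle when every sample in a batch must wait for the longest one.

```python
# Illustrative only: estimate the GPU "bubble" when a synchronous rollout
# batch must wait for its longest sequence before any policy update.

def sync_bubble_ratio(lengths):
    """Fraction of decode slots idle when every sample waits for the longest."""
    longest = max(lengths)
    used = sum(lengths)                # slots doing useful decoding
    total = longest * len(lengths)     # slots reserved until the batch ends
    return 1.0 - used / total

lengths = [500, 900, 1_400, 16_000]    # one 16k chain-of-thought outlier
print(f"bubble ratio: {sync_bubble_ratio(lengths):.0%}")   # ~71% idle
```

One long outlier dominates: a single 16k chain leaves roughly 70% of the batch's decode capacity unused, which is why length imbalance, not raw compute, drives the bubbles.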

Core Method: Sort by Length, Update While Running

The core idea behind SortedRL is surprisingly simple:

Rollout Buffer
      │
      ▼
[Sort by Output Length] ← Short samples first
      │
      ▼
Short Group → Early Policy Update → Next Round of Rollout Continues
Long Group → Later Batch Updates
      │
      ▼
Stateful Controller (Controls the degree of off-policy shift)
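The sort-and-group step in the diagram can be sketched in a few lines. This is a minimal sketch under our own assumptions (the scheduler knows, or can estimate, each rollout's output length; field names and sizes are our invention, not the paper's API):

```python
# Minimal sketch of length-aware grouping: sort the rollout buffer by
# output length, then cut it into update micro-batches so that short
# groups reach the policy update first.

def schedule_by_length(buffer, group_size):
    """Sort rollouts by output length and split into micro-batches."""
    ordered = sorted(buffer, key=lambda s: s["out_len"])
    return [ordered[i:i + group_size]
            for i in range(0, len(ordered), group_size)]

buffer = [
    {"id": 0, "out_len": 12_000},
    {"id": 1, "out_len": 700},
    {"id": 2, "out_len": 15_500},
    {"id": 3, "out_len": 950},
]
groups = schedule_by_length(buffer, group_size=2)
# The first group holds the two short samples (ids 1 and 3); the 12k and
# 15.5k chains are deferred to a later batch, as in the diagram above.
```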

Three Key Designs:

  • Online length-aware scheduling: updates short samples first to eliminate waiting bubbles.
  • Separate large rollout / small update batches: improves parallel efficiency and reduces memory pressure.
  • Cache-controlled off-policy adjustment: balances sample freshness against training speed.
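The three designs map naturally onto a handful of scheduler knobs. The configuration sketch below is hypothetical (the field names are our invention; the paper does not specify a config schema):

```python
# Hypothetical config sketch: one knob per design above.
sortedrl_cfg = {
    "length_aware_scheduling": True,   # sort rollouts, update short groups first
    "rollout_batch_size": 1024,        # large batch keeps decode throughput high
    "update_batch_size": 128,          # small batch bounds optimizer memory
    "max_staleness": 4,                # cap on off-policy shift: a cached sample
                                       # may lag the live policy by <= 4 updates
}
```

The `max_staleness` knob stands in for the stateful controller: it bounds how far a cached rollout may drift from the live policy before it must be discarded or refreshed.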

Essentially, SortedRL transforms the original serial pipeline of "completing all samples before a unified update" into a streaming schedule of "updating while running." Once a group of short samples is ready, it immediately triggers policy gradient updates without waiting for long sample generation to finish.
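The "updating while running" behavior can be simulated in miniature. The toy below is purely illustrative (one token decoded per tick, lengths made up): as soon as a full short group finishes, a policy update fires, rather than waiting for the longest chain.

```python
# Toy simulation of streaming updates: decoding advances one token per
# tick; when `group_size` sequences have finished, an update fires.

def streaming_updates(lengths, group_size):
    """Return the tick at which each policy update fires."""
    remaining = sorted(lengths)            # short sequences finish first
    finished, update_ticks = 0, []
    for tick in range(1, max(lengths) + 1):
        finished += sum(1 for n in remaining if n == tick)
        while finished >= group_size:      # a full short group is ready
            update_ticks.append(tick)
            finished -= group_size
    return update_ticks

# Updates fire at ticks 900 and 16_000: the first lands long before the
# 16k chain completes, versus a single synchronous update at tick 16_000.
print(streaming_updates([500, 900, 1_400, 16_000], group_size=2))
```

The first gradient step arrives an order of magnitude earlier than under the all-or-nothing schedule, which is exactly the bubble the sorting eliminates.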

Experimental Results: Let the Data Speak

Experiments conducted on LLaMA-3.1-8B and Qwen-2.5-32B covered reasoning benchmarks such as AIME 24, MATH-500, and Minerva Math:

  • Training Bubble Ratio Reduced by > 50%: GPU idle time is significantly compressed.
  • Performance Improvement of 3.9% ~ 18.4%: Achieved under equivalent training volumes compared to baseline methods.
  • Supports 16k Token Long-Chain Reasoning: Maintains chain-of-thought quality without loss.

While improving efficiency, SortedRL also constructs a "near on-policy micro-curriculum"—short samples are updated frequently, while long samples are updated after accumulation. This naturally forms a training rhythm from easy to difficult, helping to stabilize the RL training process.

Conclusion

The value of SortedRL lies not only in speed improvements but also in revealing a counter-intuitive insight: The bottleneck of RL training is not the algorithm, but the scheduling. In large-scale clusters, keeping GPUs constantly busy is just as important as choosing the right reward function. As RLVR (Reinforcement Learning with Verifiable Rewards) becomes the industry standard, the value of system-level optimizations like SortedRL will become increasingly prominent.

Paper Link: https://arxiv.org/abs/2603.23414


AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.