"In a nutshell: Still using PPO to fine-tune large models? You may be getting burned by its core mechanism. This paper finds that PPO's 'ratio clipping' (CLIP), designed to limit update magnitude, breaks down on the long-tailed distributions of language models: it harshly blocks the model from exploring novel low-frequency tokens, yet fails to rein in the violent drift of high-frequency tokens. The authors propose the DPPO algorithm, which performs risk control directly on the 'true distance' between policy distributions, tackling the long-standing problems of low training efficiency and instability. (The paper's title appears at the end. Published on arXiv on 05 Feb 2026, by Sea AI Lab, Singapore / NUS.)"
Phase 1: Identifying Core Concepts
Analysis of the Paper's Motivation
In reinforcement learning fine-tuning for LLMs (e.g., RLHF), PPO (Proximal Policy Optimization) is the default king. PPO's core mechanism, ratio clipping, limits the magnitude of model updates, preventing the new policy from deviating too far from the old one (i.e., from leaving the Trust Region). The authors discovered a long-overlooked structural flaw:
PPO's clipping mechanism was designed for traditional RL (small action spaces) and is not suitable for LLMs (huge vocabulary spaces).
This manifests in two specific problems:
- Excessive penalty for low-probability tokens: The long-tailed distribution of LLMs means many reasonable exploration tokens have very low probabilities (e.g., 0.00001). If the model slightly increases such a token's probability (to 0.0001), the ratio instantly jumps to 10, far outside the clip range, so PPO clips the update and the model cannot learn the exploration.
- Inadequate constraint for high-probability tokens: For a token with extremely high probability (e.g., 0.99), if the probability drops to 0.8, the ratio (≈0.808) barely changes and PPO considers the move safe. Yet in terms of probability mass, 0.19 of the whole distribution has shifted, which is a huge move that can lead to training collapse.
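The asymmetry between these two cases is easy to verify with a few lines of arithmetic. A minimal sketch, using the toy probabilities from the two cases above (illustrative numbers, not values from the paper's experiments):

```python
def ratio(p_new, p_old):
    """PPO's importance ratio: pi_new(token) / pi_old(token)."""
    return p_new / p_old

def mass_moved(p_new, p_old):
    """Absolute probability mass shifted on this one token."""
    return abs(p_new - p_old)

# Low-probability token: 0.00001 -> 0.0001.
# The ratio explodes (10x, far outside a typical clip range of [0.8, 1.2]),
# yet almost no probability mass actually moved.
print(ratio(0.0001, 0.00001), mass_moved(0.0001, 0.00001))

# High-probability token: 0.99 -> 0.8.
# The ratio (~0.808) sits near the edge of the clip range and looks "safe",
# yet 0.19 of the distribution's mass has shifted.
print(ratio(0.8, 0.99), mass_moved(0.8, 0.99))
```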
Analysis of the Paper's Main Contributions
- Proposes Divergence Proximal Policy Optimization (DPPO): Abandons PPO's crude 'ratio' as a proxy and directly uses policy divergence (such as Total Variation or KL divergence) to decide whether to update parameters.
- Introduces lightweight approximation methods: Computing the full divergence over a vocabulary of hundreds of thousands of tokens is too memory-intensive, so the authors propose Binary and Top-K approximations, making DPPO's computational overhead almost negligible.
- Establishes a theoretical framework for LLMs: Derives a policy improvement lower bound specifically for LLM generation tasks (finite horizon, no discount, sequence-level reward), bridging the gap between classic RL theory and LLM practice.
- Significant experimental results: On mathematical reasoning tasks (AIME24/25), DPPO significantly outperforms PPO variants like GRPO in both stability and final performance, and can train stably without complex Rollout Router Replay tricks.
Identification of Difficulties in Understanding
- The most challenging concept: The essential difference between Ratio vs. Divergence (Total Variation). It is difficult for beginners to intuitively understand why a 'large ratio change' does not equal a 'large distribution change'. This is key to understanding why PPO fails on LLMs.
- Core concepts that need key explanation:
- PPO's ratio trap: Why is it unfair to the long-tailed vocabulary?
- Binary Approximation: How to compute an approximate divergence without traversing the entire vocabulary?
Conceptual Dependencies
To understand DPPO, you must first thoroughly understand where PPO went wrong through analogy. The path is: understand PPO's flaw → introduce the concept of 'true distance' (TV divergence) → solve the computational challenge (approximation methods) → final solution (DPPO).
Phase 2: In-Depth Explanation of Core Concepts
Key Elements in the Analogy
- Investment Manager: the model's policy (Policy).
- Investment Portfolio: Vocabulary.
- Blue-chip stocks (high-probability tokens): For example, 'Apple' or 'Microsoft', originally investing 99% of the capital (probability 0.99) here.
- Junk stocks (low-probability tokens): Some unknown small companies, investing only 1 dollar (probability 0.00001).
- Risk Control Rules: Trust Region.
Corresponding Technical Concepts for Each Element
- Investment operation: Model updates parameters, changing the generation probability of tokens.
- Capital change: The amount of change in the probability distribution (Divergence).
- Capital multiple: The probability ratio (Ratio) in PPO.
Explanation of Why These Correspondences Are Reasonable
This analogy captures the long-tailed characteristic of LLMs where 'high-frequency words occupy most of the probability mass' while 'low-frequency words, although having extremely low probabilities, are numerous in quantity', just like the extremely uneven capital allocation in an investment portfolio.
Deep Technical Details
The Core of PPO's Problem
PPO uses the ratio r = π_new / π_old for clipping.
- For low-probability words, the denominator is extremely small, and a slight change in the numerator causes r to explode.
- For high-probability words, the denominator is extremely large, and a large change in the numerator is not obvious in r.
DPPO's Solution
DPPO introduces a dynamic mask based on divergence. It no longer blindly clips but calculates the Total Variation (TV) distance between the new and old policy distributions.
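Concretely, the Total Variation distance is half the summed absolute difference between the two distributions. A minimal sketch (the 3-token vocabularies are invented, purely to mirror the two scenarios discussed in this section):

```python
def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_x |p(x) - q(x)|; ranges from 0 to 1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Boosting a rare token (index 2) barely moves the distribution...
tv_rare = total_variation([0.98999, 0.01, 0.00001],
                          [0.98990, 0.01, 0.0001])
print(tv_rare)  # tiny, ~9e-5

# ...while shrinking a dominant token (index 0) moves a lot of mass.
tv_dominant = total_variation([0.99, 0.005, 0.005],
                              [0.80, 0.10, 0.10])
print(tv_dominant)  # large, ~0.19
```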
Key Formula: DPPO's Objective Function
Symbol Replacement Version: DPPO's Training Objective = Expected value [ Sum over every token ( Smart Risk Control Switch × Probability Ratio × Advantage Value ) ]
The core here is the Smart Risk Control Switch (the Mask):
Status of the Smart Risk Control Switch:
- Case 1 (wants to add to a position but exceeds the limit): If the action is good (advantage greater than 0) and its probability is being increased, and the whole portfolio's change (Divergence) has already exceeded the safety line (δ), turn the switch off (0) and block the update.
- Case 2 (wants to reduce a position but exceeds the limit): If the action is bad (advantage less than 0) and its probability is being decreased, and the whole portfolio's change (Divergence) has already exceeded the safety line (δ), turn the switch off (0) and block the update.
- Other cases: Everything is normal, keep switch on (1).
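The three cases above reduce to a tiny predicate. A sketch (the function name and argument layout are mine; δ = 0.15 follows the example threshold used later in the process description):

```python
def dppo_mask(advantage, p_new, p_old, divergence, delta=0.15):
    """Smart Risk Control Switch: return 0 (block the update) only when the
    token is being pushed in the direction of its advantage AND the measured
    divergence already exceeds the trust-region budget delta."""
    adding = advantage > 0 and p_new > p_old     # Case 1: adding to a position
    reducing = advantage < 0 and p_new < p_old   # Case 2: reducing a position
    if (adding or reducing) and divergence > delta:
        return 0  # over the safety line: switch off
    return 1      # everything normal: switch stays on

# Case 1 triggered: good action, probability rising, divergence over budget.
print(dppo_mask(advantage=1.0, p_new=0.5, p_old=0.2, divergence=0.3))   # 0
# Same direction but divergence still tiny: update allowed.
print(dppo_mask(advantage=1.0, p_new=0.0001, p_old=0.00001, divergence=0.00009))  # 1
```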
Solving the Memory Nightmare: Binary Approximation
To compute accurate TV divergence, it is necessary to traverse all tokens in the vocabulary (e.g., 100,000 words) to sum the probability changes. This is too slow and memory-intensive during training. The authors proposed Binary Approximation.
Formula:
Symbol Replacement Version: Simplified Change Amount Calculation = | Probability of current token under old policy - Probability of current token under new policy |
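A sketch of why this single-token shortcut is reasonable (toy 4-token vocabulary, invented for illustration). The binary estimate is always a lower bound on the exact TV distance; in this example the mass leaving the sampled token spreads over the rest, so the two coincide:

```python
def full_tv(p_new, p_old):
    """Exact TV distance: needs the full distribution over the vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_new, p_old))

def binary_tv(p_new_token, p_old_token):
    """Binary approximation: only the sampled token's two probabilities."""
    return abs(p_new_token - p_old_token)

old = [0.90, 0.05, 0.04, 0.01]   # old policy over a 4-token vocabulary
new = [0.70, 0.15, 0.14, 0.01]   # new policy; the sampled token is index 0
print(full_tv(new, old))         # ~0.2, summing over all 4 entries
print(binary_tv(new[0], old[0])) # ~0.2, using only the sampled token
```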
Mutual Mapping of Technical Details and Analogy
- Scenario 1: You discover a potential stock (update of low-probability token): You think that 'junk stock' with only 1 dollar invested has great prospects and decide to increase its position to 100 dollars.
- PPO (Ratio Risk Control Officer): Only looks at the growth multiple. It screams: "Oh my god! You increased the position by 100 times (Ratio = 100)! This is an illegal operation! Rejected!" Result: Beneficial exploration is stifled.
- DPPO (Divergence Risk Control Officer): Looks at the total capital moved. It says: "Oh, you only moved 99 dollars, a negligible sliver of the total assets. Approved." Result: The model successfully learned new knowledge.
- Scenario 2: Blue-chip stock crashes (update of high-probability token): You suddenly feel that 'Apple' is not good and decide to reduce the position from 99% to 80%.
- PPO (Ratio Risk Control Officer): Still only looks at the multiple. It calculates: 0.8 / 0.99 ≈ 0.808. It says: "The multiple change is not large, still within the safe range of 0.8 to 1.2. Approved." Result: A huge policy shift goes unnoticed, leading to training instability (Collapse).
- DPPO (Divergence Risk Control Officer): Sees that you moved 19% of the huge capital. It immediately alarms: "Warning! Capital change amount (Total Variation) exceeds the threshold! This is too dangerous, must prevent!" Result: Avoided model collapse.
Summary
PPO's Ratio mechanism is like a risk control officer who only understands elementary school multiplication and division, prone to making a fuss over small transactions, yet blind to the loss of large assets. The TV divergence (and its approximation) introduced by DPPO is a true actuary, focusing on the actual 'capital flow', thus allowing the model to effectively explore long-tailed knowledge while ensuring safety.
Phase 3: Detailed Explanation of Process Steps
Specific Process Pseudocode
Input Stage: We have a prompt dataset (e.g., math problems D), an anchor model (the old/rollout policy, written π_ref below: a frozen snapshot of the policy taken at rollout time), and the model currently being trained (the policy π_θ).
Step 1: Sampling and Generation (Rollout)
- Take a Prompt p from the dataset.
- Let the current model π_θ generate a complete answer sequence a based on p.
- Key Record: During generation, retain not only the generated tokens but also each sampled token's probability under the old (rollout) policy π_ref and under the current policy π_θ.
Step 2: Advantage Estimation
- Send the generated answer a to the reward model (or score via rules, e.g., check if the answer is correct for math problems), obtaining a scalar reward r.
- To reduce variance, typically use the GRPO method: generate multiple answers for the same Prompt p, and calculate the advantage A of each answer relative to the average.
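The group-relative advantage can be sketched in a few lines (GRPO commonly also normalizes by the group's standard deviation; the epsilon guarding against a zero-variance group is my addition):

```python
def group_advantages(rewards, eps=1e-8):
    """Advantage of each sampled answer relative to its group:
    A_i = (r_i - mean(r)) / (std(r) + eps), computed per prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 4 answers to one math prompt, scored 1 if correct else 0:
# correct answers get a positive advantage, wrong ones a negative one.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```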
Step 3: Compute Divergence and Build the Mask (Divergence Masking). Enter the training loop; for each position t in the sequence:
- Input: Current token a_t, old probability π_ref(a_t), new probability π_θ(a_t).
- Compute Ratio: r_t = π_θ(a_t) / π_ref(a_t).
- Compute Approximate Divergence (Binary TV): Directly take the absolute difference: Δ_t = |π_θ(a_t) - π_ref(a_t)|. (Note: If using Top-K approximation, need to compare the probability distributions of the top K high-frequency words).
- Generate Mask M_t:
- Set a hyperparameter δ (e.g., 0.15).
- Check if it violates the trust region:
- If (A_t > 0 and π_θ(a_t) > π_ref(a_t)) or (A_t < 0 and π_θ(a_t) < π_ref(a_t)): the update is pushing the token's probability in the direction of its advantage.
- In that case, if Δ_t > δ, the update would cause an excessive distribution change and must be intercepted.
- If the interception condition is met, set M_t = 0; otherwise M_t = 1.
Step 4: Compute Loss and Backpropagate
- Input: Mask M_t, ratio r_t, advantage A_t.
- Compute Loss: Multiply them: L_t = -M_t * r_t * A_t. (The negative sign turns reward maximization into loss minimization.)
- Gradient Update: Compute gradients based on Loss, update the parameters of model π_θ.
Output Updated model parameters, entering the next training step.
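Steps 3 and 4 can be condensed into one per-token loop. A plain-Python sketch (no autograd; a real implementation would work on log-probabilities inside a deep-learning framework, and the three toy tokens below are invented):

```python
def dppo_token_losses(new_probs, old_probs, advantages, delta=0.15):
    """One pass of Steps 3-4 over a sampled sequence.
    new_probs/old_probs: each sampled token's probability under pi_theta and
    under the rollout-time anchor policy; advantages: one A_t per token."""
    losses = []
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        r_t = p_new / p_old                     # importance ratio
        delta_t = abs(p_new - p_old)            # Binary TV approximation
        pushing = (adv > 0 and p_new > p_old) or (adv < 0 and p_new < p_old)
        m_t = 0.0 if (pushing and delta_t > delta) else 1.0  # the mask M_t
        losses.append(-m_t * r_t * adv)         # L_t = -M_t * r_t * A_t
    return losses

# Token 1: safe exploratory boost of a rare token -> strong learning signal.
# Token 2: probability moving against its advantage -> mask stays on.
# Token 3: high-prob token dropped by 0.19 > delta on negative feedback
#          -> masked out, zero loss contribution.
print(dppo_token_losses(new_probs=[0.0001, 0.30, 0.80],
                        old_probs=[0.00001, 0.35, 0.99],
                        advantages=[1.0, 1.0, -1.0]))
```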
Phase 4: Experimental Design and Validation Analysis
Main Experiment Design: Verification of Core Argument
Core Claim: DPPO is more stable than PPO/GRPO and can handle the long-tailed vocabulary problem.
- Dataset and Task: Mathematical reasoning, using AIME 2024 and AIME 2025 as test sets. Mathematical reasoning places extremely high demands on the logical chain; a collapse at any step yields a wrong result, and AIME's high-difficulty competition problems make it well suited to testing RL stability.
- Baseline Methods:
- GRPO (Clip-Higher): The GRPO algorithm used by DeepSeek-R1, plus the Clip-Higher trick.
- CISPO: Another method attempting to solve the trust region problem.
- Experimental Results: DPPO's Pass@1 accuracy on AIME 24/25 not only rises faster (higher efficiency), but its final converged score is significantly higher than GRPO's. The conclusion: using divergence instead of the ratio as the constraint brings tangible performance gains.
Ablation Analysis: Contribution of Internal Components
Question 1: Is the Trust Region really necessary?
- Ablation Object: Remove all Trust Region constraints (i.e., PG-IS method).
- Result: Training quickly collapsed, with accuracy plummeting.
- Conclusion: The Trust Region is necessary: even with an extremely low learning rate, the mismatch between training and inference accumulates.
Question 2: Who should the Anchor be?
- Ablation Object: Compare using π_rollout (Rollout Policy) as the anchor vs. using π_recomputed (Recomputed Policy) as the anchor.
- Result: Using the Rollout Policy as the anchor (DPPO's approach) is very stable, whereas using the Recomputed Policy (the default in many open-source libraries, such as MiniRL) leads to collapse.
- Conclusion: It must be anchored on the model that generated the data, which aligns with the theoretical assumption of On-Policy.
Question 3: Is Binary Approximation sufficient?
- Ablation Object: Compare Binary TV vs. Top-K TV.
- Result: The performance of both is almost identical.
- Conclusion: Simple Binary Approximation has already captured most of the distribution shift information, with extremely high cost-effectiveness.
Deep/Innovative Experiment Analysis: Insight into the Intrinsic Characteristics of the Method
Experiment A: What tokens are mistakenly killed by PPO?
- Design: The authors printed out and analyzed the tokens clipped during training.
- Discovery: The most clipped tokens are all key reasoning words! For example, numbers ("1", "4"), mathematical symbols ("+", "="), logical connectives ("Therefore", "Since").
- Insight: These words usually have low probabilities in the initial model (because reasoning paths are diverse). Once RL finds a correct path and wants to significantly increase the probability of these words, PPO prevents the update in a "one-size-fits-all" manner because the Ratio is too large.
Experiment B: Dissecting the Source of "Instability"
- Design: The authors gradually relaxed constraints to see which updates caused the collapse.
- Discovery: The main driver of collapse is over-updating on negative samples. When the model receives negative feedback, without Divergence constraints, the model will frantically lower the probability of related tokens, leading to "overcorrection" and destroying the model's original language capabilities.
- Conclusion: DPPO's mask mechanism plays a key protective role in handling negative feedback.
This paper's title: Rethinking the Trust Region in LLM Reinforcement Learning