In a nutshell, large models often become less intelligent during training on complex reasoning tasks because a vicious cycle forms between their "questioning ability" and their "information digestion ability." The authors have stripped away the facade of outcome-based Reinforcement Learning (RL), proving that simply adding extremely simple positive and negative critiques at each step to forcibly reallocate advantage values can easily break this "Information Self-Locking" curse. (Original paper title at the end; click "Read Original" to jump to the source link. Published on arXiv on 12 Mar 2026, by The Chinese University of Hong Kong)
When building LLM agents capable of actively searching, asking questions, and solving complex problems, we frequently encounter scenarios where the agent becomes "stupid" or simply "gives up." This paper not only identifies the culprit behind this phenomenon—"Information Self-Locking"—but also provides a clever and lightweight solution.
Phase 1: Identifying Core Concepts
Analysis of the Paper's Motivation
Current Large Language Models (LLMs) have achieved tremendous success in reasoning tasks through outcome-based Reinforcement Learning (e.g., rewarding correct answers and penalizing wrong ones). However, when the task shifts to Active Reasoning—where problem information is incomplete and the agent must collect clues through multiple rounds of questioning to answer—traditional RL fails. The agent gradually stops asking valuable questions and ignores the information it has already collected. It falls into a "low-information" vicious cycle; even increasing the final task reward does not teach it how to ask better questions.
Analysis of the Paper's Main Contributions
- Concept Deconstruction: The agent's active reasoning is decomposed into two core capabilities: Action Selection (AS, deciding what to ask) and Belief Tracking (BT, how to digest new clues and update internal hypotheses).
- Theoretical Revelation: Theoretically proves the mechanism behind Information Self-Locking: poor BT masks the contribution of excellent AS (asking good questions is useless if you cannot digest the answers), while conservative AS starves BT of material to improve on (with no new questions, there is nothing to learn from). The two form a mutually reinforcing negative feedback loop.
- Proposed Method (AReW): Introduces the "Directional Critiques" method. Instead of training complex reward models, it uses simple rule-based positive and negative feedback to directly reallocate advantage values in the policy gradient, successfully breaking the self-lock.
- Significant Results: Across 7 datasets in 3 domains, the method not only improved final accuracy (up to 60% improvement) but fundamentally restored the agent's "thirst for knowledge" interaction pattern.
Identification of Understanding Difficulties
The core challenge lies in understanding why the coupling of AS and BT causes RL failure, and how AReW steers the gradient "back on track" without changing the final objective reward. This involves the reallocation of policy gradients and advantage values in RL, representing the most challenging critical node.
Concept Dependency Relationships
Active Reasoning Tasks → Decomposed into AS and BT → Insufficient capabilities in both lead to the "Information Self-Locking" vicious cycle → Traditional advantage allocation fails → Introduction of AReW for advantage reallocation breaks the cycle. The entry point must be placed on the bidirectional coupling mechanism of AS and BT.
Phase 2: In-depth Explanation of Core Concepts
Designing a Life-like Metaphor: The Rookie Detective Case
A police department recruits a rookie detective (the agent) to solve a complex cold case.
- Action Selection (AS): The detective interrogates witnesses to collect clues (e.g., asking, "Where were you at the time of the incident?").
- Belief Tracking (BT): The detective organizes logic on the station's clue board and updates the list of suspects.
- Outcome-based Reinforcement Learning: The chief's assessment method. The chief ignores the process and only looks at whether the real culprit is caught in the end; catching them earns a bonus, failing results in a pay cut.
Establishing Correspondence Between Metaphor and Actual Technology
- Occurrence of Information Self-Locking: Initially, the rookie detective occasionally asks a good question (excellent AS) and obtains key clues. However, their logical reasoning ability is too poor to pin the clues on the board (terrible BT). Ultimately, the case remains unsolved, and the chief gives no bonus. The detective starts slacking off, only asking nonsense questions (AS degradation). Conversely, because no new clues are collected, the clue board remains empty, and logical reasoning ability (BT) never improves. The result is a good-for-nothing who neither asks nor thinks.
- AReW Solution (Directional Critiques): The police department assigns a veteran forensic expert (Directional Critiques) to follow the detective. The expert does not distribute bonuses (does not change the final reward) but provides immediate verbal feedback after every action. If a good question is asked, the expert praises (AS Critique = +1); if clues are linked correctly, the expert gives a thumbs up (BT Critique = +1). After receiving praise, the detective's advantage value is amplified, clearly indicating that the current step is correct, thereby breaking the vicious cycle.
Deep Dive into Technical Details and Mapping
In traditional PPO algorithms, the agent updates its policy based on the final reward. The AReW method achieves the effect of the forensic expert's immediate praise mathematically by introducing an auxiliary marginal objective. The corrected policy gradient update formula is as follows:
$$\nabla_\theta J \;=\; \mathbb{E}\Big[\sum_t \big(A_t + \lambda\, c_t\big)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

In words: the update direction of the policy parameters equals the expectation, over a full episode, of the sum across all steps of (original advantage value $A_t$ + injection intensity $\lambda$ × local directional critique score $c_t$), multiplied by the gradient of the log-probability of the agent's choice at that step.
Technical Mapping Relationships:
- $A_t$ (Original Advantage Value): The credit distributed to this step by the chief based on whether the final case was solved. Under Information Self-Locking, this value is often unfair to good actions (close to 0 or even negative).
- $c_t$ (Local Directional Critique Score): The forensic expert's verbal evaluation. If this step effectively collects information (or updates beliefs), the score is positive; if it is useless nonsense, it is negative.
- $\lambda$ (Injection Intensity of Critique): How much weight the forensic expert's words carry.
Summary
AReW does not modify the final objective reward given by the environment but directly performs an additive correction on the advantage value of each step when calculating gradient updates. When the detective asks a good question, even if the final case is botched resulting in an extremely low original advantage, adding the critique score pulls the overall advantage value up. This encourages the agent to make the same good action when encountering similar situations in the future. The local perspective of the forensic expert (Directional Critiques) perfectly compensates for the lag and masking nature of the chief's (Final Reward) global perspective.
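To make the rescue effect concrete, here is a tiny numeric sketch of the additive correction. The symbols and all numbers are illustrative choices of mine, not values from the paper:

```python
# Illustrative instance of the additive correction A'_t = A_t + lam * c_t.
base_advantage = -0.1   # chief's verdict: case botched, so this step looks "bad"
critique_score = 1.0    # forensic expert: the question was actually a good one
lam = 0.5               # injection intensity of the critique (hyperparameter)

corrected = base_advantage + lam * critique_score  # roughly 0.4
# The sign flips: a step the outcome reward alone would discourage
# is now reinforced, which is exactly what breaks the self-lock.
```
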
Phase 3: Detailed Step-by-Step Process
Step 1: Trajectory Collection (Rollout Generation)
- Input: Initial problem setting given by the environment (e.g., a description of a patient's clinical symptoms requiring a disease diagnosis).
- Processing: The LLM agent begins multi-round interaction with the environment. The interaction alternates between two types of rounds. Action Round: The agent generates a question (AS) based on its current internal hypothesis, and the environment returns a definite answer. Update Round: After receiving the answer, the agent explicitly outputs confidence levels for various candidate answers (BT).
- Output: A complete interaction trajectory data containing the question, environment answer, and updated confidence for each round, until the maximum number of rounds is reached or a final decision is made.
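The alternating action/update structure described above can be sketched as a minimal data record. The class and field names here are my own assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass, field

@dataclass
class Round:
    kind: str                 # "action" (AS round) or "update" (BT round)
    question: str = ""        # agent's question, in an action round
    answer: str = ""          # environment's reply, in an action round
    beliefs: dict = field(default_factory=dict)  # candidate -> confidence, in an update round

@dataclass
class Trajectory:
    problem: str              # initial incomplete problem description
    rounds: list = field(default_factory=list)
    final_decision: str = ""  # the agent's answer once it commits

# Hypothetical medical-diagnosis rollout
traj = Trajectory(problem="Patient reports fever and joint pain.")
traj.rounds.append(Round(kind="action", question="Any recent travel?",
                         answer="Yes, to a tropical region."))
traj.rounds.append(Round(kind="update",
                         beliefs={"dengue": 0.6, "flu": 0.3, "malaria": 0.1}))
traj.final_decision = "dengue"
```
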
Step 2: Extracting Directional Critique Signals (Critique Assignment)
- Input: The interaction trajectory generated in Step 1.
- Processing: Instead of calling expensive reward models, lightweight hard rules are used for scoring. For AS nodes, check whether the question elicited effective new information (non-repetitive and answered with a valid reply); if yes, the step gets a positive score, otherwise a negative one. For BT nodes, check whether confidence in the true correct answer increased after receiving valid information; if yes, positive, otherwise negative. The AS and BT scores are then weighted separately.
- Output: The local critique score corresponding to every time step in the trajectory.
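The rule checks above can be sketched in a few lines. What counts as a "valid" reply, and the exact rule details, are my simplifications of the paper's description:

```python
def as_critique(question, answer, seen_questions):
    """AS rule: +1 if the question is new and elicited a valid answer, else -1."""
    is_new = question not in seen_questions
    is_valid = answer not in ("", "irrelevant", "unknown")
    return 1.0 if (is_new and is_valid) else -1.0

def bt_critique(conf_before, conf_after):
    """BT rule: +1 if confidence in the ground-truth answer rose, else -1."""
    return 1.0 if conf_after > conf_before else -1.0

seen = {"Any recent travel?"}
repeat_score = as_critique("Any recent travel?", "Yes.", seen)  # repeated question
belief_score = bt_critique(0.3, 0.6)                            # belief moved toward truth
```
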
Step 3: Calculating Outcome Rewards and Base Advantage (Reward & Base Advantage)
- Input: Trajectory data and the final diagnosis result of the last step.
- Processing: Check whether the final diagnosis matches the true answer; if so, give a terminal reward (e.g., 1), otherwise 0. Then use standard Generalized Advantage Estimation (GAE) to work backward through the trajectory and compute the base advantage allocated to each time step.
- Output: The original advantage value for each time step.
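The backward GAE pass can be sketched as follows on a sparse-reward trajectory. The gamma/lam values and the placeholder critic estimates are illustrative, not the paper's settings:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE; `values` has one bootstrap entry more than `rewards`."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted backup
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 0.0, 1.0]       # reward only if the final answer is correct
values  = [0.2, 0.3, 0.4, 0.6, 0.0]  # placeholder critic estimates + bootstrap 0
base_adv = gae(rewards, values)
```
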
Step 4: Advantage Reweighting
- Input: Original advantage value and local critique score.
- Processing: Add the two together as a direct correction: the corrected advantage at each step equals the original advantage plus λ times the local critique score, where λ is a preset hyperparameter controlling the redistribution intensity.
- Output: The sequence of corrected advantage values.
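Applied over a whole trajectory, the reweighting is a one-line per-step correction. All numbers below are illustrative:

```python
lam = 0.5                               # redistribution intensity (hyperparameter)
base_adv  = [0.69, 0.63, 0.57, 0.40]    # e.g. the GAE output from Step 3
critiques = [1.0, -1.0, 1.0, 1.0]       # directional critique scores from Step 2

# corrected[t] = base_adv[t] + lam * critiques[t]
corrected = [a + lam * c for a, c in zip(base_adv, critiques)]
```
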
Step 5: Policy Optimization
- Input: Trajectory data, corresponding old policy probabilities, and the corrected advantage value.
- Processing: Feed the above data into the loss function calculation module of standard RL algorithms (such as PPO, GRPO, or GSPO). Use the corrected advantage to guide the model to increase the generation probability of high-advantage actions, and prevent excessive update steps through Clipping.
- Output: Updated LLM model parameters. This completes one full training iteration.
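For concreteness, here is a minimal sketch of the PPO clipped surrogate consuming the corrected advantages. The probability ratios are made-up numbers; eps=0.2 is the common PPO default, not necessarily the paper's choice:

```python
def ppo_objective(ratios, advantages, eps=0.2):
    """Pessimistic (clipped) PPO surrogate, averaged over steps; to be maximized."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        r_clipped = max(min(r, 1 + eps), 1 - eps)  # clip ratio to [1-eps, 1+eps]
        total += min(r * a, r_clipped * a)         # take the pessimistic branch
    return total / len(ratios)

ratios     = [1.10, 0.95, 1.35]   # pi_new / pi_old per step (illustrative)
advantages = [1.19, 0.13, 1.07]   # corrected advantages from Step 4
obj = ppo_objective(ratios, advantages)
```

The third step's ratio (1.35) exceeds 1 + eps, so clipping caps its contribution, which is what prevents excessive update steps.
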
Phase 4: Experimental Design and Validation Analysis
Interpretation of Main Experimental Design: Verification of Core Arguments
- Core Claim: Traditional outcome-based RL falls into Information Self-Locking; introducing AReW can break this lock and improve final performance in multi-turn reasoning tasks.
- Dataset Selection: Covers 7 datasets across 3 domains: preference estimation, medical diagnosis, and fault troubleshooting. Rationality: These tasks all involve missing information that must be obtained through multi-turn questioning, covering both discrete choice and continuous numerical scenarios, fully proving the method's universality.
- Evaluation Metrics: Final outcome reward, AS proxy metric (ability to acquire new information), BT proxy metric (degree to which confidence approaches the ground truth). Rationality: Adding local monitoring of AS and BT directly verifies the paper's theoretical hypothesis of "capability decoupling."
- Baseline Methods: Direct reasoning with strong off-the-shelf models (e.g., o4-mini), PPO, Group Relative Policy Optimization (GRPO), and GSPO. Rationality: Covers the mainstream and most current algorithm families used for training large models.
- Experimental Conclusion: In 28 evaluation settings, AReW significantly outperformed traditional baseline methods in 27 settings, supporting the core claim from both qualitative and quantitative perspectives.
Ablation Study Analysis: Contribution of Internal Components
- Design Logic: Compare an AS-only variant (critiquing only the questions) against the full AS+BT variant (critiquing both the questions and the internal belief updates).
- Experimental Conclusion: The AS ONLY version showed improvements in overall performance and internal BT capabilities, quantitatively proving the theoretical link that "better questioning provides sufficient nutrients for BT." The AS+BT dual approach achieved the highest returns in the vast majority of tasks, proving that breaking bidirectional coupling requires simultaneous intervention in both channels, exhibiting an irreplaceable synergistic effect.
Depth/Innovation Experiment Analysis: Insights into Intrinsic Method Characteristics
- Visualization of Training Dynamics: Intuitively proves the existence of "Information Self-Locking" and the phenomenon of traditional RL taking shortcuts. Plotting dynamic line charts of training steps versus various metrics reveals that under traditional PPO, Reward rises slowly while AS and BT decline or stagnate; the model learns a "blind guessing shortcut" that relies less on interaction. After adding AReW, all three curves rise in sync.
- Cross-Algorithm Generality Test: Shows that Information Self-Locking is a common ailment of outcome-based RL, not a PPO quirk. Testing the recently popular GRPO algorithm found that even though GRPO reduces variance by increasing the sampling volume, it still falls into self-locking; applying the AReW plugin on top of GRPO still brought significant improvements.
- Directional Critique Noise Stress Test: Verifies robustness when critique rules make errors. During training, correct critique signals are reversed with a certain probability, reaching a noise rate of up to 50%. Results show that even at a high noise rate of 40%, AReW still defeats the original PPO baseline. This perfectly aligns with the mathematical proposition derived by the authors (convergence is possible as long as weighted accuracy is greater than 50%), proving the method possesses strong fault tolerance for engineering deployment.
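The corruption protocol described above can be simulated in a few lines: each critique sign is flipped independently with probability p. The paper's exact protocol may differ; this only illustrates the setup:

```python
import random

def corrupt(critiques, p, rng):
    """Flip the sign of each critique score with probability p."""
    return [-c if rng.random() < p else c for c in critiques]

rng = random.Random(0)
clean = [1.0, -1.0, 1.0, 1.0, -1.0] * 200          # 1000 hypothetical scores
noisy = corrupt(clean, p=0.4, rng=rng)              # 40% noise rate, as in the stress test
flip_rate = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
```
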
Paper Title: On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
Welcome fellow Deep Learning enthusiasts to exchange, discuss, and collaborate!