Li Fei-Fei's Team Is Tackling This: From Entropy to Mutual Information, RAGEN-2 Reshapes Reasoning Quality Standards and Keeps AI Agents from Growing More Templated the More They Are Trained

There has been a noticeable shift in the recent AI research community. Researchers are no longer satisfied with merely making large models "say the right thing"; they are now asking how to make them "think the right thing." Especially when Large Language Models (LLMs) are embedded into multi-turn Agent frameworks, the model is no longer outputting answers in a single shot but must observe, think, act, and re-think like a human. Once this process enters the Reinforcement Learning (RL) stage, training becomes a "marathon of reasoning quality."

What Li Fei-Fei's team has recently done is unearth the most hidden and dangerous problem within this marathon.

01 Why RAGEN-2 Deserves a Complete Rewrite

In the past few years, the stability of Agent training has almost entirely relied on two metrics: reward and entropy. Reward represents whether the outcome is good, while entropy represents whether the reasoning process is diverse. The default assumption has been that if these two metrics are stable, the model training is healthy.

The emergence of RAGEN-2 has completely overturned this logic.

The research team tells us: Entropy is actually a very misleading illusion. A model's reasoning process can quietly and systematically collapse while "entropy looks perfectly normal." What you see is the model appearing to "think" seriously, but in reality, it is no longer listening to the input; it is merely repeating a set of fixed templates.

This is the core problem proposed by RAGEN-2: Reasoning Collapse.

To capture this hidden collapse, the research team proposed two key tools. One is the Mutual Information Proxy (MI Proxy), used to judge whether the model's reasoning truly depends on the input. The other is the Signal-to-Noise Ratio View (SNR View), used to explain why RL pushes models toward "templated reasoning."

The team behind this project is also incredibly star-studded. The core team comes from Northwestern University, in collaboration with Stanford (Li Fei-Fei, Yejin Choi, Jiajun Wu), Microsoft, Oxford, Imperial, UIUC, and other institutions.

You can view the complete materials and code on the project homepage here: https://ragen-ai.github.io/v2/

02 What is Reasoning Collapse? Why Wasn't It Discovered Before?

The term "Reasoning Collapse" sounds a bit abstract, but it actually describes a very intuitive phenomenon: the model appears to be thinking seriously, but its thought content has absolutely nothing to do with the input.

It's like asking someone, "How is the weather in Shanghai today?" and they respond every time with, "Let me think this task through step by step." You might feel they are thinking, but in reality, they aren't listening to what you are saying at all.

RAGEN-2 systematically exposes this phenomenon of "fake thinking."

The Blind Spot of Traditional Metrics: Entropy Only Sees "Internal Diversity"

Why didn't anyone discover Reasoning Collapse before? Because everyone has been staring at entropy.

The metric Entropy H(Z|X) can only see "within the same input, is the model's reasoning diverse?" If the model generates many different reasoning chains for the same input, the entropy will be high.

The problem is, entropy has no idea whether these reasoning chains are actually related to the input.

This leads to a very dangerous situation: The model's entropy looks healthy, but its reasoning has completely detached from the input, entering a state of "templated self-talk."

The research team used a very key formula to explain why entropy is insufficient:

H(Z) = I(X;Z) + H(Z|X)

The entropy metric that training dashboards track, H(Z|X), is just the second term on the right. What truly measures "whether reasoning depends on the input" is the Mutual Information I(X;Z).

In other words, high entropy does not mean good reasoning; it might even be masking that reasoning is collapsing.

Definition of Template Collapse: High Entropy + Low Mutual Information

RAGEN-2 names this phenomenon "Template Collapse."

Its characteristics are very distinct: the reasoning chains look rich, but they are almost identical across different inputs. The model seems to have memorized a set of "universal reasoning templates." No matter what you ask, it always starts with: "Let me think step by step…" or "I need to solve this task carefully."

These sentences look like reasoning, but they do not depend on the input at all.

This is not an accident; it is a systematic failure mode of multi-turn Agent RL.

The Four-Quadrant Reasoning State Diagram: Entropy × Mutual Information

The research team divided reasoning states into four categories, which is particularly intuitive.

  • When Entropy is high and Mutual Information is high, the model's reasoning is both diverse and dependent on the input. This is the ideal state.
  • When Entropy is high and Mutual Information is low, it is Template Collapse. The model looks like it's thinking, but it's actually "reciting a script."
  • When Entropy is low and Mutual Information is high, the model's reasoning is very dependent on the input but overly certain, like rote memorization.
  • When Entropy is low and Mutual Information is also low, it is complete degradation; the model is neither diverse nor listening to the input.

Among these four states, the most dangerous one is Template Collapse because it is the easiest for Entropy to "disguise" as a healthy state.

Diagram showing four quadrants of reasoning states based on Entropy and Mutual Information

Figure 1 | Left: Input-driven reasoning adapts to the current state, while templated reasoning produces nearly identical responses across different inputs. Right: Four reasoning regimes described along two axes: conditional entropy H(Z|X) (within-input diversity) and mutual information I(X;Z) (input dependence).

03 RAGEN-2: Reconstructing Reasoning Quality from a Mutual Information Perspective

If the first contribution of RAGEN-2 is "discovering the problem," then the second contribution is "redefining what reasoning quality means." In the past, we relied too much on entropy, thinking that diverse reasoning meant the model was thinking seriously. But RAGEN-2 tells us that diverse reasoning does not equal effective reasoning; it might even be an illusion of reasoning collapsing.

What truly measures reasoning quality is Mutual Information (MI).

This point was clarified in the research using a very classic information theory formula:

H(Z) = I(X;Z) + H(Z|X)

The meaning of this formula is very straightforward. The left side is the total entropy of reasoning, and the right side is divided into two parts.

H(Z|X) represents "diversity within the same input." I(X;Z) represents "whether reasoning truly depends on the input." In the past, everyone only looked at H(Z|X), i.e., "is the reasoning diverse?" But what is truly important is I(X;Z), i.e., "is the reasoning listening to the input?"

It's like watching a student write an essay; writing flowery prose doesn't mean they understand the prompt. MI is the key to judging whether they have actually read and understood the question.
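To make the decomposition concrete, here is a minimal sketch (the toy inputs and reasoning strings are invented for illustration) that estimates I(X;Z) from samples and shows how a templated policy can match an input-driven one on diversity while carrying zero mutual information:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X;Z) in bits from (input, reasoning) samples.

    Plug-in estimator: I(X;Z) = H(X) + H(Z) - H(X,Z).
    """
    n = len(pairs)
    px = Counter(x for x, _ in pairs)
    pz = Counter(z for _, z in pairs)
    pxz = Counter(pairs)

    def H(counts):
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    return H(px) + H(pz) - H(pxz)

# Templated policy: the same two openers, regardless of the input.
templated = [(x, z) for x in ["shanghai_weather", "math_problem"]
                    for z in ["Let me think step by step...",
                              "I need to solve this carefully."] * 2]

# Input-driven policy: reasoning varies with the input.
input_driven = [("shanghai_weather", "Check today's Shanghai forecast."),
                ("shanghai_weather", "Look up Shanghai humidity data."),
                ("math_problem", "Factor the quadratic first."),
                ("math_problem", "Try substitution u = x - 1.")]

print(mutual_information(templated))     # 0.0 bits: diverse Z, independent of X
print(mutual_information(input_driven))  # 1.0 bits: Z identifies its input
```

Both toy policies have the same conditional entropy H(Z|X) of exactly 1 bit per input, yet their mutual information differs by a full bit. That gap is precisely the blind spot described above: an entropy monitor cannot tell these two policies apart.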

RAGEN-2's contribution is pulling MI out of theory and turning it into a metric that can be monitored in real-time during training.

MI Proxy: How to Estimate Mutual Information in Real-Time During Training?

Mutual Information itself is very difficult to compute directly because reasoning chains are high-dimensional discrete sequences. RAGEN-2's brilliance lies in not computing MI exactly, but in designing a set of "Mutual Information Proxy" metrics that can be estimated from data the training process already produces.

The core method is called In-Batch Cross-Scoring.

Simply put, it takes each reasoning chain Zᵢ,k and performs a "matching score" against all inputs Xⱼ to see which input it looks more like it was generated from.

If the reasoning truly depends on the input, then Zᵢ,k will score highest on its own input Xᵢ. If the reasoning has become templated, then it will score similarly across all inputs.

The research team decomposed this score into two quantities: matched, the log-prob of the reasoning chain under its true input; and marginal, the log-prob of the reasoning chain under the mixture of all inputs in the batch.

The difference between these two quantities is the shadow of mutual information.
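A minimal sketch of the in-batch cross-scoring idea (the function name, the synthetic score matrices, and the batch size are illustrative assumptions, not RAGEN-2's exact implementation):

```python
import numpy as np

def cross_scoring_metrics(logp):
    """In-batch cross-scoring sketch.

    logp[i, j] = log-probability of reasoning chain Z_i scored under input X_j,
    produced by the policy itself; no extra model is required.
    """
    B = logp.shape[0]
    matched = np.diag(logp)                          # log p(Z_i | X_i)
    # marginal: log of the mean likelihood of Z_i over all inputs in the batch
    marginal = np.logaddexp.reduce(logp, axis=1) - np.log(B)
    mi_proxy = (matched - marginal).mean()           # matched-vs-marginal gap
    retrieval_acc = (logp.argmax(axis=1) == np.arange(B)).mean()
    return mi_proxy, retrieval_acc

rng = np.random.default_rng(0)
B = 8
# Input-driven: own-input scores clearly higher than cross-input scores.
driven = rng.normal(-10.0, 0.5, (B, B))
np.fill_diagonal(driven, -5.0)
# Templated: scores nearly identical across all inputs.
templated = rng.normal(-10.0, 0.05, (B, B))

print(cross_scoring_metrics(driven))     # high MI proxy, retrieval accuracy 1.0
print(cross_scoring_metrics(templated))  # MI proxy near 0, accuracy near chance (1/B)
```

When reasoning depends on the input, each chain "recognizes" its own prompt; when reasoning is templated, the score matrix flattens and both quantities collapse toward chance.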

Based on this idea, the research team proposed two main metrics:

  • Retrieval-Accuracy: Looks at whether the reasoning chain can "recognize its own input." If the model collapses, this accuracy drops to random levels.
  • MI-ZScore-EMA: Turns matched − marginal into a continuous metric, adding z-score and EMA smoothing. It is more stable and better suited for training monitoring.

Most critically, these metrics require no extra models and no extra inference; they can be calculated during the training process itself.

This transforms MI from a "theoretical concept" into an "engineering-ready monitoring signal."
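A sketch of how such a smoothed monitoring signal might be maintained (the class name, the smoothing constant beta, and the initialization are illustrative assumptions; RAGEN-2's exact implementation is not reproduced here):

```python
class MIZScoreEMA:
    """EMA-smoothed z-score of the per-batch (matched - marginal) gap."""

    def __init__(self, beta=0.9, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.mean = None   # EMA of the gap
        self.var = None    # EMA of its squared deviation

    def update(self, gap):
        """gap: batch mean of matched - marginal log-probs; returns a z-score."""
        if self.mean is None:
            self.mean, self.var = gap, 1.0
            return 0.0
        z = (gap - self.mean) / ((self.var + self.eps) ** 0.5)
        self.mean = self.beta * self.mean + (1 - self.beta) * gap
        self.var = self.beta * self.var + (1 - self.beta) * (gap - self.mean) ** 2
        return z

tracker = MIZScoreEMA()
for _ in range(30):
    z = tracker.update(2.0)   # stable MI gap: z-score hovers near 0
z = tracker.update(0.5)       # sudden drop in the gap: strongly negative z-score
print(z)
```

A sustained negative z-score would flag that reasoning is drifting away from input dependence, even while entropy still looks healthy.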

Strong Correlation Between MI and Task Performance

There is a very shocking discovery in RAGEN-2's experiments.

The correlation between MI and the final task success rate is very high. The correlation between Entropy and task success rate is not only low but can even be negative.

In other words, the higher the entropy, the worse the task performance might be. It's like seeing a person speaking more and more fluently, but the content becomes more and more absurd.

This indicates that entropy is not only unreliable but can also mislead training judgments. MI is the metric that can truly tell you "whether the model is thinking seriously."

What RAGEN-2 has done here is essentially transform "reasoning quality" from a vague concept into a quantifiable, monitorable, and optimizable metric system.

04 The Root Cause of Reasoning Collapse: The SNR (Signal-to-Noise Ratio) Mechanism

If the MI Proxy is the "diagnostic tool," then the SNR theory is the "etiological analysis." The third major contribution of RAGEN-2 is explaining why RL causes model reasoning to collapse.

This part is where the research team shows the most insight.

Diagram of Signal-to-Noise Ratio in RL updates

Figure 2 | Schematic of the Signal-to-Noise Ratio (SNR) in RL updates. Left: the total gradient decomposes into a task gradient (which strengthens as per-input reward variance increases) and a regularization gradient. High reward variance produces strong task gradients and better convergence (high SNR); low reward variance lets regularization gradients dominate, producing unstable updates and input-independent reasoning (low SNR).

Key Finding: Reward Variance Determines Task Gradient Strength

The research team's experimental findings are very clear.

When the reward variance for an input is high, the model can learn useful signals from different trajectories; the task gradient is strong, and reasoning naturally depends on the input.

When the reward variance is low, the model learns almost no useful differences; the task gradient is weak, and the regularization terms (KL + Entropy) become the dominant force.

This leads to reasoning being "pushed toward templating."

High Reward Variance → Strong Task Signal → Reasoning depends on input.
Low Reward Variance → Weak Task Signal → Regularization dominates → Reasoning becomes templated.

This is the fundamental trigger for Reasoning Collapse.

Gradient Decomposition: Task Signal vs. Task Noise vs. Regularization Noise

The research team decomposed the RL gradient into three parts:

g = g_signal + g_task-noise + g_reg

g_signal is the truly useful task signal, g_task-noise is sampling noise, and g_reg is the KL and entropy regularization term.

When reward variance is low, g_signal approaches 0. But g_reg does not decrease at all; it is an input-independent "uniform contraction force."

Thus, g_reg becomes the dominant force, pulling reasoning toward an "input-independent template."

This is why the model exhibits the phenomenon of "looking like it's thinking, but actually reciting templates."

Graph showing gradient norms across reward variance buckets

Figure 3 | Prompts divided into six equal-sized reward variance buckets Q1-Q6. We find: (a) Task gradient norm increases monotonically with bucket RV; (b) When RV approaches 0, task gradient still exists despite carrying almost no useful signal; (c) Regularizer gradient norm (KL+Entropy) is flat across buckets. This directly supports the SNR mechanism under two algorithms.

The Danger of Low Reward Variance

The most dangerous part is that even when reward variance approaches 0, the gradient norm is still not 0.

Because the regularization term is still "pushing hard."

This means the model will continue to update, but the update direction has nothing to do with the task. Reasoning will deviate further and further from the input, becoming more and more templated.

This is the root cause of Reasoning Collapse, and also why Entropy misleads training judgments.

05 The Solution: SNR-Aware Filtering

After RAGEN-2 dug out this hidden problem of "Reasoning Collapse," the next most critical question was how to solve it. The answer given by the research team is very engineering-oriented. They did not introduce complex new models or modify the core structure of RL. Instead, they proposed a lightweight, nearly zero-cost strategy—SNR-Aware Filtering.

Workflow diagram of SNR-Aware Filtering

Figure 4 | SNR-Aware Filtering workflow. In each training iteration: (1) rollout generation collects trajectories; (2) per-prompt reward variance is computed as an SNR proxy; (3) prompts are ranked by RV, the top-p fraction is kept, and policy updates are performed only on this high-signal subset. The filtering loop prevents updates on noisy rollouts and requires no extra models or rollouts beyond standard RL.

The core idea of this method is actually very simple. Since the root cause of reasoning collapse is "low reward variance leading to weak task signals and regularization-dominated updates," we should let the model learn only from samples with "high reward variance." In every training step, only retain those prompts that can truly provide task signals, and filter out those prompts where the reward variance is almost zero and only bring regularization noise.

It's like trying to hear someone speak in a noisy room; you move closer to the person with the clearer voice rather than letting all the noise flood your ears.

What SNR-Aware Filtering does is let the model "get close to the signal and stay away from the noise."

Core Idea

In every batch of data during training, there will be some "high variance, high signal" prompts, and also some "low variance, low signal" prompts. The problem with the latter is that their rewards have almost no difference, causing the task gradient to be nearly zero, but the regularization term is still pushing hard, so the model gets pulled toward "templated reasoning."

The approach of SNR-Aware Filtering is to retain only the top-p fraction of prompts with the highest reward variance at each training step, filtering out all low-variance prompts.

High Variance means High Signal.
Low Variance means High Noise.

Filter out the noise, keep the signal, and the reasoning structure can naturally maintain input dependency.

Method Process

The research team's Figure 4 draws the whole process very clearly, but we can explain it in simpler terms.

At the start of training, the model samples multiple trajectories as usual. Each prompt gets a set of reward values. Then, calculate the reward variance for each prompt. Sort all prompts by variance from high to low. Keep the top-p portion, and discard the rest. Finally, use only these "high-signal prompts" to update the model parameters.

The whole process requires no extra models, no extra inference, and no extra computing power. It is just a "sorting and filtering by signal strength" of the training data.
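The loop can be sketched in a few lines (the function name, the top_p value, and the toy batch are illustrative, not RAGEN-2's exact code):

```python
import numpy as np

def snr_filter(prompt_rewards, top_p=0.5):
    """Keep the top-p fraction of prompts by reward variance.

    prompt_rewards: dict mapping prompt id -> list of rollout rewards.
    Returns the prompt ids to use for the policy update.
    """
    variances = {p: float(np.var(r)) for p, r in prompt_rewards.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    keep = max(1, int(len(ranked) * top_p))
    return ranked[:keep]

batch = {
    "p1": [1.0, 0.0, 1.0, 0.0],   # high variance: informative
    "p2": [1.0, 1.0, 1.0, 1.0],   # zero variance: only regularization noise
    "p3": [0.0, 1.0, 0.0, 0.0],   # moderate variance
    "p4": [0.0, 0.0, 0.0, 0.0],   # zero variance
}
print(snr_filter(batch, top_p=0.5))  # ['p1', 'p3']
```

Only `p1` and `p3` survive; the zero-variance prompts, whose task gradients would be near zero anyway, never reach the update step.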

But the effect is very significant.

Why is it Effective?

The effectiveness of SNR-Aware Filtering comes from a very intuitive mathematical fact.

In the RL gradient decomposition:

g = g_signal + g_task-noise + g_reg

As the gradient decomposition above showed, when reward variance is low, g_signal approaches 0 while g_reg, an input-independent "uniform contraction force," does not shrink at all; it comes to dominate the update and pulls reasoning toward an input-independent template.

The role of SNR-Aware Filtering is to filter out all prompts where g_signal ≈ 0, allowing the model to update only on samples where g_signal is strong enough.

This brings three direct effects:

  • Gradient SNR is significantly improved.
  • Task signals are preserved.
  • Regularization noise is suppressed.

The final result is that the model's reasoning becomes "listening to the input" again, Mutual Information (MI) rises, and templated reasoning is suppressed.

This is a very "engineering-friendly" solution. It doesn't require changing the model or the algorithm, only the way training data is selected.

06 Experiments: Validation Across Tasks, Algorithms, and Scales

The experimental part of RAGEN-2 is very solid, covering seven environments, four types of RL algorithms, and multiple model scales. Rather than demonstrating the method on a single toy task, it validates the universality of reasoning collapse and the effectiveness of SNR Filtering across a set of real, multi-modal, multi-turn decision environments.

Graphs showing training dynamics under different intervention strategies

Figure 5 | Training dynamics under different intervention strategies: (a) task success rate, (b) MI proxy (retrieval accuracy), and (c) reasoning entropy. Without filtering, MI degrades early, near the entropy peak, signaling template collapse. Filtering mitigates the drop in retrieval accuracy, with top-p SNR-Aware Filtering best preserving task performance and reasoning diversity.

Seven Environments Covering Multi-Modal, Multi-Task, and Multi-Decision Types

The seven environments selected by the research team are very representative, covering almost all key scenarios in current Agent research.

  • Sokoban: Irreversible planning task, testing the model's long-term reasoning ability.
  • FrozenLake: Stochastic navigation task, testing strategy stability in uncertain environments.
  • MetaMathQA: Mathematical reasoning task, testing the model's symbolic reasoning ability.
  • Countdown: Arithmetic construction task, testing the model's combinatorial reasoning ability.
  • SearchQA: Multi-turn retrieval task, testing the model's information integration ability.
  • WebShop: Web navigation task, testing the model's tool usage and decision-making ability.
  • DeepCoder: Code synthesis task, testing the model's programmatic reasoning ability.

The commonality of these tasks is that they all require the model to maintain a stable, input-dependent reasoning structure during multi-turn interactions.

RAGEN-2's experiments show that Reasoning Collapse is prevalent across these tasks, and SNR Filtering is effective in all of them.

Key Experimental Phenomena

The three most important phenomena in the experiments are worth emphasizing.

  1. The decline in Mutual Information (MI) happens earlier than the decline in performance; it is a more sensitive diagnostic indicator.
  2. Entropy remains high during the collapse process, completely failing to reflect the problem.
  3. SNR Filtering significantly improves both MI and task success rates.

This indicates that the MI Proxy is not just a "good-looking metric," but a signal that can truly warn of reasoning collapse in advance.

And SNR Filtering is a solution that can truly stop the collapse and restore reasoning quality.

Consistency Across Different RL Algorithms

The research team also validated the universality of Reasoning Collapse on four RL algorithms: PPO, GRPO, DAPO, and Dr.GRPO.

The results are very consistent.

Reasoning Collapse is an algorithm-independent systematic problem, and SNR Filtering is a universal solution.

This means Reasoning Collapse is not a bug of a specific algorithm, but a structural risk of multi-turn Agent RL.

And SNR Filtering is a structural repair.

Comparison of filtering strategies showing Top-p outperforming Top-k and baselines

Figure 6 | Showing that the Top-p filtering strategy consistently outperforms Top-k and the no-filter baseline across four environments.

07 A New Paradigm for Agentic RL

The significance of RAGEN-2 goes far beyond proposing a new metric or a new trick. It actually reshapes how we understand Agent reasoning quality and also reshapes the paradigm of how we train Agents.

RAGEN-2 shifts the measurement of reasoning quality from "Entropy" to "Mutual Information." It shifts the understanding of RL training stability from "Reward" to "SNR." It transforms Reasoning Collapse from a vague phenomenon into an explainable, diagnosable, and intervenable mechanism.

This provides a new theoretical framework for future Agentic RL.

  • MI Proxy can be directly integrated into existing RLHF, GRPO, and PPO training pipelines.
  • SNR Filtering is a lightweight, nearly zero-cost enhancement method.
  • It has value for Multi-modal Agents, Tool-use Agents, and Web Agents.

This means RAGEN-2's method is not something that "only runs inside research labs"; it can be deployed directly in real systems.

The core issue in the Agent era is not "model capability," but "reasoning stability." RAGEN-2 provides new standards for stability assessment and training. It has a direct impact on the productization of AI Agents.

Future Agent systems will no longer just compete on who can call more tools or execute more steps, but on who can maintain a stable, reliable, and input-dependent thinking structure during multi-turn reasoning.

RAGEN-2 gives us a set of methods to make this stability controllable. (END)

Reference Materials: https://arxiv.org/pdf/2604.06268

