Xiaomi Introduces JudgeRLVR: Judge First, Generate Second — Breaking the Efficiency Paradox of "Long Chain-of-Thought" in Reasoning Models

Paper Title: JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Paper Link: https://arxiv.org/pdf/2601.08468

TL;DR

Core Problem: Existing Reinforcement Learning with Verifiable Rewards (RLVR) tends to induce models to generate lengthy Chains-of-Thought (CoT) filled with trial-and-error and backtracking, leading to inefficient reasoning and insufficient information density. Although heuristic length penalties can mitigate this issue, they often compromise accuracy.

Solution: Propose JudgeRLVR, a two-stage training paradigm of "Judge First, Generate Second". The first stage trains the model to distinguish between correct and incorrect solution processes (discriminative capability); the second stage initializes the policy model with the discriminative model for standard RLVR fine-tuning (generative capability).

Main Conclusion: Experiments on the Qwen3-30B-A3B model show that JudgeRLVR improves average accuracy by 3.7 percentage points on in-domain mathematical tasks while reducing average generation length by 42%; it demonstrates stronger generalization capabilities on out-of-domain tasks. The method encourages the model to internalize external "trial-and-error" into internal "discrimination", thereby generating more direct and efficient reasoning paths.

1. Background

In the evolution of reasoning capabilities in Large Language Models (LLMs), Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for improving models' ability to solve complex mathematical and coding problems (e.g., DeepSeek-R1). RLVR incentivizes models to explore solution strategies beyond the Supervised Fine-Tuning (SFT) data through sparse but objective supervision signals (i.e., whether the final answer is correct).

However, RLVR introduces a significant side effect: the "disordered expansion" of the Chain-of-Thought.

1.1 Blind Spots of Result-Oriented Optimization

Since RLVR primarily optimizes the correctness of the final answer, models often tend to adopt a "generative search" strategy: enumerating a large number of possible attempt branches, constantly correcting intermediate steps, and performing explicit self-corrections to "stumble upon" the correct answer. This behavior pattern leads to two main issues:

1. Reasoning Redundancy and Inefficiency: The generated Chain-of-Thought (CoT) trajectories are extremely long, filled with backtracking and self-negation, e.g., "Let me try again", "This doesn't seem right, verify via...". While this ensures accuracy, it significantly increases computational overhead (token consumption) during inference.

2. Low Information Density: Long output does not equate to high-quality reasoning. Existing studies (e.g., Kimi k1.5, DAPO) attempt to introduce Length Penalty to suppress Token count, but this often creates an irreconcilable trade-off: shortening the length usually truncates key reasoning steps, thereby reducing accuracy.

1.2 Inspiration from Cognitive Science

The authors drew inspiration from cognitive science (Chi et al., 1981): The difference between experts and novices lies not in whether they search, but in where the search occurs.

Novices: Tend to perform externalized trial-and-error, writing all attempt paths on paper (or generating them in the context).

Experts: Possess the ability of "early discrimination and pruning", able to identify and prune low-value paths before the thought unfolds, thus outputting only high-value reasoning processes.

Based on this, the authors hypothesize: Discriminative Capability is a prerequisite for efficient generation. Only when the model learns to distinguish between "good reasoning" and "bad reasoning" can it internalize this guidance signal during the generation stage, thereby spontaneously pruning the search space without relying on explicit length penalties.

2. JudgeRLVR Two-Stage Paradigm

JudgeRLVR decomposes the training of reasoning strategies into two sequential stages: the Judging Stage and the Generating Stage.

Figure 2 JudgeRLVR Two-Stage Training Pipeline

2.1 Symbol Definitions

Problem q ∈ Q, with ground-truth answer y*.

Response: a token sequence containing the reasoning process and ending with a final answer.

Predicted answer y^, extracted from the response by a deterministic parser.

Correctness label c ∈ {0, 1}: c = 1 if and only if y^ = y*.
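To make the notation concrete, here is a minimal sketch of the parsing and labeling step. The \boxed{...} convention and the function names are assumptions for illustration; the paper's actual parser may differ.

```python
import re

def extract_answer(response: str):
    """Deterministic parser: pull the final answer y^ from a response.

    Assumes answers appear as \\boxed{...}, a common math-benchmark
    convention (an assumption, not specified in the paper).
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def correctness_label(response: str, gold: str) -> int:
    """Correctness label c: 1 if and only if y^ equals y*."""
    y_hat = extract_answer(response)
    return int(y_hat is not None and y_hat == gold)
```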

2.2 First Stage: Judging Stage

The goal of this stage is to train the model as a "Judge", enabling it to identify incorrect paths.

Data Construction: Hard Negative Mining

To train a high-quality discriminator, data construction is crucial. The authors adopted the following strategies:

1. Rollout Generation: For each problem, use multiple models (e.g., MiMo-7B RL and target model Qwen3-30B-A3B-SFT) to generate a set of candidate responses.

2. Hard Negative Mining: Prioritize "medium difficulty" problems whose pass rate is neither 0 nor 1. Incorrect responses to such problems are usually "almost correct", making them more valuable for discriminative training than purely random errors.

3. Class Balance: Down-sample positive and negative samples to prevent the model from learning class prior biases.
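The three steps above can be sketched as a single filtering pass over pooled rollouts. Function names, variable names, and the sampling cap are illustrative; the paper does not publish its exact selection code.

```python
import random

def build_judge_data(rollouts_by_problem, max_per_class=2):
    """Hard-negative mining with class balancing, as a minimal sketch.

    `rollouts_by_problem` maps a problem to (response, correct) pairs
    pooled from several generator models.
    """
    pairs = []
    for problem, rollouts in rollouts_by_problem.items():
        pass_rate = sum(c for _, c in rollouts) / len(rollouts)
        # Keep only medium-difficulty problems (pass rate strictly between
        # 0 and 1): their wrong responses are near-misses, not noise.
        if pass_rate in (0.0, 1.0):
            continue
        pos = [r for r, c in rollouts if c == 1]
        neg = [r for r, c in rollouts if c == 0]
        # Down-sample both classes to the same size to avoid prior bias.
        k = min(len(pos), len(neg), max_per_class)
        pairs += [(problem, r, 1) for r in random.sample(pos, k)]
        pairs += [(problem, r, 0) for r in random.sample(neg, k)]
    return pairs
```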

Training Objective

The model receives a problem and a candidate response, outputting two parts:

1. Critique/Commentary: Contains an analysis of the reasoning process.

2. Verdict Token: 0 for incorrect, 1 for correct.

The reward function is defined as whether the verdict matches the true correctness label.

The policy network learns the conditional probability of the critique and verdict given the problem and the candidate response.

The key here is that the model must not only learn to "solve problems" but also learn to "examine problems" and "spot errors". This training method forces the model to establish an internal evaluation standard for the rigor of reasoning logic.
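The stage-1 reward can be sketched as a simple verdict match. The "Verdict:" output format below is an assumption made for illustration; the paper only specifies that the reward is 1 when the verdict agrees with the true label.

```python
def judge_reward(judge_output: str, true_label: int) -> float:
    """Stage-1 reward: 1 if the final verdict token matches the
    ground-truth correctness label c, else 0.

    Assumes the critique ends with a line like "Verdict: 1"
    (an illustrative format, not the paper's exact template).
    """
    last_line = judge_output.strip().splitlines()[-1]
    if "1" in last_line:
        verdict = 1
    elif "0" in last_line:
        verdict = 0
    else:
        return 0.0  # unparseable verdict earns no reward
    return float(verdict == true_label)
```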

2.3 Second Stage: Generating Stage

This stage returns to the standard Vanilla RLVR setting, but the key lies in the initialization.

Initialization: The policy model is initialized using the weights of the discriminative model trained in the first stage.

Training Process: Given a problem, the model generates a chain-of-thought and an answer.

Reward Signal: Only uses the sparse binary final answer correctness reward.

Mechanism Hypothesis

The authors hypothesize that this two-stage training improves reasoning quality through two mechanisms:

1. Style Transfer: The training in the discriminative stage changes the model's language style, making it lean towards more objective and prudent expressions.

2. Reduced Backtracking: The model activates the internalized discriminative mode during the generation stage, completing the originally explicit "verification-correction" process in the latent space (internal hidden states), which manifests as fewer backtracking words in the generated text.

3. Experimental Setup

To verify the effectiveness of the paradigm, the authors conducted extensive tests on mathematical reasoning and general capability benchmarks.

3.1 Models and Algorithms

Base Model: Qwen3-30B-A3B (MoE architecture), given basic SFT to acquire instruction-following ability.

Training Algorithm: DAPO (Yu et al., 2025), a policy gradient method belonging to the GRPO (Group Relative Policy Optimization) family.

Training Hyperparameters:

Rollout size: 16.

Dynamic sampling: filter out groups whose rollouts are all correct or all incorrect.

Learning rate: 1e-6.

Maximum generation length: 65,536 tokens (to support long CoT).
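The interaction between dynamic sampling and the group-relative objective can be sketched as below. This is a generic GRPO-family advantage computation, not DAPO's full recipe (DAPO adds further refinements such as decoupled clipping).

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages in the GRPO/DAPO family: each rollout's
    reward minus the group mean, scaled by the group's (population) std.

    Returns None for groups that dynamic sampling would discard
    (all-correct or all-incorrect, where the gradient signal vanishes).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return None  # degenerate group: no learning signal
    return [(r - mean) / std for r in rewards]
```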

3.2 Evaluation Benchmarks

In-Domain Math: AIME24, AIME25, MATH500, HMMT_feb_2025, BeyondAIME.

Out-of-Domain Generalization:

GPQA Diamond (Scientific Reasoning)

IFEval (Instruction Following)

LiveCodeBench v6 (Code)

MMLU-Redux (General Knowledge)

ZebraLogic (Logical Reasoning)

3.3 Comparison Baselines

1. Base SFT: The base model without RL training.

2. Vanilla RLVR: Single-stage training using only final answer rewards (total 250 steps).

3. JudgeRLVR: Judge first (145 steps), then generate (105 steps); total steps match Vanilla RLVR for a fair comparison.

4. Main Experimental Results Analysis

Table 1 Comparison of Main Results for Base SFT, Vanilla RLVR, and JudgeRLVR (Sequential)

4.1 In-Domain Math: Dual Leap in Quality and Efficiency

On high-difficulty math benchmarks like AIME and HMMT, JudgeRLVR demonstrated significant advantages:

Accuracy Improvement: Compared to Vanilla RLVR, JudgeRLVR achieved positive gains on all math leaderboards. For example, +9.2 percentage points on HMMT_feb_2025 and +2.7 percentage points on AIME24.

Significant Length Reduction: This is the most striking result. On AIME24, the average generation length dropped from 21.8k to 12.9k tokens (-41%); on MATH500, it decreased by 71%.

Conclusion: This directly verifies that the "discriminative prior" can effectively prune invalid search branches. In contrast, Vanilla RLVR relies on "piling up length" for marginal accuracy gains, with reasoning processes filled with redundancy.

4.2 Out-of-Domain Generalization: Capability Transfer

In non-mathematical tasks, JudgeRLVR also performed well:

GPQA Diamond: accuracy up 5.2 points, length down 7.5%, showing that scientific reasoning also benefits from more rigorous discriminative capability.

Code Tasks (LiveCodeBench): accuracy up 5.7 points, length down 18%. Code generation demands precise logical planning, which discriminative training clearly aids.

Instruction Following (IFEval): accuracy up 6.5 points but, interestingly, length up 12%. This suggests that for tasks with strict format constraints, the model learned to ensure compliance through more detailed checking rather than blind trial-and-error.

Overall, JudgeRLVR improved by an average of +4.5 percentage points on out-of-domain tasks, proving that the paradigm learns a general "high-quality thinking pattern" rather than merely fitting math problems.

5. Ablation Studies and Mechanism Analysis

To investigate the source of gains, the authors designed two important ablation experiments.

Table 2 Comparison of JudgeRLVR, Judge Only, and Mixed Strategy

5.1 Why Not Just Discrimination?

What if only the first stage of discriminative training (Judge Only) is performed?

Result: Compared to JudgeRLVR, Judge Only had lower accuracy on all math tasks, and the generation length significantly increased (e.g., length increased by 74% on AIME24).

Analysis: This indicates that discriminative training itself does not automatically translate into concise generation strategies. Conversely, a pure "critic" model may become overly cautious and verbose, tending to repeatedly dwell on checking processes in the output. The generation stage (RLVR) is essential; it is responsible for converting this sensitivity to errors into efficient path selection strategies.

5.2 Why Must It Be Two Stages?

What if the discrimination task and generation task are mixed and trained in parallel (Mixed Strategy)?

Result: Unstable performance. Although close to JudgeRLVR on some tasks, it regressed significantly on IFEval and code tasks, and generation length was generally longer.

Analysis: Mixed training causes the model to optimize two different objectives (discrimination vs generation) in the same stage, which hinders the formation of a clear internal decision-making process. The sequentially executed strategy (learn to discriminate first, then learn to generate) better fits the "learn to walk before you run" learning curve.

6. What Did the Model Actually Learn?

The authors revealed evidence of JudgeRLVR changing the model's thinking pattern through qualitative and quantitative analysis.

6.1 Style Transfer (Perplexity Analysis)

Figure 3 Perplexity (PPL) Change of Base SFT During Training

The authors used the Base SFT model as a probe to calculate the perplexity (PPL) of the model output during training.

Vanilla RLVR: PPL remains flat, indicating its output style differs little from Base SFT.

JudgeRLVR (First Stage): PPL rises significantly. This indicates that discriminative training drastically changed the model's language distribution, introducing a "judge style" different from the original SFT. This style bias (Inductive Bias) lays the foundation for efficient generation in the second stage.
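The probe computation itself is simple: score each generated token under the frozen Base SFT model and exponentiate the mean negative log-likelihood. A minimal sketch, assuming per-token log-probabilities have already been extracted from the probe model:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood) of a token sequence under
    a frozen probe model (the Base SFT checkpoint in the paper's setup).

    Higher PPL on the trained model's outputs indicates that its language
    style has drifted away from the probe's distribution.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```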

6.2 Reduced Explicit Backtracking

Figure 4 Frequency Change of Transition/Backtracking Words During Training

The authors counted the frequency of transition words (e.g., but, however, wait, actually, etc.) in the generated text.

Result: During the generation stage training of JudgeRLVR, both the absolute number and relative frequency of these words showed a significant downward trend.

Interpretation: This provides strong linguistic evidence that the model no longer relies on explicit "write out errors then correct", but has learned to perform implicit prediction and pruning before the chain-of-thought unfolds.
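This kind of lexical statistic is straightforward to reproduce. A sketch of the counting procedure, using an illustrative word list (the paper's exact lexicon may differ):

```python
import re
from collections import Counter

# Illustrative lexicon of transition/backtracking markers; an assumption,
# not the paper's exact word list.
BACKTRACK_WORDS = {"but", "however", "wait", "actually", "alternatively"}

def backtracking_stats(text: str):
    """Absolute count and per-1000-token frequency of explicit
    backtracking markers in a generated chain-of-thought."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter(t for t in tokens if t in BACKTRACK_WORDS)
    total = sum(hits.values())
    return total, 1000 * total / max(len(tokens), 1)
```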

7. Case Study: Qualitative Change in Chain-of-Thought

The paper intuitively demonstrates the difference between the two paradigms through a specific coordinate transformation problem (Cartesian to Polar coordinates).

Figure 1 Reasoning Pattern Comparison: Vanilla RLVR vs JudgeRLVR

Vanilla RLVR's Chain-of-Thought:

Filled with hesitation and repeated verification: "Let me confirm...", "Just to be thorough...", "But here...".

It even repeatedly confirms very basic facts, appearing to lack confidence.

A large share of tokens is spent describing mental activity rather than solution logic.

JudgeRLVR's Chain-of-Thought:

Gets straight to the point, directly listing formulas.

Calculation process advances linearly, no redundant branches.

Directly outputs the answer after deriving the result, no excessive self-doubt.

Result: clear logic, only one-third the length of the Vanilla RLVR output, and a correct answer.

8. In-depth Discussion and Outlook

8.1 New Solution to Efficiency vs Quality Trade-off

For a long time, there has been a misconception in the RLVR field that longer CoT inevitably leads to better performance (Test-time Compute Scaling). JudgeRLVR challenges this view, pointing out that many current long CoTs are actually inefficient "pseudo-reasoning". By increasing the information density of Tokens, we can achieve higher accuracy at shorter lengths. This has significant implications for reducing LLM inference costs.

8.2 Relationship with Process Reward Model (PRM)

The first stage of JudgeRLVR can be seen as a form of implicit PRM training, but it does not require expensive step-by-step annotated data. By constructing a full-sequence discrimination task (distinguishing Good/Bad Response), it allows the model to learn the perception of process quality by itself. This provides a new path for improving reasoning capabilities in scenarios lacking fine-grained annotations.

8.3 Limitations

Although JudgeRLVR performs well on mathematical and logical tasks, for tasks requiring high creativity or divergent thinking (e.g., creative writing), will premature "pruning" suppress diversity? This point still needs further exploration.

For more details, please read the original paper.
