Author: Wu Jiayun, PhD student in the Machine Learning Department at Carnegie Mellon University (CMU), researching evaluation and post-training of large language models, including model reasoning, model hallucination, and active evaluation.
Hallucination in Large Language Models (LLMs) has long been a core obstacle preventing their deployment in critical domains. Recently, researchers proposed a new method called Behaviorally Calibrated Reinforcement Learning. By redesigning the reward function, this approach enables models to learn "to know what one knows and to admit what one does not know."
Paper link: https://arxiv.org/abs/2512.19920
After training with this method, a model with only 4 billion parameters achieved hallucination suppression capabilities surpassing those of frontier large models like GPT-5.
Figure 1: Example of confidence annotation in model output when answering math problems. Each statement is accompanied by a confidence score and rationale.
Core Question: Why Do LLMs Hallucinate?
The research team points out that the current mainstream post-training paradigm for large models, Reinforcement Learning with Verifiable Rewards (RLVR), suffers from a fundamental reward misalignment. In standard RLVR, the reward function is typically binary: +1 for a correct answer and -1 for an incorrect one, with a refusal to answer penalized like a wrong answer. Under this mechanism, whenever the probability of being correct is greater than zero, a utility-maximizing agent is better off guessing than abstaining, so the model learns to suppress expressions of uncertainty and disguise guesses as facts. The model is trained to be an "excellent test-taker" that guesses to maximize expected score, rather than an "honest communicator" that abstains when confidence is low.
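This incentive problem can be made concrete with a few lines of arithmetic (a minimal sketch, not the paper's code; it assumes refusal is scored like a wrong answer, as described above):

```python
# Minimal sketch of the RLVR incentive problem: under binary scoring
# (+1 correct, -1 incorrect, refusal penalized like a wrong answer),
# guessing always has at least the expected reward of abstaining.

def expected_reward(p_correct: float, answer: bool) -> float:
    """Expected binary-RLVR reward for answering vs. refusing."""
    if not answer:
        return -1.0                   # refusal scored like an error
    return 2.0 * p_correct - 1.0      # p*(+1) + (1 - p)*(-1)

# Even a 1%-confident guess has a higher expected reward than refusing:
assert expected_reward(0.01, answer=True) > expected_reward(0.01, answer=False)
```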
Solution: Behaviorally Calibrated Reinforcement Learning
To address the above issues, the research team proposed a solution based on behavioral calibration. The core idea is that a trustworthy model should dynamically adjust its refusal behavior according to a user-specified risk threshold t ∈ [0, 1]:
When t = 0, the model operates in "test-taker mode," attempting to answer every question;
When t = 1, the model operates in "fully honest mode," answering only when absolutely certain;
In general, the model outputs a substantive answer if and only if its confidence c satisfies c ≥ t; otherwise, it outputs <IDK>.
To achieve this goal, the research team designed two strategies:
Strategy 1: Verbalized Confidence
This strategy trains the model to explicitly output a scalar confidence score c alongside its answer. When the model's confidence c is below the user's risk threshold t, it refuses to answer, and rewards are assigned as follows:
Correct answer: +1 point;
Incorrect answer: -1 point;
Refusal to answer: 2t - 1 points.
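This reward scheme can be sketched in a few lines (a reconstruction, not released code; the refusal reward 2t - 1 is the value that makes "answer iff c ≥ t" incentive-compatible, since a model with confidence exactly t is then indifferent between answering and refusing):

```python
# Sketch (our reconstruction) of the Strategy-1 reward: answer iff
# confidence >= t; refusing earns 2t - 1, the expected reward of
# answering at confidence exactly t, so the threshold rule is optimal.

def strategy1_reward(confidence: float, correct: bool, t: float) -> float:
    if confidence < t:                # below the risk threshold: refuse
        return 2.0 * t - 1.0
    return 1.0 if correct else -1.0   # answered: standard binary reward

# At confidence c == t, the expected answering reward 2c - 1 equals
# the refusal reward 2t - 1: the agent is exactly indifferent.
t = 0.7
expected_answering = 2.0 * t - 1.0
assert abs(expected_answering - strategy1_reward(0.0, True, t)) < 1e-12
```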
The research team integrates the reward function over different user risk preferences t, transforming the training objective from conditional optimization with an explicit risk threshold into optimizing a proper scoring rule for Verbalized Confidence.
For uniformly distributed risk preferences t ~ U[0, 1], the derived reward function resembles the Brier score:

R(c, y) = y - (c - y)^2, where y ∈ {0, 1} indicates correctness.

This reward decomposes into the difference between the correctness reward y and the Brier score (c - y)^2 for confidence calibration, incentivizing the model to maximize prediction accuracy while calibrating its stated confidence.
For a general cumulative distribution function F of risk preferences t, the reward function takes the general form:

R(c, y) = (2y - 1) F(c) + ∫_c^1 (2t - 1) dF(t)
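The uniform-distribution case can be checked numerically (a sketch under the same assumptions, including the refusal reward 2t - 1; not the paper's code): averaging the per-threshold reward over t ~ U[0, 1] should reproduce the Brier-style reward y - (c - y)^2.

```python
# Numerically verify that integrating the threshold-conditional reward
# over t ~ U[0,1] gives R(c, y) = y - (c - y)^2.

def reward_at_threshold(c: float, y: int, t: float) -> float:
    if c >= t:                       # the model answers
        return 1.0 if y == 1 else -1.0
    return 2.0 * t - 1.0             # the model refuses

def integrated_reward(c: float, y: int, n: int = 50_000) -> float:
    ts = [(i + 0.5) / n for i in range(n)]            # midpoint rule
    return sum(reward_at_threshold(c, y, t) for t in ts) / n

for c in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(integrated_reward(c, y) - (y - (c - y) ** 2)) < 1e-3
```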
Strategy 2: Critic Value Function
As an alternative to explicitly generating confidence, this strategy uses the value function of the Critic network in the PPO algorithm as an implicit confidence estimator. Theoretically, the Critic network is trained by minimizing the Brier score between predicted values and policy returns, causing its value function to converge to the probability of success.
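The convergence claim can be illustrated with a toy calculation (a sketch, not the paper's training loop): the scalar value minimizing the mean squared (Brier) error against binary returns is simply their empirical success rate.

```python
import random

# Toy illustration: a value fit by minimizing the Brier score against
# binary returns converges to the success probability, which is why a
# PPO critic trained this way can double as a confidence estimator.

random.seed(0)
p_success = 0.3
returns = [1.0 if random.random() < p_success else 0.0 for _ in range(100_000)]

# argmin_v mean((v - r)^2) is the sample mean of the returns.
v_star = sum(returns) / len(returns)
assert abs(v_star - p_success) < 0.01   # close to the true success rate
```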
Statement-Level Behavioral Calibration: Fine-Grained "Uncertainty" Annotation
The research team further extended behavioral calibration from the response level to the statement level, enabling the model to precisely annotate individual uncertain reasoning steps within an answer rather than simply rejecting the entire response. This extension faces three major challenges:
Challenge 1: Coherence. Directly replacing uncertain statements with "<IDK>" may disrupt reasoning coherence—for example, in math problems, subsequent steps often depend on previous conclusions. The team chose to have the model output the full response while using HTML tags to visually highlight uncertain statements.
Challenge 2: Ambiguity of Intermediate Steps. In Chain-of-Thought (CoT) reasoning, the correctness and confidence of intermediate steps are inherently ambiguous: a step might correctly identify an error in a previous statement. Therefore, the team ignores the intermediate reasoning process and performs calibration only on the final structured steps.
Challenge 3: Lack of Fine-Grained Labels. Statement-level correctness labels are difficult to obtain. The team designed a learning objective based on weak supervision: aggregating statement-level confidence into response-level confidence and then training using Brier score rewards.
Specifically, for a response containing n statements with statement-level confidences c_1, …, c_n, the team explored two aggregation methods:
Product Aggregation: c = c_1 · c_2 · … · c_n, which assumes independence among statements; the final result is correct if and only if every statement is correct.
Minimum Aggregation: c = min(c_1, …, c_n), where the overall confidence is determined by the least confident step, forcing the model to assign low confidence to the most error-prone steps.
Experiments found that minimum aggregation performed better in statement-level evaluation, as it more effectively incentivizes the model to identify weak links in the reasoning chain. While product aggregation is more suitable for response-level calibration, it may lead to overly optimistic confidence scores for individual statements.
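The two aggregation rules above can be sketched directly (a minimal illustration, not released code):

```python
import math

# Sketch of the two statement-level aggregation rules: the product rule
# assumes independent statements; the min rule tracks the weakest step.

def product_aggregate(confidences: list[float]) -> float:
    return math.prod(confidences)

def min_aggregate(confidences: list[float]) -> float:
    return min(confidences)

steps = [0.99, 0.95, 0.60, 0.98]        # one shaky reasoning step
assert min_aggregate(steps) == 0.60     # pinned to the weakest step
assert product_aggregate(steps) < 0.60  # product also penalizes length
```

Note how the product shrinks with every additional statement, while the minimum isolates the single weakest link, matching the text's observation that the min rule better surfaces error-prone steps.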
Experimental Results
The research team evaluated the method on several benchmarks, including the highly challenging math reasoning benchmark BeyondAIME released by ByteDance's Seed team, as well as AIME-2024/2025 and SimpleQA (a cross-domain factual QA benchmark).
Core Evaluation Metrics
Signal-to-Noise Ratio Gain (SNR Gain): Given a risk threshold t, the signal-to-noise ratio is defined as the ratio of correct responses to hallucinated responses in the model's output, i.e., SNR = N_correct / N_hallucinated. A higher SNR indicates that the model produces far more correct answers than incorrect ones. SNR Gain is the increase in SNR relative to always answering, averaged over risk thresholds t from 0 to 1.
Confidence AUC: This metric ranks correct and incorrect answers using the model's confidence scores and calculates the area under the ROC curve. An AUC closer to 1 indicates that the model accurately assigns high confidence to correct answers and low confidence to incorrect ones. This is a pure measure of the model's "self-awareness," unaffected by the model's inherent capability strength.
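Both metrics are straightforward to implement (a sketch of our reading of the definitions above, not the paper's evaluation code; "hallucinated" means a substantive but incorrect answer, and refusals count in neither bucket):

```python
# Sketch implementations of the two evaluation metrics described above.

def snr(n_correct: int, n_hallucinated: int) -> float:
    """Ratio of correct to hallucinated responses."""
    return n_correct / n_hallucinated if n_hallucinated else float("inf")

def confidence_auc(confs: list[float], correct: list[bool]) -> float:
    """P(random correct answer outranks a random incorrect one) = ROC AUC."""
    pos = [c for c, y in zip(confs, correct) if y]
    neg = [c for c, y in zip(confs, correct) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfectly self-aware model ranks all correct answers above all errors:
assert confidence_auc([0.9, 0.8, 0.3, 0.2], [True, True, False, False]) == 1.0
```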
Response-Level Evaluation: Surpassing GPT-5
Response-level evaluation results on BeyondAIME (Table 1) show that the proposed method significantly outperforms models like Qwen3-max, Kimi-K2, Gemini-2.5-Pro, and GPT-5. Specifically, the 4-billion-parameter model using Verbalized Confidence and confidence product aggregation (Qwen3-4B-Instruct-confidence-prod) achieved an SNR Gain of 0.806, substantially surpassing GPT-5's 0.207. The model using the Critic value function (Qwen3-4B-Instruct-ppo-value) also achieved comparable results.
Table 1: BeyondAIME response-level evaluation results. SNR Gain and Conf AUC are key metrics for measuring hallucination suppression; higher values indicate more effective hallucination suppression.
Statement-Level Evaluation: Surpassing Gemini-2.5-Pro
The research team also extended behavioral calibration from the response level to the statement level, allowing the model to precisely annotate individual uncertain reasoning steps. In the statement-level evaluation on BeyondAIME (Table 2), the confidence minimum aggregation method achieved an SNR Gain of 0.301, significantly outperforming Gemini-2.5-Pro's 0.019.
Table 2: BeyondAIME statement-level evaluation results. The minimum aggregation method significantly leads frontier models on both core metrics: SNR Gain and Conf AUC.
Confidence Calibration Plot: Most Frontier Models Lack "Self-Awareness"
The Confidence Calibration Plot (Reliability Diagram) is a crucial visualization tool for evaluating a model's "self-awareness." The dashed line represents perfect calibration, where the model's stated confidence equals its actual accuracy. As clearly seen in Figure 2, the calibration curves of frontier models (including Gemini-2.5-Pro, Qwen3-Max, etc.) are almost horizontal lines. This means that regardless of how "confident" the model claims to be, its actual accuracy remains at a similar level. This indicates that these models lack the ability to distinguish between correct and incorrect answers. Only GPT-5 and o4-mini output confidence scores with practical meaning. In contrast, models trained with behavioral calibration (Figure 3) exhibit ideal calibration characteristics. The monotonically increasing calibration curve proves that the model has learned to honestly express its uncertainty.
Figure 2: Response-level confidence calibration plot of frontier models on BeyondAIME. It can be observed that the accuracy of many models is a horizontal line, showing almost no correlation with their stated confidence.
Figure 3: Confidence calibration plot of the study's model on BeyondAIME. After behavioral calibration training, the model's accuracy shows a strong positive correlation with its stated confidence. Base and Base-ppo serve as baselines.
Four Goals of Behavioral Calibration
Figure 4: Curves showing changes in accuracy, refusal rate, and hallucination rate under different risk thresholds. The green area represents accuracy, the yellow area represents the refusal rate, and the red area represents the hallucination rate. As the risk threshold t increases, the model gradually transitions from "test-taker mode" to "fully honest mode."
The system designed by the research team meets four goals of behavioral calibration:
Goal 1: Adaptive Risk. The model automatically adjusts its refusal strategy based on the user-specified risk threshold t. As observed in Figure 4, as t increases, the hallucination rate (red area) drops rapidly. Unlike the "convex" refusal curves of frontier models and baseline PPO models, the "concave" refusal curve of this study's model indicates that it adapts to risk changes faster, effectively reducing hallucinations even at lower risk thresholds.
Goal 2: Accuracy Maintenance. In the t = 0 (no refusal) mode, the calibrated model's accuracy is comparable to or even better than the standard RL fine-tuning baseline.
Goal 3: Hallucination Reduction. As the risk threshold t increases, the hallucination rate decreases monotonically. When t = 1 (fully honest mode), the hallucination rate drops to nearly zero. Simultaneously, the signal-to-noise ratio (the ratio of the green area to the red area) increases significantly.
Goal 4: Quantitative Calibration. The model satisfies two quantitative constraints:
True Positive Rate (TP): Among the questions the model chooses to answer, the proportion answered correctly is not lower than the risk threshold t.
False Negative Rate (FN): Among the questions the model chooses to refuse, the proportion that could have been answered correctly does not exceed t.
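These two constraints can be checked mechanically (a sketch of our reconstruction of the Goal-4 checks, not the paper's code):

```python
# Sketch of the Goal-4 checks: among answered questions, accuracy should
# be >= t; among refusals, the fraction that would have been correct
# should be <= t.

def calibration_constraints(confs, correct, t):
    answered = [y for c, y in zip(confs, correct) if c >= t]
    refused  = [y for c, y in zip(confs, correct) if c < t]
    tp = sum(answered) / len(answered) if answered else 1.0
    fn = sum(refused) / len(refused) if refused else 0.0
    return tp >= t, fn <= t

# A well-calibrated confidence signal satisfies both constraints:
ok_tp, ok_fn = calibration_constraints(
    [0.9, 0.8, 0.4, 0.2], [True, True, False, False], t=0.5)
assert ok_tp and ok_fn
```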
Figure 5 displays the TP and FN curves for various models. The TP curve lies mostly above the diagonal (TP = t), and the FN curve lies mostly below it, satisfying the quantitative constraints of behavioral calibration.
Figure 5: True Positive (solid line) and False Negative (dashed line) for behavioral calibration. The TP curve should be above the diagonal, and the FN curve should be below the diagonal. Base and Base-ppo are baselines.
Cross-Domain Generalization: Transferability of Meta-Skills
To verify whether the meta-cognitive abilities trained by this method are transferable, the research team directly evaluated the model trained on math data on SimpleQA (a challenging long-tail factual knowledge benchmark) in a zero-shot setting.
The results show that the method's SNR is significantly better than that of the base instruction model and surpasses most evaluated frontier models, performing comparably to the strongest of them, including Claude-Sonnet-4.5 and GPT-5. Even in this zero-shot setting, behavioral calibration transferred effectively to a new domain where the model lacks the underlying knowledge. This indicates that behavioral calibration is a skill decoupled from prediction accuracy.
Research Insights: Hallucination Mitigation and Accuracy Are Independent Capabilities
This study also provides some theoretical insights:
1. Hallucination mitigation and factual accuracy are two distinct capabilities. Across frontier models, the team observed no positive correlation between accuracy and either hallucination rate or confidence calibration; the advantage of the GPT-series models lies more in controlling hallucinations than in raw accuracy.
2. Small models can achieve confidence calibration comparable to large models. The computational resources required to achieve effective "calibration" are far lower than those needed to pursue absolute accuracy. Conversely, the verbalized confidence of some large models does not accurately reflect their actual performance.
3. Behavioral calibration is a learnable attribute that can be improved through training. This contrasts with the previous view that hallucination is an inevitable built-in characteristic of LLMs.