Source: Synced
The Next Stop for Reinforcement Learning: From Supervised to Unsupervised
Reinforcement learning is reshaping the capability boundaries of large models. Cutting-edge models such as OpenAI o3, DeepSeek-R1, and Gemini 3 use large-scale RLVR (Reinforcement Learning with Verifiable Rewards) to push the frontier of reasoning tasks. But everyone knows that purely supervised training is unsustainable: the cost of human annotation grows exponentially, and reliable labels in specialized domains are increasingly hard to obtain. When model capabilities approach or even surpass those of human experts, who will grade them?
Beginning with TTRL, Unsupervised RLVR has emerged, allowing models to keep evolving without human annotations. This is not just a matter of cutting costs and improving efficiency; it is a necessary step on the path toward superintelligence. Just as pre-training conjured GPT out of unlabeled data, can Unsupervised RLVR repeat that miracle?
Paper Link: https://arxiv.org/abs/2603.08660
GitHub: https://github.com/PRIME-RL/TTRL/tree/urlvr-dev
X Thread: https://x.com/HBX_hbx/status/2031406636930338828
A recent study from a Tsinghua team has drawn the first boundary around this seemingly promising vision. The researchers systematically dissected the internal mechanisms of Unsupervised RLVR and found that all intrinsic reward methods built on the model's own signals (majority voting, entropy rewards, and other variants alike) follow a similar trajectory: performance rises rapidly early in training, then, past a critical point, begins an irreversible decline. This is not a defect of any particular method but the destiny of the mechanism: these methods essentially sharpen the model's existing preferences, acting like an echo chamber in which the model repeatedly reinforces what it initially believed. If that initial confidence happens to be correct, the results are astonishing; if it is misaligned, collapse is only a matter of time.
But this does not mean intrinsic rewards are worthless. In small-scale test-time training, they can still steadily improve performance; even a model that starts out completely wrong can evolve through self-correction. More importantly, the researchers found a "prophetic indicator" that can predict a model's trainability before large-scale training, without needing to run the full curve.
When intrinsic rewards are limited by the model's own echo, external reward methods begin to show a different picture—for example, allowing models to use the asymmetry between generation and verification to anchor rewards. Such methods are breaking through the ceiling of intrinsic rewards, truly pushing unsupervised reinforcement learning toward scalability.
On the road to superintelligence, what we need is not blind faith that models can evolve themselves, but knowing when to let them listen to their own echoes, and when to push them toward real-world verification.
Intrinsic Reward Methods: Deep Problems Beneath the Prosperity
Over the past year, a variety of "intrinsic reward" methods have emerged in quick succession. From majority voting to variants based on model confidence or entropy, they all construct proxy rewards from the model's internal signals, and their performance soars in the early stages of training, even briefly surpassing supervised methods.
Researchers categorize these methods into two classes based on the source of rewards: one based on certainty, directly taking confidence metrics on reasoning trajectories as rewards; the other based on ensemble, using aggregated results from multiple rollouts (such as majority voting) to anchor correctness.
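As a concrete illustration of the two classes, here is a minimal sketch (hypothetical function names, not the paper's implementation): a certainty-based reward scores a single trajectory by its own likelihood, while an ensemble-based reward anchors each rollout to the majority answer, TTRL-style.

```python
from collections import Counter

def certainty_reward(token_logprobs):
    """Certainty-based reward (sketch): mean token log-probability of a
    sampled reasoning trajectory, so higher means more confident.
    Negative entropy or sequence likelihood are common variants."""
    return sum(token_logprobs) / len(token_logprobs)

def ensemble_reward(rollout_answers):
    """Ensemble-based reward (sketch): majority voting over N rollouts.
    Each rollout earns 1.0 if its final answer matches the majority
    answer, else 0.0 (a TTRL-style pseudo-label)."""
    majority, _ = Counter(rollout_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in rollout_answers]
```

Note that neither function ever consults a ground-truth label; both rewards are manufactured entirely from the model's own outputs.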
The rewards may be free, but they exact a steep price. After the initial surge in training performance, continued training triggers textbook reward hacking:
• proxy rewards continue to rise while true performance collapses
• the model grows increasingly confident, but answers become increasingly absurd
• different intrinsic reward methods perform vastly differently across different models
More critically, no one can explain why they work or why they fail.
What We Did: Opening the Black Box, Drawing Clear Boundaries
We don't just want to "propose new methods to benchmark higher"; we want to answer the question that no one has clarified:
Where are the scaling limits of Unsupervised RLVR? If there are limits, where are the boundaries?
To this end, we did five things:
• Unified Theoretical Framework: We brought seemingly diverse intrinsic reward methods under the same mechanism, revealing their convergent essence—sharpening the model's initial distribution—and providing theoretical convergence bounds.
• Large-scale Empirical Validation: 11 models × 5 intrinsic reward methods × hyperparameter sweeps, letting data speak to verify that "rise then fall" is not accidental, but a universal pattern.
• Mapping the Safe Zone: Not all scenarios collapse. We found that in small-scale test-time training, intrinsic rewards can be safely used, enabling stable evolution even when completely wrong initially.
• Turning Pitfalls into Signposts: Rise and fall is not just a risk; it is information itself. We used it to extract a model prior indicator that can predict whether a base model is suitable for reinforcement learning without running the full RL curve.
• Exploring Alternatives: Since intrinsic rewards have a ceiling, we looked externally. We preliminarily explored external reward methods based on generation-verification asymmetry to see if they can truly break through the scaling limits of intrinsic rewards.
Four Key Findings
Finding 1: Success Depends on the Degree of "Confidence-Correctness" Alignment
We established a unified theory for intrinsic reward methods, revealing their essence: sharpening the distribution—amplifying existing model preferences rather than creating new knowledge. This mechanism has a characteristic:
• If the model's initial tendency is correct → sharpening is effective, performance improves
• If the model's initial tendency is wrong → sharpening is harmful, accelerating collapse
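One standard way to make "sharpening" precise (a textbook KL-regularized RL result, not the paper's exact bound): if the reward is the base model's own confidence, $r(y) = \log \pi_0(y \mid x)$, the KL-regularized optimum is

```latex
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi}\!\left[\log \pi_0(y \mid x)\right]
          - \beta\,\mathrm{KL}\!\left(\pi \,\middle\|\, \pi_0\right)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) \;\propto\; \pi_0(y \mid x)^{\,1 + 1/\beta}.
```

Since $1 + 1/\beta > 1$, the optimum is simply the base distribution raised to a power: existing modes grow, tails shrink, and no probability mass ever moves to answers the base model did not already favor. Whether that helps depends entirely on whether the modes were correct to begin with.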
We define the model's initial tendency (its model prior) as the degree of confidence-correctness alignment; that is, if we only improve the model's self-consistency, how many more answers does it get right? In other words, a model with a strong prior has already mastered most of the knowledge needed to solve the problems; it simply lacks the confidence to commit to the correct answers.
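This alignment can be probed offline with a small labeled set, by asking how often the answer that sharpening would converge to (the majority-voted one) is actually correct. A minimal sketch, with `alignment_score` a hypothetical name; the annotation-free alternative the paper proposes appears in Finding 3:

```python
from collections import Counter

def alignment_score(probe_set):
    """Fraction of questions whose majority-voted answer (the answer
    sharpening would amplify) matches the gold answer. `probe_set` is
    a list of (rollout_answers, gold_answer) pairs."""
    hits = 0
    for rollouts, gold in probe_set:
        majority, _ = Counter(rollouts).most_common(1)[0]
        hits += (majority == gold)
    return hits / len(probe_set)
```

A score near 1 means confidence and correctness point the same way, so sharpening should help; a low score means sharpening will amplify errors.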
We tested 11 models, 5 methods, and 4 common hyperparameters. The conclusion is blunt: collapse is inevitable; it is only a matter of time. Even the most stable configuration cannot survive more than a few epochs. This suggests the failure is not an engineering problem but a mathematical inevitability.
Left: Success depends on the alignment between confidence and correctness; Right: Evolution of confidence and correctness on individual data points throughout training
Finding 2: Safer in Small-Scale Scenarios
Rise and fall is destiny, but destiny has its scope of application.
When training data is sufficiently scarce, as in domain-specific scenarios like test-time training, intrinsic reward methods instead show a rare stability. The reason is simple: when optimizing confidence on only a handful of samples, the model cannot drift far before hitting a wall. Even if it becomes "supremely confident" on those samples, a global policy shift is hard to trigger, and accuracy on OOD (out-of-distribution) tasks remains stable.
Even more interesting is an extreme experiment: researchers deliberately selected 32 samples where the model was completely wrong as the training set. In other words, the proxy reward provided by intrinsic rewards was wrong from the start. The result? Performance on the OOD test set still improved steadily.
This shows that intrinsic rewards are not teaching the model "what is right," but rather teaching it "to believe in itself more." Even if it believes wrongly, this self-reinforcement is locked firmly in a local scope, unable to cause major waves.
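One way to quantify "locked in a local scope" is to monitor how far the updated policy's answer distribution has drifted from the base model's on held-out prompts, for instance via KL divergence. An illustrative sketch (the threshold is hypothetical):

```python
import math

def policy_kl(p, q, eps=1e-12):
    """KL(p || q) between two categorical answer distributions, e.g.
    empirical answer frequencies of the updated vs. base policy
    on a fixed set of held-out prompts."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def drift_exceeded(base_dist, current_dist, threshold=0.5):
    """Illustrative monitor: flag when the policy has drifted too far
    from the base model (hypothetical threshold)."""
    return policy_kl(current_dist, base_dist) > threshold
```

In small-scale TTT this divergence stays small throughout training, which is exactly why the self-reinforcement cannot cause major waves.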
Left: Small-scale TTT shows stable improvement without collapse; Right: KL divergence of policy under different training set sizes
Finding 3: Determining Whether a Model is Suitable for RL
Rise and fall is not just a risk; it is information itself.
Since the success of intrinsic rewards depends on the model's initial "confidence-correctness" alignment, can we use this alignment to pre-determine whether a base model is suitable for RL? After all, the cost of running large-scale RL is too high, and the field has long lacked a lightweight predictive indicator.
The researchers found a measuring stick: the Model Collapse Step, measuring how many steps a model can survive under intrinsic reward training before completely collapsing. The logic is simple: the later the collapse, the better the model's initial prior, meaning it has mastered more correct knowledge but lacks confidence; and this prior is exactly what standard supervised RL can amplify. In other words, the collapse point of intrinsic rewards is a natural indicator of a model's "RL trainability."
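A minimal sketch of how such a collapse step might be read off an evaluation curve (hypothetical thresholds; the paper's exact criterion may differ): track the running peak of held-out accuracy and report the peak's step once accuracy has stayed well below it for several consecutive evaluations.

```python
def collapse_step(accuracy_curve, patience=3, drop=0.05):
    """Estimate the Model Collapse Step from a held-out accuracy curve:
    the step of the running peak, reported once accuracy has stayed at
    least `drop` below that peak for `patience` consecutive evals.
    Returns None if no collapse is observed."""
    peak, peak_step, below = accuracy_curve[0], 0, 0
    for step, acc in enumerate(accuracy_curve):
        if acc > peak:
            peak, peak_step, below = acc, step, 0  # new peak resets the count
        elif acc < peak - drop:
            below += 1
            if below >= patience:
                return peak_step
        else:
            below = 0
    return None
```

The later this step, the stronger the model's prior, which is exactly what makes it a cheap proxy for RL trainability.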
Results confirm this. The Qwen series, widely recognized as "suitable for RL," survives longer under intrinsic rewards. More interestingly, this indicator requires no ground truth annotations, yet predicts better than traditional pass@k.
Turning failure into signposts, turning expensive trial-and-error into lightweight prediction.
Left: Model collapse steps for different base models under unsupervised intrinsic reward training; Center: Performance improvement of corresponding base models under supervised RLVR. The later the collapse under unsupervised intrinsic rewards, the better the results after supervised RLVR, with prediction accuracy exceeding traditional pass@k.
Finding 4: External Rewards Are the Scalable Direction
If intrinsic rewards are destined to have a ceiling, where lies the path forward?
The root of the problem lies in the source of rewards. Intrinsic reward methods use the model's own confidence to train itself—like a closed-loop echo chamber, where reward signals are forever limited to what the model already knows. You cannot use it to teach the model what it truly doesn't know.
But Unsupervised RLVR is more than this. We categorize external reward methods into two classes:
• Leveraging unlabeled data: Mining reward signals from massive corpora. The more data, the richer the reward signals, which will not deplete as the model grows stronger.
• Leveraging generation-verification asymmetry: Letting models generate answers themselves, then using external tools (compilers, proof assistants, simulators) to verify and provide environmental feedback. These verifiers do not fail as the model grows stronger; their judgments remain forever objective.
We preliminarily tested self-verification methods, and results showed a completely different curve: continuous improvement without collapse. The reason is simple: rewards do not come from "how confident the model is," but from "whether the answer passes objective verification." Generating solutions may be difficult, but checking correctness is often simple; this asymmetry anchors the model's evolution to the iron laws of the real world, rather than its own echoes.
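A toy sketch of what anchoring rewards to an external verifier can look like for code tasks (illustrative only: `solve` is an assumed entry-point name, and real systems execute candidates in a sandbox rather than calling `exec` directly):

```python
def verifier_reward(candidate_code, test_cases):
    """Verifier-anchored reward (sketch): run a model-generated
    function against checkable input/output pairs. The reward comes
    from objective execution, not from the model's confidence.
    WARNING: `exec` on untrusted code is unsafe; illustration only."""
    scope = {}
    try:
        exec(candidate_code, scope)   # define the candidate function
        fn = scope["solve"]           # assumed entry-point name
        passed = sum(fn(x) == y for x, y in test_cases)
        return passed / len(test_cases)
    except Exception:
        return 0.0                    # failed to define or run: no reward
```

Generating `candidate_code` may require deep reasoning, but checking it is a few function calls; that asymmetry is what keeps the reward signal honest as the model improves.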
Intrinsic rewards ask "do you believe in yourself," while external rewards ask "is this true." The answer to scalable unsupervised reinforcement learning perhaps lies in the latter.
Final Words: Beyond the Boundaries
We have spent considerable space depicting the boundaries of unsupervised reinforcement learning. But the value of this map never lies in telling you "road closed," but in answering: under what conditions, which road is open.
Whether a system can improve by examining itself depends on how accurate its initial judgments are. The reason intrinsic reward methods fail is precisely the reason they succeed—the same mechanism: self-reinforcement. The only difference is whether what is reinforced is truth or bias.
Only when we recognize the destiny of intrinsic rewards can we truly see the vast ocean of external rewards. The path to scalable unsupervised reinforcement learning requires not blind faith that models can evolve themselves, but knowing when to let them listen to their own echoes, and when to push them toward real-world verification.
Intrinsic and external are not opposites, but different tools in the toolbox. Recognizing boundaries is not for stopping, but for creating freely within boundaries, and searching for new possibilities beyond them.