The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, with an audience of NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to promote exchange and progress between academia, industry, and enthusiasts in natural language processing and machine learning, in China and abroad, especially for beginners.
Source | Machine Heart
In the era of large models, from code generation to mathematical reasoning, to autonomous planning Agent systems, reinforcement learning has almost become the standard configuration for the "last mile."
Intuitively, what developers really want is simple: make the model more likely to generate a "correct trajectory." From a probabilistic perspective, this is equivalent to maximizing the probability of correct output, which is the classic maximum likelihood objective.
However, a new work from research institutions such as CMU, Tsinghua University, and Zhejiang University points out a rather disruptive fact:
In reality, the widely used form of reinforcement learning does not truly perform maximum likelihood optimization. A rigorous theoretical analysis shows that reinforcement learning optimizes only the first-order approximation of the maximum likelihood objective, still far from the training objective we thought we were optimizing.
Based on this observation, the research team re-examined the objective function of reinforcement learning and proposed Maximum Likelihood Reinforcement Learning: re-characterizing correctness-based reinforcement learning as a maximum likelihood problem of latent variable generation, and further introducing a family of objective functions indexed by computational cost, so that the training goal can gradually approach the true maximum likelihood optimization.
Paper title: Maximum Likelihood Reinforcement Learning
Paper link: https://arxiv.org/abs/2602.02710
Project address: https://zanette-labs.github.io/MaxRL/
Github address: https://github.com/tajwarfahim/maxrl
The "Bottleneck" Problem of Traditional Reinforcement Learning
In tasks such as code generation, mathematical reasoning, and multi-step decision-making, we have formed a nearly default consensus: as long as the feedback is binary and the process is non-differentiable, use reinforcement learning.
The reinforcement learning paradigm has supported a series of key advances from AlphaGo to the improvement of large language model reasoning capabilities.
From an end-to-end perspective, reinforcement learning is given an input, and the model implicitly induces a "success probability." If we ignore the differentiability constraint, the most natural and principled goal is maximum likelihood.
But the research team found that reinforcement learning based on expected reward optimizes only the first-order approximation of the maximum likelihood goal. More specifically, the maximum likelihood goal can be expanded into a series of terms corresponding to pass@k events, and standard reinforcement learning optimizes only the first-order term.
Simply put, reinforcement learning does not truly maximize the probability of the model generating the correct answer, but optimizes a substitute goal that has a systematic deviation from the true likelihood.
This also explains a widespread but hard-to-articulate phenomenon: reinforcement learning improves rapidly early in training, but further gains become increasingly difficult to obtain as training proceeds.
The research team, targeting this new discovery, re-characterized "reinforcement learning based on correctness feedback," and the main contributions of the paper are as follows:
Formalized correctness-based reinforcement learning as a maximum likelihood problem of latent variable generation, and proved that standard reinforcement learning only optimizes the first-order approximation of the maximum likelihood goal.
Proposed a family of objective functions indexed by computational cost, achieving continuous interpolation between expected return and precise maximum likelihood by performing Maclaurin expansion on pass@k events.
Derived a simple on-policy estimator, whose expected gradient is completely consistent with the likelihood approximation goal indexed by this computational cost, meaning that increasing sampling truly improves the optimized goal itself.
Maximum Likelihood: Truly Improving the Optimization Goal
Maximum likelihood estimation performs excellently in supervised learning, so the research team asked: why not pursue it directly in reinforcement learning?
The observation in the previous section suggests that we can construct a family of objective functions that change with computational cost, gradually introducing higher-order terms; as available computing resources increase, this family of objective functions will gradually converge to the complete maximum likelihood goal.
Through a series of derivations, the paper performs a Maclaurin expansion of the maximum likelihood goal in terms of failure events:
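The original equation figure is not reproduced here; a consistent reconstruction, writing p_θ(x) for the probability that a single sample from the policy is correct and q_θ(x) = 1 − p_θ(x) for the single-sample failure probability, is:

```latex
\log p_\theta(x) \;=\; \log\bigl(1 - q_\theta(x)\bigr) \;=\; -\sum_{k=1}^{\infty} \frac{q_\theta(x)^{k}}{k}
```

Since q_θ(x)^k is the probability that k independent samples all fail, each term corresponds to a pass@k event via q_θ(x)^k = 1 − pass@k(x), and the k = 1 term recovers the expected-reward objective up to a constant.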
The maximum likelihood gradient in the expansion is difficult to estimate with finite samples.
In particular, estimating the pass@k gradient for large k requires more and more samples, especially when the pass rate p is very small. This finite-sample difficulty is the motivation for Maximum Likelihood Reinforcement Learning (MaxRL).
The research team defines MaxRL as a class of reinforcement learning methods that explicitly take maximum likelihood as the goal, rather than the pass rate goal, while still being implementable under the conditions of finite sampling and non-differentiable generation. Below, we consider a principled method to achieve this goal.
Consider approximating the maximum likelihood goal by truncating the Maclaurin expansion at a finite order, and then estimating this truncated goal. For a truncation level T ∈ ℕ, we define the truncated maximum likelihood goal for a fixed input x as:
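The defining equation is shown as a figure in the original; a consistent reconstruction, with q_θ(x) = 1 − p_θ(x) the single-sample failure probability, truncates the series at order T:

```latex
\mathcal{J}_T(x; \theta) \;=\; -\sum_{k=1}^{T} \frac{q_\theta(x)^{k}}{k}
```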
Taking the derivative yields the truncated overall gradient:
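The gradient figure is likewise missing; using ∇_θ q_θ(x) = −∇_θ p_θ(x), a consistent reconstruction is:

```latex
\nabla_\theta \mathcal{J}_T(x; \theta) \;=\; \sum_{k=1}^{T} q_\theta(x)^{k-1} \,\nabla_\theta p_\theta(x)
```

At T = 1 this is the expected-reward gradient ∇_θ p_θ(x); as T → ∞ the geometric sum converges to ∇_θ p_θ(x) / p_θ(x) = ∇_θ log p_θ(x).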
This defines a family of objective functions: T = 1 reduces to reinforcement learning, T → ∞ reduces to maximum likelihood, and intermediate T values interpolate between the two. Therefore, the truncation level T directly controls the order of correctness events that contribute to learning. As more computational cost is consumed in rollout, estimating higher-order gradients becomes feasible.
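This interpolation can be checked numerically. In the truncated gradient, ∇p is scaled by the weight w(T) = Σ_{k=1}^{T} q^{k−1}, where q = 1 − p is the single-sample failure probability; the values below are illustrative, not from the paper:

```python
# Numeric check of the interpolation between expected reward and
# maximum likelihood: the order-T truncated gradient scales grad p
# by w(T) = sum_{k=1}^T q^{k-1}, with q = 1 - p the failure probability.
p = 0.2
q = 1.0 - p

def weight(T: int) -> float:
    """Scalar multiplying grad p in the order-T truncated gradient."""
    return sum(q ** (k - 1) for k in range(1, T + 1))

print(weight(1))    # 1.0: standard expected-reward RL
print(weight(500))  # approaches 1/p = 5.0: the maximum likelihood scale
```

As T grows, the weight converges to 1/p, exactly the extra factor that turns the expected-reward gradient ∇p into the maximum likelihood gradient ∇p/p.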
In other words: MaxRL provides a principled framework for trading increased computational cost for higher-fidelity approximation of the maximum likelihood goal.
The above formula has already provided a feasible unbiased estimation idea: using the pass@k gradient estimator to approximate each term in the finite series separately. Under this strategy, any improvement to the pass@k estimator will directly translate into a better gradient estimate for the truncated maximum likelihood goal.
However, in this paper the researchers took a different path, one that yields a more concise estimator and a new interpretive perspective.
The gradient of the maximum likelihood goal can be written in the following conditional expectation form:
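The conditional-expectation form referenced here (reconstructed from the surrounding text; r(x, y) ∈ {0, 1} is the correctness reward) is:

```latex
\nabla_\theta \log p_\theta(x) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ \nabla_\theta \log \pi_\theta(y \mid x) \;\middle|\; r(x, y) = 1 \right]
```

It follows from ∇_θ log p_θ(x) = ∇_θ p_θ(x) / p_θ(x) together with the score-function identity ∇_θ p_θ(x) = E[r(x, y) · ∇_θ log π_θ(y | x)].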
This theorem shows that the maximum likelihood gradient is equivalent to averaging only the gradients of successful trajectories. This interpretation provides a direct way to construct a specific gradient estimator: simply use the successful trajectories obtained by sampling to perform a sample average on the above conditional expectation.
The core insight is that: the gradient of the maximum likelihood goal can be expressed as an expectation under the "successful conditional distribution."
Therefore, this paper adopts a simple strategy: sampling from the non-conditional policy distribution, but only averaging the successful trajectories, obtaining a reinforcement learning-style estimator, which has the characteristic that as the number of rollouts increases, the approximation to the maximum likelihood gradient will continuously improve.
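A minimal numpy sketch of this strategy; the function and variable names are our own, not from the paper's code. Each rollout contributes its log-probability gradient only if it succeeded, and the average is taken over the successes alone:

```python
import numpy as np

def maxrl_grad_estimate(logpi_grads: np.ndarray, rewards: np.ndarray) -> np.ndarray:
    """Average grad log pi(y|x) over successful rollouts only,
    approximating E[grad log pi(y|x) | r(x, y) = 1] = grad log p(x).

    logpi_grads: (n_rollouts, n_params) per-rollout score vectors
    rewards:     (n_rollouts,) binary correctness feedback
    """
    success = rewards == 1
    if not success.any():
        # No successful rollout: no gradient signal for this input.
        return np.zeros(logpi_grads.shape[1])
    return logpi_grads[success].mean(axis=0)

# Toy batch: rollouts 0 and 2 succeed, rollout 1 fails.
grads = np.array([[1.0, 0.0], [9.0, 9.0], [3.0, 2.0]])
rewards = np.array([1, 0, 1])
print(maxrl_grad_estimate(grads, rewards))  # [2. 1.]
```

With more rollouts per input, the sample average over successes approaches the conditional expectation, so extra compute directly tightens the approximation to the maximum likelihood gradient.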
In other words, under the MaxRL framework, additional computational resources not only improve the estimation quality but also directly improve the optimized goal itself.
Surprising Efficiency Gains
In experiments, this change brought benefits far beyond expectations. The research team conducted a systematic evaluation of MaxRL across multiple model scales and various task types. The results show that MaxRL consistently outperforms existing reinforcement learning methods in the trade-off between performance and computational efficiency.
The experimental results intuitively demonstrate the advantage of MaxRL in training efficiency. Under the same number of training steps, MaxRL's performance improvement is significantly faster, and as the number of rollouts increases, MaxRL continues to benefit.
This advantage is not limited to the training stage. Compared to models trained with GRPO, MaxRL improves test-time scaling efficiency by up to 20×.
On the maze task, regardless of the test-time sampling budget k, MaxRL continues to reduce −log(pass@k) as the number of training rollouts increases, while the improvement of GRPO and RLOO flattens out much earlier. This result intuitively demonstrates MaxRL's superior performance-efficiency trade-off during training.
Comparing the optimization trends of each method as training sampling computation increases under different pass@k settings, we can see that for GRPO and RLOO, the curve flattens out quickly after an initial decline, indicating that additional sampling is mainly used to reduce noise; while MaxRL maintains a continuous decline under different k values, driving the model to continuously approach a more maximum-likelihood optimization goal.
Under larger-scale settings, MaxRL's advantage remains stable. This indicates that the improvements brought by MaxRL do not depend on specific scale or hyperparameter settings. When the training scale expands, MaxRL does not show a rapid decline in benefits or a disappearance of its advantage.
Further experimental results show that MaxRL's advantages do not rely on overly idealized experimental conditions. Even in settings where feedback is noisy or validation signals are not completely reliable, MaxRL can still maintain a relatively stable performance advantage.
Overall, MaxRL provides a more principled solution for non-differentiable, sampling-based learning problems: a family of objectives that naturally expands with computational cost and systematically approaches true likelihood optimization.
When the optimization goal itself can evolve with computing power and gradually approach maximum likelihood, will reinforcement learning become the long-term answer to general intelligence, or just a transitional solution to the next training paradigm?
For more information, please refer to the original paper.