The first author of this article, Mengqi Li, is a Ph.D. student in Computer Science at The Chinese University of Hong Kong, Shenzhen. This research was conducted in collaboration with Professor Lei Zhao of Shanghai Jiao Tong University and Professor Wai-Shu Su of The Chinese University of Hong Kong, and completed under the joint supervision of Professor Ruoyu Sun and Professor Xiao Li at The Chinese University of Hong Kong, Shenzhen.
In post-training for reasoning, most methods still rely on reward models, verifiers, or additional teacher signals. If we forgo these external signals and train only on answers generated by the model itself, can reasoning ability still improve? SePT (Self-evolving Post-Training) answers yes: this simple self-training method achieves up to a 10-point increase in accuracy on mathematical reasoning tasks.
Paper Title:
A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
Paper:
Code:
A schematic of SePT's online self-training loop: Samples are generated with a sampling temperature τ_s, and the training phase employs standard SFT; the next round of training data is generated by the updated model.
As shown in the flowchart, the core of SePT is simple: the current model generates answers, these answers are used for standard SFT, and the updated model then generates the next round of training data. Let's first look at how much improvement this online self-training loop actually brings.
Results on Mathematical Reasoning: SePT vs Baseline
The main results are shown in the figure. The baseline here is not the base model evaluated under default sampling settings. It is a stronger baseline: the base model without any post-training, evaluated over a sweep of inference-time sampling temperatures, with the best result reported. After SePT self-training, Pass@1, Pass@8, Pass@32, and AVG, each averaged over six mathematical benchmarks, all improve noticeably.
On Qwen2.5-Math-7B, the average Pass@1, Pass@8, Pass@32, and AVG across six math benchmarks: SePT significantly outperforms the baseline on all metrics.
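The article does not say how Pass@k is computed. A common choice, assumed here, is the unbiased estimator that draws n samples per problem, counts the c correct ones, and evaluates 1 − C(n−c, k)/C(n, k). The sketch below uses that estimator; the per-problem counts are hypothetical and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every subset of k samples must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over all problems of one benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts: 4 problems, 32 samples each (not data from the paper).
counts = [0, 1, 8, 32]
for k in (1, 8, 32):
    print(f"Pass@{k}: {benchmark_pass_at_k(counts, 32, k):.3f}")
```

With this estimator, Pass@1 rewards a model that is right on most samples, while Pass@32 only needs one correct sample among 32, which is why the two metrics can prefer different sampling temperatures.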
Results on Mathematical Reasoning: SePT vs RLVR
When further compared with an RLVR method (GRPO), we can see that the self-training method SePT already achieves results quite close to GRPO, especially on the OTM dataset.
According to the results in the table, on Qwen2.5-Math-7B, when using OTM, the AVG for SePT and GRPO are 55.2 and 56.6 respectively, a difference of only 1.4; whereas on DSR, this gap widens to 4.1 (55.0 vs. 59.1). On DeepSeek-Math-7B-Instruct, the same gaps are 0.4 (33.0 vs. 33.4) and 1.7 (33.9 vs. 35.6), respectively. Remarkably, under the OTM setting on Qwen2.5-Math-7B, SePT's Pass@1 even slightly surpasses GRPO (40.8 vs. 39.5).
These results indicate that, under the comparison settings of this paper, SePT is less sensitive to the choice of training problem set, whereas GRPO gains more from DSR.
Average benchmark comparison of OpenThoughts-Math (OTM) and DeepScaleR (DSR) on Qwen2.5-Math-7B and DeepSeek-Math-7B-Instruct. The two training sets are similar in size. Δ indicates the change of DSR relative to OTM; shaded areas mark cases where DSR outperforms OTM by at least 2.0 points.
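The gaps quoted above can be re-derived from the AVG values given in the text. The short sketch below does only that arithmetic; the dictionary layout and function names are illustrative choices, not part of the paper's code.

```python
# AVG scores quoted in the text: (model, training set) -> (SePT, GRPO)
AVG = {
    ("Qwen2.5-Math-7B", "OTM"): (55.2, 56.6),
    ("Qwen2.5-Math-7B", "DSR"): (55.0, 59.1),
    ("DeepSeek-Math-7B-Instruct", "OTM"): (33.0, 33.4),
    ("DeepSeek-Math-7B-Instruct", "DSR"): (33.9, 35.6),
}
SEPT, GRPO = 0, 1  # column indices into the tuples above


def grpo_minus_sept(model, dataset):
    """Gap between GRPO and SePT on one training set."""
    sept, grpo = AVG[(model, dataset)]
    return round(grpo - sept, 1)


def dsr_minus_otm(model, method):
    """Change of DSR relative to OTM for one method (the table's delta)."""
    return round(AVG[(model, "DSR")][method] - AVG[(model, "OTM")][method], 1)


for model in ("Qwen2.5-Math-7B", "DeepSeek-Math-7B-Instruct"):
    print(model)
    print("  GRPO - SePT on OTM:", grpo_minus_sept(model, "OTM"))
    print("  GRPO - SePT on DSR:", grpo_minus_sept(model, "DSR"))
    print("  DSR - OTM for SePT:", dsr_minus_otm(model, SEPT))
    print("  DSR - OTM for GRPO:", dsr_minus_otm(model, GRPO))
```

On these numbers the DSR-minus-OTM change reaches 2.0 points only for GRPO, on both models, which matches the reading that SePT depends less on which problem set is used.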
Detailed SePT Algorithm Flow
SePT features an extremely simple self-training framework design, which can be summarized into the following three steps:
1. Sample problems from the question bank, and have the current model generate answers at a sampling temperature τ_s;
2. Perform standard SFT on the current model using these self-generated samples;
3. The updated model then generates the next round of training data.
The key to this design can be summarized as: temperature decoupling, standard SFT training, and data self-generation by the latest model.
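The three steps can be sketched on a toy problem. This is not the authors' verl-based implementation: here the "model" for each prompt is just a list of logits over a few candidate answers, and names such as `sept_toy`, `sft_step`, and the learning rate are assumptions made for illustration. Sampling uses temperature τ_s, and the update is a plain gradient step on the negative log-likelihood at training temperature 1.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Draw one answer index from the tempered distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def sft_step(logits, target, lr):
    """One gradient step on -log softmax(logits)[target] (training temperature 1)."""
    probs = softmax(logits, 1.0)
    return [z - lr * (p - (1.0 if i == target else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]


def sept_toy(models, tau_s=0.6, rounds=200, lr=0.1, seed=0):
    """Toy online loop. models maps a prompt to its logits over candidate answers."""
    rng = random.Random(seed)
    models = {prompt: list(z) for prompt, z in models.items()}  # do not mutate input
    for _ in range(rounds):
        # Step 1: the current model generates one answer per prompt at tau_s.
        batch = [(prompt, sample(z, tau_s, rng)) for prompt, z in models.items()]
        # Step 2: standard SFT on the self-generated samples, no reward or verifier.
        for prompt, answer in batch:
            models[prompt] = sft_step(models[prompt], answer, lr)
        # Step 3: the next iteration samples from the updated model.
    return models


start = {"toy-prompt": [1.0, 0.5, 0.0]}
end = sept_toy(start)
print("before:", [round(p, 3) for p in softmax(start["toy-prompt"])])
print("after: ", [round(p, 3) for p in softmax(end["toy-prompt"])])
```

The toy has no notion of answer correctness; it only shows the mechanics of sampling at τ_s, training at temperature 1, and regenerating data from the updated model each round.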
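If the model used for sampling in the current round is denoted as $\pi_{\theta_k}$, the training problem set as $\mathcal{D}$, the sampling temperature as $\tau_s$, and the training temperature as $\tau_t$, the training objective of SePT can be written as
$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_k}^{\tau_s}(\cdot \mid x)} \left[ \log \pi_{\theta}^{\tau_t}(y \mid x) \right],$$
where a superscript temperature means the logits are divided by that temperature before the softmax.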
In other words, SePT does not introduce any extra reward, advantage, verifier, or teacher signal; the training phase is standard negative log-likelihood, i.e., standard SFT, with the only difference being that the training samples come from trajectories generated by the model itself in the previous round at a temperature of τ_s.
The experiments in this paper use standard SFT training, i.e., τ_t = 1, and by default sample only one response per prompt, which is one reason SePT is very lightweight in terms of engineering.
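Under these defaults the objective reduces to ordinary token-level cross-entropy on one self-generated response per prompt. The expansion below is a sketch in notation assumed here rather than quoted from the paper: $\pi_\theta$ is the model being trained, $\theta_k$ are the weights that generated the data, $\mathcal{D}$ is the problem set, and $y_i$ is the $i$-th token of the response.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{x \sim \mathcal{D}}\;
      \mathbb{E}_{y \sim \pi_{\theta_k}^{\tau_s}(\cdot \mid x)}
      \left[ \sum_{i=1}^{|y|} \log \pi_{\theta}\!\left(y_i \mid x,\, y_{<i}\right) \right],
\qquad \tau_t = 1 .
```

The sampling temperature appears only in how $y$ is drawn; the loss itself is the same one used for SFT on human-written data.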
Online Self-Generated Data in SePT
This paper further validates the importance of this design through ablation studies: if the round-by-round regeneration of training data by the latest model is replaced with training on a fixed dataset, performance drops significantly. Taking Qwen2.5-Math-7B as an example, SePT (Offline) achieves an AVG of 45.5, whereas the online version, SePT, reaches 55.0.
Comparison of SePT and SePT (Offline) on Qwen2.5-Math-7B. Values in parentheses indicate the change relative to the baseline.
Temperature Decoupling in SePT
SePT involves two temperatures: one controls exploration during generation, and the other keeps training as standard SFT; the two need not be tied together. The sampling temperature used when generating self-training samples is τ_s, while the standard setting in the training phase is τ_t = 1.
Why does this matter? Theorem 1 of this paper provides an intuitive theoretical argument:
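If, at a certain prefix $h$, the sampling distribution of the old model is denoted as
$q(\cdot \mid h) = \mathrm{softmax}\big(z_{\text{old}}(h, \cdot)/\tau_s\big)$, and the optimal solution after training is denoted as
$\pi^{*}(\cdot \mid h) = \mathrm{softmax}\big(z^{*}(h, \cdot)/\tau_t\big)$, then there exists a constant
$c(h)$ such that
$$z^{*}(h, v) = \frac{\tau_t}{\tau_s}\, z_{\text{old}}(h, v) + c(h) \quad \text{for every token } v.$$
Therefore, for any two tokens $a, b$, we have
$$z^{*}(h, a) - z^{*}(h, b) = \frac{\tau_t}{\tau_s}\,\big(z_{\text{old}}(h, a) - z_{\text{old}}(h, b)\big).$$
This means that when $\tau_s < \tau_t = 1$ (the primary choice in this paper's experiments), the pairwise logit margin is amplified by a factor of $1/\tau_s > 1$. Intuitively, low-temperature sampling plus standard-temperature training does not simply make the model more conservative; rather, while preserving the relative order between pairs of tokens as much as possible, it moderately widens the preference margins that already exist from pre-training.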
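The margin-amplification claim can be checked numerically under two simplifying assumptions made here: the next-token distribution is a softmax over logits, and training fits the sampling distribution exactly. If samples come from softmax(z_old/τ_s) and the trained model is read out as softmax(z_new/τ_t), then scaling the old logits by τ_t/τ_s reproduces the sampling distribution and multiplies every pairwise margin by the same factor. The function name `optimal_logits` and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def optimal_logits(old_logits, tau_s, tau_t=1.0, c=0.0):
    """Logits whose softmax at tau_t equals the old model's softmax at tau_s.

    The additive constant c is arbitrary because softmax ignores shifts.
    """
    return [(tau_t / tau_s) * z + c for z in old_logits]


old = [2.0, 1.0, -0.5]      # hypothetical logits at one prefix
tau_s, tau_t = 0.5, 1.0     # low-temperature sampling, standard training
new = optimal_logits(old, tau_s, tau_t)

sampling_dist = softmax(old, tau_s)   # where the training samples come from
trained_dist = softmax(new, tau_t)    # what the trained model outputs
max_diff = max(abs(p - q) for p, q in zip(sampling_dist, trained_dist))

margin_old = old[0] - old[1]
margin_new = new[0] - new[1]
print("max probability difference:", max_diff)
print("margin amplification:", margin_new / margin_old)
```

The token ranking is unchanged because every logit is multiplied by the same positive factor; only the gaps between logits grow.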
This has also been directly validated experimentally. Taking Qwen2.5-Math-7B as an example, when using temperature coupling, Pass@1/Pass@8/Pass@32/AVG were only 19.3/50.1/64.3/44.6, with Pass@1 even lower than the baseline; after adopting decoupling, the four metrics improved to 39.5/57.7/67.9/55.0.
In other words, the "low-temperature generation + standard SFT" recipe in SePT is not merely an empirical trick for improving mathematical reasoning, but an important design choice supported by both theory and experiments.
Comparison of temperature coupling and decoupling solutions for Qwen2.5-Math-7B. Values in parentheses indicate the difference between the method value and the baseline value (Method−Baseline).
The table above shows that decoupling the generation temperature from the training temperature indeed yields better results. Why it helps can also be understood from the base model's own temperature-performance trade-off: as shown in the following figure, the best sampling temperature differs across metrics, which is the intuitive motivation for SePT not binding τ_s and τ_t together.
Results for Pass@1, Pass@8, Pass@32, and AVG on the base model across varying sampling temperatures.
Does Self-Training Harm the Model's General Capabilities?
Will the model's general capabilities be compromised by continued training solely on self-generated mathematical trajectories? The paper's answer is reassuring, based on a set of general-domain benchmarks on Qwen2.5-Math-7B: IFEval, BBH, GPQA, MuSR, and MMLU-Pro. The results show almost no degradation: the base model scored 23.4/47.5/29.9/41.4/32.1, while SePT scored 23.6/47.3/30.6/41.5/32.2. That is, SePT shows slight improvements on IFEval, GPQA, MuSR, and MMLU-Pro, with BBH remaining essentially unchanged; GRPO exhibited a similar pattern. This indicates that SePT self-training does not significantly harm the model's general capabilities.
Evaluation results of the Qwen2.5-Math-7B base model and its SePT and GRPO trained versions in the general domain.
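The per-benchmark changes can be read off the scores quoted in the text. The sketch below only repeats that arithmetic for the base model and SePT; GRPO is left out because its scores are not given in the text.

```python
benchmarks = ["IFEval", "BBH", "GPQA", "MuSR", "MMLU-Pro"]
base = [23.4, 47.5, 29.9, 41.4, 32.1]   # Qwen2.5-Math-7B base model
sept = [23.6, 47.3, 30.6, 41.5, 32.2]   # after SePT self-training

# Change of SePT relative to the base model, per benchmark.
delta = {name: round(s - b, 1) for name, b, s in zip(benchmarks, base, sept)}
mean_delta = sum(delta.values()) / len(delta)

for name, d in delta.items():
    print(f"{name}: {d:+.1f}")
print(f"mean change: {mean_delta:+.2f}")
```

The largest movement is on GPQA, and the only negative change is the small one on BBH.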
The Code is Simple and Usable
The project team's code implementation is based on ByteDance's open-source verl framework. More importantly, SePT itself is not tied to any specific framework. The method is straightforward: generate samples at a sampling temperature τ_s, perform standard SFT, and then have the updated model generate the next round of training data. Because this training loop is so lightweight, SePT fits naturally into verl, and teams that already have their own training framework or use other online training frameworks should find it relatively easy to migrate and reproduce.