What Exactly Is On-Policy Distillation? An In-Depth Interpretation of On-Policy/Self-Distillation


Homepage: http://qingkeai.online/


Author: Yi Mu Bu
Source: https://zhuanlan.zhihu.com/p/2004306938188537902

Regarding On-Policy work, I feel that much of it is selling concepts without substantial innovation. For instance, On-Policy Distillation is sometimes called SFT and sometimes called RL, when it might as well simply be called distillation.

The generally agreed-upon points are:

  • On-Policy Distillation improves performance compared to GRPO while avoiding excessive "Aha Moments".
  • On-Policy Distillation can mitigate catastrophic forgetting.
  • On-Policy Distillation is naturally suited for combination with GRPO, providing dense token-level reward signals.
  • The difficulty of On-Policy Distillation lies in obtaining the Teacher model; Self-Distillation, which uses the Policy model itself as the Teacher model, has been verified as feasible.
  • The implementation of On-Policy Distillation is compatible with RL, making it relatively easy to develop directly on RL frameworks.

This article focuses on introducing On-Policy/Self-Distillation (OPSD), which involves using the Policy itself as the Teacher model.

1. Objectives and Gradients of On-Policy Distillation

On-Policy Distillation aims to minimize the KL divergence between the student policy and the teacher policy over the trajectory distribution generated by the student policy itself:

$$\min_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t} \mathrm{KL}\big(\pi_\theta(\cdot \mid x, y_{<t})\,,\; \pi_{T}(\cdot \mid x, y_{<t})\big)\right]$$

Here, KL can be either Reverse KL or Forward KL. References [1-2] use Reverse KL, while reference [3] uses Forward KL.

For Forward KL, the gradient can be derived as:

$$\nabla_\theta\, \mathrm{KL}\big(\pi_T \,\|\, \pi_\theta\big) = -\,\mathbb{E}_{a \sim \pi_T}\big[\nabla_\theta \log \pi_\theta(a)\big] = -\,\mathbb{E}_{a \sim \pi_\theta}\!\left[\frac{\pi_T(a)}{\pi_\theta(a)}\,\nabla_\theta \log \pi_\theta(a)\right]$$

(per token; conditioning on $x$ and $y_{<t}$ is suppressed, and the sampled prefix is treated as fixed)

For Reverse KL, the gradient can be derived as:

$$\nabla_\theta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_T\big) = \mathbb{E}_{a \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(a)}{\pi_T(a)}\;\nabla_\theta \log \pi_\theta(a)\right]$$

(using $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a)] = 0$)

It can be observed that this is very similar to the policy-gradient objective in RL: both gradients take the form of a per-token weight times $\nabla_\theta \log \pi_\theta$, and only the weight differs. In RL, the weight is the Reward or Advantage; here it is a (log-)ratio between the teacher and student distributions.
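As a quick sanity check, the weighted-$\nabla_\theta \log \pi_\theta$ form of the reverse-KL gradient can be verified numerically on a toy softmax-parameterized token distribution against finite differences. Everything here is illustrative (a single categorical distribution standing in for one decoding step, a fixed teacher):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reverse_kl(theta, p_teacher):
    """KL(pi_theta || pi_T) for a categorical student pi_theta = softmax(theta)."""
    p = softmax(theta)
    return float(np.sum(p * np.log(p / p_teacher)))

rng = np.random.default_rng(0)
theta = rng.normal(size=5)               # student logits
p_teacher = softmax(rng.normal(size=5))  # fixed teacher distribution
p = softmax(theta)

# Policy-gradient form: E_{a ~ pi_theta}[ log(pi_theta/pi_T) * grad log pi_theta(a) ].
# For a softmax, grad_theta log pi(a) = onehot(a) - pi; the expectation over the
# 5 actions is computed exactly rather than sampled.
weights = np.log(p / p_teacher)          # per-action "reward"-like weight
grad_pg = np.zeros_like(theta)
for a in range(len(theta)):
    grad_log_pi = -p.copy()
    grad_log_pi[a] += 1.0
    grad_pg += p[a] * weights[a] * grad_log_pi

# Finite-difference gradient of the reverse KL itself, for comparison.
eps = 1e-6
grad_fd = np.array([
    (reverse_kl(theta + eps * np.eye(len(theta))[j], p_teacher)
     - reverse_kl(theta - eps * np.eye(len(theta))[j], p_teacher)) / (2 * eps)
    for j in range(len(theta))
])

print(np.max(np.abs(grad_pg - grad_fd)))  # agreement up to finite-difference error
```

The same check works for the forward-KL form by swapping the weight for the importance ratio $\pi_T / \pi_\theta$.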

2. On-Policy Self-Distillation

Self-Distillation aims to use the Policy itself as the Teacher model:

$$\min_{\theta}\; \mathbb{E}_{x,\; y \sim \pi_\theta}\left[\sum_{t} \mathrm{KL}\big(\pi_\theta(\cdot \mid x, y_{<t})\,,\; \mathrm{sg}\big[\pi_\theta(\cdot \mid x, K, y_{<t})\big]\big)\right], \qquad K = f(x)$$

Here, sg denotes the stop-gradient operator and K denotes extra knowledge, which can be obtained via some function f(x).
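The objective can be sketched concretely with toy numpy arrays standing in for model logits. The "teacher" here is the same network given the extra knowledge K in its context, and the stop-gradient is reflected by treating its output as constants (all shapes and names are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_distill_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL(student || sg[teacher]).

    student_logits: [T, V] logits of pi_theta(. | x, y_<t)
    teacher_logits: [T, V] logits of pi_theta(. | x, K, y_<t), i.e. the same
                    weights with extra knowledge K in context; treated as
                    constants here, which is the stop-gradient sg.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)  # sg: no gradient flows through these
    return float(np.mean(np.sum(p_s * (np.log(p_s) - np.log(p_t)), axis=-1)))

rng = np.random.default_rng(0)
T, V = 4, 8                                  # toy sequence length and vocab
base = rng.normal(size=(T, V))
loss_no_hint = self_distill_loss(base, base)                          # identical -> 0
loss_hint = self_distill_loss(base, base + rng.normal(size=(T, V)))   # K shifts teacher
```

The loss is zero exactly when the student already matches what it would predict with the extra knowledge in context, which is the fixed point self-distillation drives toward.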

The key to On-Policy Self-Distillation lies in the construction of the Teacher. The method of constructing the Teacher in OPSD shares similarities with MoCo, SimCLR, DINO, SigLIP, etc.

  • Extra knowledge needs to be introduced into the Teacher's context, which relies on the model's in-context learning capability.
  • Since the Policy model is constantly updating, the Teacher model's parameters need to be kept relatively stable.

These two points are what will be discussed in the implementation details below.

3. Implementation Details

3.1 How to Introduce Extra Knowledge

Currently, there are two observed methods:

Method 1: Directly revealing Ground-Truth to the Policy model for reference

Diagram showing Ground-Truth revelation method

Method 2: Derived from environmental feedback

Diagram showing environmental feedback method
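Both methods amount to editing the Teacher's prompt so that its in-context predictions become a better target. A schematic sketch (the prompt templates below are illustrative, not taken from the referenced papers):

```python
def teacher_prompt_with_ground_truth(question: str, ground_truth: str) -> str:
    """Method 1: reveal the reference answer in the Teacher's context."""
    return (
        f"Reference answer (for your eyes only): {ground_truth}\n"
        f"Question: {question}\n"
        f"Answer:"
    )

def teacher_prompt_with_feedback(question: str, attempt: str, feedback: str) -> str:
    """Method 2: feed environment feedback on a previous attempt back into context."""
    return (
        f"Question: {question}\n"
        f"Previous attempt: {attempt}\n"
        f"Environment feedback: {feedback}\n"
        f"Revised answer:"
    )

p1 = teacher_prompt_with_ground_truth("2+2=?", "4")
p2 = teacher_prompt_with_feedback("2+2=?", "5", "incorrect")
```

The student, in contrast, sees only the bare question, so distilling toward the hinted Teacher transfers the knowledge out of the context and into the weights.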

3.2 How to Determine Teacher Model Parameters

  • Directly using the original frozen model works in the early stages but leads to collapse later on.
  • Using the Policy model itself as the Teacher model is also feasible, but the effect is not as good as EMA (Exponential Moving Average).
  • Trust-region and EMA yield similar results; both aim to obtain a more stable Teacher, avoiding drastic changes during the optimization process.

The update strategy for Trust-region is as follows:

Trust-region teacher update strategy (diagram)
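Of the stabilization strategies above, EMA is the simplest to sketch (the trust-region projection is omitted; parameters are toy numpy arrays keyed by name, standing in for real model weights):

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """In-place EMA: teacher <- decay * teacher + (1 - decay) * student.

    A decay close to 1 keeps the Teacher moving slowly, so its distillation
    targets stay stable while the Policy (student) updates every step.
    """
    for name, w in teacher_params.items():
        w *= decay
        w += (1.0 - decay) * student_params[name]

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(10):
    ema_update(teacher, student, decay=0.9)
# after 10 steps, teacher["w"] = 1 - 0.9**10, i.e. roughly 0.651
```

With a realistic decay such as 0.999, the Teacher lags the student by on the order of a thousand steps, which is what prevents the drastic target shifts mentioned above.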

4. Advantages of On-Policy Self-Distillation

4.1 Mitigating Catastrophic Forgetting

The work "Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning" studied this relatively early, although the concept of On-Policy was not yet popular at the time.

Graph showing mitigation of catastrophic forgetting

Comparison graph of forgetting rates

4.2 Reducing Aha Moments

Graph showing reduction in Aha moments

Performance comparison regarding Aha moments

4.3 Reducing Train-Test Gap

Exposing the student model to the test-time distribution already during training helps mitigate exposure bias.

5. Scaling On-Policy Self-Distillation

In general, the larger the model, the greater the margin by which OPSD surpasses GRPO, presumably because larger models have stronger in-context learning capabilities.

Scaling law graph for model size vs performance

Reference

[1] Reinforcement Learning via Self-Distillation
https://arxiv.org/html/2601.20802

[2] Self-Distillation Enables Continual Learning
https://arxiv.org/html/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
https://arxiv.org/html/2601.18734

