Progress in the coding capabilities of large language models is currently constrained by multiple factors: the scarcity of high-quality human-annotated data, the ceiling imposed by teacher-model capability, and the complexity and instability of Reinforcement Learning (RL) methods.
Recently, an Apple team proposed a method called "Simple Self-Distillation" (SSD). This approach samples solutions from the model itself using specific temperature and truncation configurations, then directly uses these unverified samples for standard Supervised Fine-Tuning (SFT).
Experiments show that this method achieves consistent improvements across multiple models in the Qwen and Llama series. On the LiveCodeBench v6 benchmark, the pass@1 metric for the Qwen3-30B-Instruct model increased from 42.4% to 55.3%, with particularly significant gains on difficult problems. SSD offers a complementary post-training direction for enhancing the code generation capabilities of Large Language Models (LLMs).
Paper Link: https://arxiv.org/abs/2604.01193
"Simple Self-Distillation" Without RL
1. What is SSD?
The implementation of SSD involves three steps. First is data synthesis, where researchers sample and generate code solutions from the base model using a higher temperature and specific truncation configurations, generating only one solution per problem. Next is the training phase, where these raw, unverified outputs are used directly as targets for standard supervised fine-tuning. Finally, after training is complete, the fine-tuned model is used for evaluation testing under specific decoding parameter configurations.
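The three steps above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's code: `generate` stands in for whatever decoding stack the model runs on, and the temperature and top-p values are invented placeholders, not the paper's exact configuration.

```python
def build_ssd_dataset(generate, problems, t_train=1.2, top_p=0.95):
    """Step 1 of SSD: sample ONE unverified solution per problem.

    `generate` is any callable wrapping the model's decoder; t_train and
    top_p here are illustrative values, not the paper's exact settings.
    """
    return [
        {"prompt": p,
         # Raw output kept as-is: no execution, no test cases, no filtering.
         "completion": generate(p, temperature=t_train, top_p=top_p)}
        for p in problems
    ]

# Step 2 would feed this dataset to ordinary SFT (cross-entropy on the
# completions); step 3 evaluates the fine-tuned model at temperature Teval.
# Toy stand-in decoder, just to show the data shape:
toy_decoder = lambda p, temperature, top_p: f"def solve():  # attempt for {p}"
dataset = build_ssd_dataset(toy_decoder, ["two-sum", "lru-cache"])
print(len(dataset))  # one record per problem
```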
This method is strikingly unconventional. The entire pipeline relies on neither code execution environments nor test cases to verify answer correctness. It requires no stronger teacher model and involves no complex reinforcement learning algorithm. The data synthesis phase does not filter out incorrect solutions; every raw output the model generates is used directly for training.
Figure | SSD is extremely simple, yet it yields significant improvements on LiveCodeBench v6 across five models spanning two series and three scales, for both instruction and thinking variants. SSD samples training data from the base model at temperature Ttrain, fine-tunes on the raw outputs, and decodes at temperature Teval during evaluation; the method uses no reinforcement learning, verifiers, teacher models, or code execution environments.
2. Experimental Results
On the LiveCodeBench v6 benchmark, the Qwen3-30B-Instruct model's pass@1 increased from 42.4% to 55.3% after SSD training, a relative improvement of 30.4%. On the larger LCB v5 benchmark, pass@1 also rose from 45.8% to 54.3%, a gain of 8.5 percentage points.
Figure | LiveCodeBench v6 performance of the Qwen3-4B-Instruct and Qwen3-30B-Instruct models on the overall, medium-difficulty, and high-difficulty test sets (orange indicates the 4B model, blue the 30B model; dashed lines represent the baseline, solid lines the +SSD results). The annotations at the bottom summarize the overall trend: all five evaluated models improved, with Qwen3-30B-Instruct showing a 30% relative gain in pass rate, and the largest gains appearing on high-difficulty tasks.
This improvement is universal. The method is effective not only on Qwen series models but also performs well on Llama series models. Experiments covered models of various scales including 4B, 8B, and 30B. Whether Instruct versions or Thinking versions, applying SSD resulted in performance gains.
Research also uncovered a key pattern: performance improvements are concentrated on medium and hard problems. On hard problems, pass@1 improved by 15.3 percentage points. Notably, the improvement in pass@5 exceeded that of pass@1, indicating that SSD does not collapse the model's output diversity; instead, it enhances the model's capacity for exploration.
Figure | SSD improved performance for all evaluated models on LiveCodeBench, with the largest gains on medium and high-difficulty problems. Results show performance on LCB v6 and LCB v5 datasets, categorized by difficulty and grouped by reasoning style (Thinking vs. Instruct). In each model pair, the first row is the base model, and the second row is the model with SSD; cell shading indicates changes relative to the base row (green for improvement, red for decline).
3. Why Is Such a Simple Method Effective?
Research points out that SSD's effectiveness stems from its ability to reshape the model's internal probability distribution, thereby resolving the conflict between precision and exploration needs in code generation.
There are two distinct types of positions in the code generation process. The first is the lock, where the distribution forms a sharp peak: a handful of tokens carry most of the probability mass while a long, interfering tail carries the rest; such positions demand precision. The second is the fork, where the distribution spans multiple reasonable tokens that lead to significantly different downstream continuations; such positions demand exploration.
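The lock/fork distinction can be made concrete with a toy entropy calculation. The two distributions below are invented for illustration; they are not taken from the paper:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented next-token distributions:
lock = [0.90, 0.04, 0.02] + [0.004] * 10  # sharp peak + long interfering tail
fork = [0.30, 0.28, 0.22, 0.15, 0.05]     # several reasonable continuations

# A lock is low-entropy (one token dominates); a fork is comparatively
# high-entropy (mass spread over viable branches).
print(entropy(lock) < entropy(fork))
```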
Figure | Token distributions under low and high Teval for the two context types, showing the characteristic head-versus-tail structure. A low Teval preserves lock precision but weakens the effective head of forks (insufficient exploration); a high Teval restores fork exploration but reactivates the interfering tails of locks (reduced precision).
SSD reshapes the internal probability distribution of the model through high-temperature sampling combined with truncation operations. This method suppresses low-probability interfering items in locks while retaining multiple reasonable possibilities in forks. This allows the model to adopt a higher temperature for exploration during the inference phase without destroying the stability of locks.
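A minimal sketch of temperature scaling followed by top-p truncation shows this asymmetric effect: the same rule prunes nearly the whole tail of a sharp, lock-like distribution while keeping several branches of a flat, fork-like one. The logits and parameter values below are invented for illustration and do not reproduce the paper's configuration:

```python
import math

def sample_dist(logits, temperature, top_p):
    """Temperature-scale a logit vector, apply top-p truncation, renormalize."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]
    # Top-p: keep the smallest set of highest-probability tokens whose
    # cumulative mass reaches top_p; zero out everything else.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    out = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(out)
    return [p / z for p in out]

# Invented logits: one dominant token (lock) vs. several near-ties (fork).
lock = sample_dist([6.0, 1.0, 0.5, 0.0, -0.5], temperature=1.5, top_p=0.9)
fork = sample_dist([2.0, 1.9, 1.8, 1.7, -3.0], temperature=1.5, top_p=0.9)
print(sum(p > 0 for p in lock))  # the lock's interfering tail is pruned away
print(sum(p > 0 for p in fork))  # the fork keeps multiple viable branches
```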
Figure | SSD transforms forks into plateau states and locks into sharp peak states. Hatched bars and dashed curves represent the base model; solid bars and solid curves represent the model optimized via SSD; the red dashed truncation line marks the support retained during the SSD process. (a) Fork state: The diffused tail is pruned, but multiple top continuation structures are retained with a more balanced weight distribution, forming a broad plateau above viable branches. (b) Lock state: The same rule applies more aggressive pruning to the tail, concentrating mass on the dominant token to form a sharper peak morphology.
Researchers verified this mechanism through an extreme experiment. They set the training temperature to 2.0 without truncation, which left 62% of the generated data with no extractable valid code. Even with such extremely low-quality training data, SSD still improved model performance, with pass@1 rising by 5.7 percentage points. This result shows that SSD's effectiveness does not depend on the correctness of the generated code, but on the reshaping of the probability distribution.
Figure | (a) With training temperature Ttrain=2.0 and no truncation, representative samples degrade into meaningless sequences; approximately 62% of outputs yield no extractable valid code. (b) The fine-tuned model still outperforms the base model's 42.4%/53.5% pass@1/pass@5, reaching 48.1% and 64.0% respectively.
4. Implications
Research confirms that models can improve their code generation capabilities by training solely on their own raw outputs. Across five different models, SSD consistently improved performance on LiveCodeBench, with benefits concentrated primarily on more difficult problems.
Code generation combines "precision-bound locks" and "exploration-bound forks." SSD reshapes token distributions, allowing the decoding process to explore useful branches while avoiding the introduction of interfering noise. These findings suggest that powerful existing code models harbor untapped potential internally; this capability can be "unlocked" through simple methods without relying on verifiers, teacher models, or reinforcement learning.
Author: Wang Yueran
For reproduction or submission requests, please leave a message directly in the comments section of this article.