Stable-DiffCoder Surpasses Autoregressive Models! New Breakthrough in Code Generation with Diffusion Models

MLNLP is a well-known machine learning and natural language processing community in China and abroad, whose audience includes NLP master's and doctoral students, university faculty, and industry researchers.

The community's vision is to promote communication and progress between academia, industry, and enthusiasts in natural language processing and machine learning, in China and internationally, especially for beginners.

Source | Machine Heart

Diffusion language models (DLLMs) have attracted much attention for their potential advantages, such as accelerated non-autoregressive parallel generation, the ability to draft and edit outputs directly, and a built-in data-augmentation effect. However, their capabilities often lag behind those of autoregressive (AR) models of the same scale.

Recently, Huazhong University of Science and Technology and ByteDance jointly released Stable-DiffCoder. It is not just a new diffusion code model, but also an in-depth exploration of whether diffusion training can raise the upper limit of model capability.

While completely reusing the Seed-Coder architecture and data, Stable-DiffCoder surpasses its AR counterpart by introducing block diffusion continual pre-training (CPT) together with a series of stability optimizations. On multiple mainstream code benchmarks (such as MBPP and BigCodeBench), it not only beats its AR prototype but also outperforms a series of strong 8B-scale open-source models such as Qwen2.5-Coder, Qwen3, and DeepSeek-Coder, showing that the diffusion training paradigm itself is a powerful form of data augmentation.

[Image]

Diffusion Process Struggles to Efficiently Learn Sample Knowledge

Although the diffusion process appears to expand the effective training data and thus act as a form of data augmentation, it actually introduces substantial noise and can even teach the model erroneous knowledge.

For example, in the following case:

[Image]

Mask it as:

[Image]

For the last mask_n, the model sees only a=1 and b=2 while the target is a+b=7, forming an erroneous knowledge mapping. At best, it can learn that a=3, b=4 has a higher co-occurrence probability in the context of a+b=; it cannot learn the underlying addition rule.
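The failure mode above is easy to reproduce. Below is a minimal sketch (the exact strings are hypothetical, since the original figure is unavailable): two addition facts share the surface pattern `a+b=`, and random masking can hide exactly the evidence needed to predict the target correctly.

```python
import random

# Toy sample inspired by the paper's example: two addition facts.
tokens = ["a=1", "b=2", "a+b=3", "a=3", "b=4", "a+b=7"]

def random_mask(tokens, ratio, rng):
    """Mask each token independently with probability `ratio`."""
    return ["[MASK]" if rng.random() < ratio else t for t in tokens]

# At a high mask ratio, "a=3" and "b=4" may be hidden while "a+b=7" is
# the prediction target: the only visible evidence is then "a=1, b=2",
# pushing the model toward the spurious mapping  a=1, b=2  =>  a+b=7.
masked = random_mask(tokens, 0.5, random.Random(0))
```

The higher the mask ratio, the more often such misleading contexts occur, which is exactly the intuition formalized in the next section.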

Token inference knowledge and process design

The paper explains this phenomenon by modeling how such knowledge is learned:

[Image]

Assume c is the currently visible context. Under the true data distribution, the set of tokens that can be inferred at the current position from c is C(c), with size K(c) (the case of multiple tokens being inferred simultaneously behaves the same, so only single-token inference is considered for simplicity). Since C(c) is defined by the true distribution, the cleaner and larger c is, the smaller K(c) becomes.

The distribution the model ultimately aims to learn is the conditional distribution p(x | c), and learning it well requires two conditions: (1) K(c) is small; (2) each context c appears in the training data as often as possible.

Therefore, with a pure bidirectional diffusion process, a large mask ratio shrinks the context c seen by the current token and raises the probability that c is noisy, leading to a larger K(c) and making it hard to map to clear rules. At the same time, many different c are produced, so the average amount of learning per c decreases. Finally, the c sampled during training must match the c encountered during inference, so that the knowledge learned in training can actually be used.
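The two conditions can be written down concretely. This is a hedged reconstruction (the paper's exact formula image is not reproduced here): at each masked position the model fits the empirical conditional distribution of the target token x given the visible context c,

```latex
% Per-context training objective: cross-entropy against the empirical
% conditional distribution of the target token x given visible context c.
\mathcal{L}(c) \;=\; -\,\mathbb{E}_{x \sim p_{\mathrm{data}}(\cdot \mid c)}
\big[\log p_\theta(x \mid c)\big]
```

At the optimum, p_θ(· | c) spreads its mass over the K(c) candidates in C(c); the smaller K(c) and the more occurrences of c in training, the sharper and better-estimated this conditional becomes.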

Next, the paper verifies this analysis with experiments on a 2.5B model: it initializes from an AR model, trains on a new piece of knowledge, and compares three training schemes:

[Image]

(1) AR->BiDLLM: Continue training in an AR manner, and at 100k steps, CPT to a bidirectional DLLM.

(2) ARDLLM->BiDLLM: Use the AR structure but train with a pure bidirectional sampling mode. Then CPT to BiDLLM at 100k steps.

(3) BiDLLM: Train using a pure bidirectional DLLM.

The final result is (1) > (2) > (3), consistent with the theory above. Scheme (1), which involves no random [MASK], compresses knowledge fastest, and after conversion to a BiDLLM it still performs best. This shows that to train a DLLM efficiently, knowledge compression can first be done with AR or with block diffusion using a small block size. Interestingly, at block size 32, schemes (1) and (2) perform worse than (3) before 100k steps but better afterwards. Before 100k steps, the contexts c sampled by AR do not match the contexts c seen at inference with block size 32; however, because AR has already compressed a large amount of useful knowledge, a short CPT suffices to adapt to this inference process. This also suggests that the prior of the AR structure may be better suited to left-to-right inference settings such as prompt+response.

Therefore, the training process is designed as follows: first compress the knowledge with AR, then take the checkpoint before AR annealing and continue CPT into block diffusion with a small block size, to exploit the data-augmentation capability of the diffusion process.
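The target of this CPT stage, block diffusion, changes only the attention pattern relative to AR: tokens attend bidirectionally within their own block and causally to earlier (clean) blocks. A minimal sketch of such a mask (an illustration of the general block-diffusion pattern, not the paper's exact implementation):

```python
import numpy as np

def block_diffusion_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean attention mask for block diffusion: a token may attend to
    any token in its own block (bidirectional within the block) and to
    all earlier blocks (the clean prefix), but never to later blocks."""
    block_id = np.arange(seq_len) // block_size
    return block_id[:, None] >= block_id[None, :]

# block_size=1 degenerates to the causal (AR) mask;
# block_size=seq_len degenerates to a fully bidirectional mask.
m = block_diffusion_mask(6, 2)
```

This view makes the AR-to-block-diffusion CPT natural: it interpolates between the two extremes, and a small block size stays close to the AR prior that already compressed the knowledge.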

Stable DLLM warmup strategy for continual pre-training design

Continual pre-training of diffusion models is usually very sensitive to hyperparameters (such as the learning rate): the grad norm is prone to abnormal spikes, and the effect varies across training architectures. To keep learning stable across architectures and avoid a complex tuning process, the team designed an adaptive warmup strategy.

[Image]

The instability of the DLLM CPT process stems mainly from three causes:

(1) Attention changes from unidirectional to bidirectional

(2) Increased mask makes the task more difficult

(3) To align with the ELBO, the cross-entropy is multiplied by a weighting coefficient. For example, if only one token is masked, only that token's loss is computed, which greatly amplifies that token's effect on the gradient and thus destabilizes the grad norm and the loss.

Since annealing the attention pattern is difficult to adapt flexibly to implementations such as FlashAttention, the team designed the warmup to target (2) and (3). Specifically, during the warmup phase the upper bound of the mask ratio is gradually raised to its maximum, so the task starts easy and becomes hard.

[Image]

Second, during the warmup phase the weighting coefficient in the cross-entropy is removed, so that each token's impact on the loss is more stable:

[Image]
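The two warmup measures can be sketched in a few lines. This is an illustrative sketch, not the paper's exact schedule: it assumes a linear ramp for the mask-ratio upper bound and the standard 1/t ELBO weight for absorbing-state diffusion; the function names are hypothetical.

```python
def mask_ratio_upper_bound(step: int, warmup_steps: int,
                           max_ratio: float = 1.0) -> float:
    """Linearly warm up the upper bound of the sampled mask ratio so the
    denoising task starts easy and becomes harder (linear ramp assumed)."""
    return max_ratio * min(1.0, step / warmup_steps)

def ce_weight(t: float, step: int, warmup_steps: int) -> float:
    """ELBO-aligned cross-entropy weight, typically 1/t for absorbing-state
    diffusion. It explodes when t (the mask ratio) is tiny, so during the
    warmup phase the weight is dropped and every token contributes equally."""
    return 1.0 if step < warmup_steps else 1.0 / t
```

After warmup, the 1/t weight is restored so the training loss is again a proper ELBO bound.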

Block-wise truncated noise schedule

When using block diffusion, because the clean prefix is attached via cross-attention, every token can produce a useful loss. With a traditional noise schedule, however, some blocks produce no loss signal at all. Solving the integral gives the probability that a block produces no signal, which is especially large for small blocks:

[Image]

Therefore, the team made two design choices: (1) force each block to mask at least one token; (2) set the lower bound of the sampled noise level to 1/B, which ensures that at least one token is masked in expectation, and also avoids the problem that, after forcing one masked token, the original corresponding t would be too small and make the cross-entropy weighting too large.
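The effect of truncating the noise schedule is easy to check numerically. Assuming the standard schedule where t is uniform and each token is masked independently with probability t (the paper's figure gives the exact integral), a block of size B gets no loss signal with probability (1-t_min)^B/(B+1):

```python
import random

def p_no_loss(block_size: int, t_min: float = 0.0,
              trials: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the probability that a block of
    `block_size` tokens receives NO masked token (hence no loss signal),
    with noise level t drawn uniformly from [t_min, 1] and each token
    masked independently with probability t.
    Closed form: (1 - t_min)**B / (B + 1)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        t = rng.uniform(t_min, 1.0)
        if all(rng.random() >= t for _ in range(block_size)):
            misses += 1
    return misses / trials

B = 4
base = p_no_loss(B)                # ~1/(B+1) = 0.20: one in five blocks wasted
trunc = p_no_loss(B, t_min=1 / B)  # ~(1-1/B)**B/(B+1) ≈ 0.063 after truncation
```

Truncation alone cuts the wasted-block probability roughly threefold at B=4; combined with forcing at least one masked token per block, no block is ever wasted.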

[Image]

Experimental Results: Leading in multiple code benchmarks at the 8B scale

For Base models

[Image]

Stable-DiffCoder-8B-Base performs excellently in code generation, multi-language code generation, and code reasoning, surpassing a series of AR and diffusion-based models. Notably, the model improves markedly over the AR baseline on low-resource code languages (such as C# and PHP, which have little pre-training data), which supports the claim that the DLLM training process provides a data-augmentation effect. Code reasoning ability is also enhanced.

For Instruct models

Stable-DiffCoder-8B-Instruct has been comprehensively evaluated on code generation, code editing, and code reasoning, showing superior performance. It significantly outperforms the original AR baseline and other DLLMs of around 8B scale on common tasks (HumanEval, MBPP). It reaches the level of Qwen 32B on the closed-source test set MHPP, and on BigCodeBench it surpasses a series of models, second only to the DeepSeek 236B model. It also achieves striking results on the code-editing task CanItEdit.

[Image]

Summary and Outlook

The release of Stable-DiffCoder breaks the stereotype that diffusion models are only good for parallel acceleration. It shows that the diffusion training paradigm itself is an excellent representation-learning method: with reasonable curriculum design and stability optimizations, diffusion models can surpass traditional AR models in code understanding and generation quality.

For the future evolution of large models, Stable-DiffCoder suggests a new path: perhaps we need not abandon AR, but can instead use AR as an efficient knowledge compressor and then use diffusion as a "reinforcer" to further push the upper limit of model intelligence.

