In the world of image and video generation, diffusion models have become the dominant approach. So why, when applied to text generation, do they often produce garbled text or repetitive words?

The core issue is that text is fundamentally composed of discrete tokens, whereas diffusion models excel at handling continuous data. In the past, to adapt diffusion models for text generation, researchers have primarily pursued two strategies:

1. Discrete Diffusion Language Models: These define the diffusion process directly in the discrete token space. For instance, they might mask tokens with a [MASK] symbol and then progressively unmask them, or perturb tokens toward a near-uniform distribution before correcting them step-by-step. This method has been the mainstream approach in recent years and generally delivers stronger overall performance.

2. Continuous Diffusion Language Models: This approach first maps tokens into continuous embedding vectors, performs denoising in this continuous space, and then maps the results back to discrete tokens. This route is theoretically more natural and closely mirrors methods used in visual diffusion models. However, its practical effectiveness has historically lagged behind discrete methods.
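To make the contrast concrete, the sketch below corrupts the same short sequence both ways. It is an illustration only: the helper names, the toy 2-dimensional embeddings, and the linear blend with Gaussian noise are assumptions for this example, not details taken from any specific model.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Discrete corruption: independently replace each token with [MASK]."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

def noise_embeddings(embeddings, t, rng):
    """Continuous corruption: blend each embedding with Gaussian noise.

    Assumed convention: t = 0 gives pure noise, t = 1 gives the clean embedding.
    """
    return [
        [(1.0 - t) * rng.gauss(0.0, 1.0) + t * value for value in vector]
        for vector in embeddings
    ]

rng = random.Random(0)
tokens = ["the", "cat", "sat"]
# Hypothetical 2-dimensional embeddings, one vector per token.
embeddings = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]

fully_masked = mask_tokens(tokens, mask_prob=1.0, rng=rng)
untouched = mask_tokens(tokens, mask_prob=0.0, rng=rng)
half_noised = noise_embeddings(embeddings, t=0.5, rng=rng)
clean = noise_embeddings(embeddings, t=1.0, rng=rng)
```

In the discrete case a token is either intact or replaced, whereas in the continuous case every vector moves smoothly between noise and data as `t` changes.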

To tackle this challenge, a team led by Kaiming He, an associate professor at MIT and a distinguished scientist at Google DeepMind, has introduced "Embedded Language Flows" (ELF). This is a new class of diffusion models based on continuous-time flow matching that operates within a continuous embedding space.

Unlike existing diffusion language models, ELF stays in the continuous embedding space for nearly all timesteps. It maps the result onto discrete tokens only at the final timestep, using a shared-weight network. This design lets it directly reuse mature techniques from image diffusion models.

Paper Link: https://arxiv.org/abs/2605.10938

The results demonstrate that continuous diffusion language models can be highly competitive even with minimal handling of discretization. Without distillation, ELF reaches lower generative perplexity in fewer sampling steps while using roughly one-tenth of the training tokens of previous methods.

Figure: Without distillation, ELF achieves lower generative perplexity than prior DLMs with fewer sampling steps. Simultaneously, ELF uses 10 times fewer training tokens.

Continuous Generation First, then Discrete Decoding

ELF's core methodology involves first mapping discrete tokens into a continuous embedding space. In this space, it uses continuous-time flow matching to model the denoising trajectory from Gaussian noise to clean embeddings. At the final timestep, the model switches to a decoding mode and decodes the result back into discrete tokens.
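One common way to write such a trajectory, offered here as an assumption rather than the paper's exact formulation, is the linear interpolation path used in many flow-matching models. Here x is the clean embedding and t runs from 0 (noise) to 1 (data), matching the convention in the article's figures; the symbols \epsilon and v are introduced only for this illustration.

```latex
\begin{aligned}
z_t &= (1 - t)\,\epsilon + t\,x, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1] \\
v(z_t, t) &= \frac{\mathrm{d} z_t}{\mathrm{d} t} = x - \epsilon = \frac{x - z_t}{1 - t} \qquad (t < 1)
\end{aligned}
```

Under this path, a network that predicts the clean embedding from z_t also determines the velocity needed to move z_t toward the data, since x - z_t = (1 - t)(x - \epsilon).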

Figure: A conceptual illustration of ELF. Orange dots represent data representations in the continuous embedding space. The purple lines show the denoising trajectory from Gaussian noise to clean embeddings. Discretization occurs only at the final timestep (t=1) via a shared-weight network.

During the training phase, the research team uses a pre-trained T5 encoder to convert text tokens into context-aware continuous embeddings. Each embedding corresponds to a token but is not a specific word from the vocabulary; instead, it is a vector representation of that token within its context. Subsequently, ELF models the denoising process in this continuous embedding space, learning a continuous flow path from noise to clean embeddings.

During the inference phase, ELF does not invoke the encoder. The model progressively generates text representations in the continuous embedding space and switches to decoding mode at the final timestep, outputting tokens via a shared-weight network and a learnable inverse embedding matrix.
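The inference procedure described above can be sketched roughly as follows, for a single token position. This is a minimal illustration under stated assumptions: `predict_clean` stands in for the trained network, the plain Euler update on a linear noise-to-data path is a common flow-matching choice rather than ELF's confirmed sampler, and `unembedding` is a hypothetical stand-in for the learnable inverse embedding matrix. In ELF the final step also passes through the shared network in decoding mode, which this toy version omits.

```python
import random

def euler_sample(predict_clean, dim, num_steps, rng):
    """Integrate from Gaussian noise (t = 0) toward a clean embedding (t = 1)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x_hat = predict_clean(z, t)
        # Velocity implied by a clean-embedding prediction on a linear path.
        velocity = [(xh - zi) / (1.0 - t) for xh, zi in zip(x_hat, z)]
        z = [zi + dt * vi for zi, vi in zip(z, velocity)]
    return z

def decode(z, unembedding):
    """Return the token whose unembedding row has the largest dot product with z."""
    scores = {
        token: sum(w * zi for w, zi in zip(row, z))
        for token, row in unembedding.items()
    }
    return max(scores, key=scores.get)

# Toy setup: a "network" that always predicts the same clean embedding.
target = [1.0, 0.0]

def constant_predictor(z, t):
    return target

rng = random.Random(0)
final_z = euler_sample(constant_predictor, dim=2, num_steps=8, rng=rng)
token = decode(final_z, {"cat": [1.0, 0.0], "dog": [0.0, 1.0]})
```

Note that no encoder appears anywhere in the loop: the sampler only needs the denoiser and, at the very end, the projection back to the vocabulary.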

A key design feature of ELF is that a single network handles both denoising and decoding, with a binary mode token telling it which function to perform. During training, the model is routed to the denoising branch and the decoding branch at an 80% to 20% ratio, trained with MSE loss and cross-entropy loss respectively.
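A rough sketch of how such a mixed objective could be wired up is shown below. The function names, the per-example random choice of branch, and the toy model are assumptions made for illustration; the article does not say whether the 80/20 split is applied per example, per batch, or in some other way.

```python
import math
import random

def mse_loss(prediction, target):
    """Mean squared error between a predicted and a clean embedding."""
    return sum((p - x) ** 2 for p, x in zip(prediction, target)) / len(target)

def cross_entropy_loss(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target_index]

def choose_mode(rng, denoise_prob=0.8):
    """Pick the branch: denoising with probability 0.8, decoding otherwise."""
    return "denoise" if rng.random() < denoise_prob else "decode"

def training_loss(model, z_t, clean_embedding, token_index, rng):
    """Run the shared model in one mode and apply the matching loss."""
    mode = choose_mode(rng)
    output = model(z_t, mode)
    if mode == "denoise":
        return mode, mse_loss(output, clean_embedding)
    return mode, cross_entropy_loss(output, token_index)

def toy_model(z_t, mode):
    # Hypothetical stand-in: echoes the input when denoising,
    # returns fixed logits over a 3-token vocabulary when decoding.
    return z_t if mode == "denoise" else [2.0, 0.0, 0.0]

rng = random.Random(0)
modes = [choose_mode(rng) for _ in range(10_000)]
denoise_fraction = modes.count("denoise") / len(modes)
mode, loss = training_loss(toy_model, [0.1, 0.2], [0.1, 0.2], 0, rng)
```

The point of the single `model(z_t, mode)` call is that both losses update the same weights, which is what lets the decoder come "for free" with the denoiser.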

Furthermore, the research team introduced a self-conditioning mechanism. During inference, the model uses the prediction from the previous step as a condition for the next denoising step, rather than starting from scratch. This not only improves generation quality but also provides a readily available conditional signal source for Classifier-Free Guidance (CFG), incurring almost no extra computational burden.
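The sketch below shows one way these two ideas can fit together, using the standard classifier-free guidance extrapolation. It rests on several assumptions: the model signature, the all-zeros placeholder for "no condition", and the toy model are hypothetical, and this naive version calls the model twice per step, so it does not reflect however ELF keeps the added cost low.

```python
def guided_prediction(model, z, t, previous_x_hat, guidance_scale):
    """Combine predictions made with and without the self-conditioning input."""
    blank = [0.0] * len(z)  # assumed placeholder meaning "no condition"
    conditional = model(z, t, previous_x_hat)
    unconditional = model(z, t, blank)
    # Standard CFG form: start at the unconditional output and move
    # toward (or past) the conditional one.
    return [u + guidance_scale * (c - u) for c, u in zip(conditional, unconditional)]

def toy_model(z, t, condition):
    # Hypothetical stand-in: averages the noisy input and the condition.
    return [0.5 * (zi + ci) for zi, ci in zip(z, condition)]

z = [2.0, 4.0]
x_hat = [0.0, 0.0]  # no previous prediction exists at the first step
history = []
for t in (0.0, 0.5):
    # Each step reuses the previous step's prediction as its condition.
    x_hat = guided_prediction(toy_model, z, t, x_hat, guidance_scale=1.0)
    history.append(x_hat)

stronger = guided_prediction(toy_model, z, 0.5, [1.0, 2.0], guidance_scale=2.0)
```

With a guidance scale of 1.0 the result equals the self-conditioned prediction; larger scales push further in the direction the condition suggests.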

Figure: During training, discrete tokens are first encoded into clean embeddings x, then perturbed into z_t. ELF uses z_t to predict x̂. The model can be trained using one of two losses: a denoising loss L_MSE, or a per-token cross-entropy loss L_CE. During inference, ELF starts from Gaussian noise z_0 and iteratively denoises the embeddings from z_t to z_{t+1}. Only at the final step does ELF switch to decoding mode, projecting the final embedding back to discrete tokens via the inverse embedding layer.

Fewer Sampling Steps, Lower Training Budget

The research team tested ELF on three types of tasks: unconditional text generation on OpenWebText (OWT), machine translation on the WMT14 German-to-English task, and news summarization on XSum.

For unconditional generation, the main model variant, ELF-B, has 105M parameters. In system-level comparisons on OWT, without additional distillation, ELF-B reduced generative perplexity to 24 using only 32 sampling steps, outperforming the discrete and continuous diffusion language model baselines it was compared against. In terms of training budget, ELF used approximately 45.2B effective training tokens. In contrast, baselines like MDLM, Duo, and LangFlow used around 524.3B, distilled versions like MDLM+SDTT and Duo+DCD used 550.5B, and FMLM used 576.7B.
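As a quick check on the training-budget comparison, the ratios can be computed directly from the token counts quoted above. The variable names are illustrative; the numbers are the ones reported in this article.

```python
ELF_TOKENS_B = 45.2  # ELF's effective training tokens, in billions

baseline_tokens_b = {
    "MDLM / Duo / LangFlow": 524.3,
    "MDLM+SDTT / Duo+DCD": 550.5,
    "FMLM": 576.7,
}

# How many times more training tokens each baseline used than ELF.
ratios = {name: tokens / ELF_TOKENS_B for name, tokens in baseline_tokens_b.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x ELF's training tokens")
```

By these figures the baselines used between about 11.6 and 12.8 times as many training tokens as ELF, which is in line with the "roughly one-tenth" summary.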

Figure: System-level comparisons. ELF-B outperforms both discrete and continuous diffusion language models under similar experimental settings (a); it also shows competitive performance against baseline models that require additional distillation training (b); meanwhile, it uses significantly fewer training tokens (c).

For conditional generation, ELF-B achieved a BLEU score of 26.4 on the WMT14 German-to-English translation task. On the XSum summarization task, its ROUGE-1/ROUGE-2/ROUGE-L scores were 36.0, 12.2, and 27.8, respectively. Compared to autoregressive and diffusion language models of similar scale, ELF-B achieved the highest results on both tasks.

Figure: Results on machine translation and summarization tasks. The team evaluated ELF-B on the WMT14 German-to-English (De-En) translation and XSum summarization tasks, comparing it against baselines with similar parameter counts. † indicates results taken directly from existing work (also the default source for the De-En task); ‡ indicates results reproduced by the research team using public codebases (also the default source for the XSum task). For XSum, standard errors across different evaluation samples are reported where available. ELF achieves the best performance in both task settings.

Moreover, ablation studies revealed that context-aware embeddings from a pre-trained encoder outperform both vanilla token embeddings and learnable embeddings. The shared-weight denoiser-decoder performs comparably to a separately trained decoder but with a simpler pipeline. For sampling, an SDE-inspired sampler is superior to an ODE sampler in few-step generation. The research team also noted that scaling the model from 105M to 342M and 652M parameters improved the trade-off in both directions: lower generative perplexity at similar diversity levels, and higher text diversity at similar generative perplexity.

Figure: Ablation experiments on key design choices.

Limitations and Future Directions

The research team points out that the current ELF model still has limitations, primarily in the following areas:

1. Model Scale Remains Limited

The evaluated models mainly have 105M, 342M, and 652M parameters. ELF was not directly compared against large-scale instruction-tuned models like GPT-4, Claude, or Llama. Therefore, ELF demonstrates its competitiveness among similar diffusion language models, not as a full replacement for mainstream autoregressive large language models.

2. Task Scope Remains Limited

In the research experiments, generative perplexity on OpenWebText is a proxy metric and does not directly represent real-world user preferences. The WMT14 and XSum results demonstrate translation and summarization performance but do not cover complex reasoning, long-context dialogue, code generation, or multi-turn interactions.

3. Reliance on a Pre-trained Encoder for the Continuous Space

The research team tested encoders trained from scratch and non-contextual embeddings, but pre-trained contextual embeddings still performed best. This result suggests that ELF's effectiveness partly derives from an existing pre-trained encoder, rather than learning a continuous language space entirely from scratch.

4. Real-world Deployment Costs Have Not Been Verified

The research team reported sampling steps, training token budgets, and automated metrics. They did not report end-to-end latency, throughput, or memory costs in a real serving environment, nor did they directly compare against established deployment solutions for autoregressive models. Therefore, whether ELF's savings in sampling steps and training tokens translate to real-world deployment costs still needs to be verified.

Kaiming He's Team Unveils 'Diffusion Model' Breakthrough: Discrete Decoding at the 'Last Mile'
