Kaiming He's Team Debuts First Language Model! 105M Parameters, 45B Training Tokens, Continuous Diffusion Route Outperforms Mainstream Discrete DLMs

Kaiming He's team has released a new work: a language model.

This time, the model he led does not follow the familiar autoregressive "next-token prediction" paradigm behind systems like ChatGPT.

Instead, it is another new route that has become hugely popular in the image domain in recent years and is now being increasingly ported into text generation: Diffusion Language Models (DLM).

In their latest paper, Kaiming He's team released a new continuous diffusion language model: ELF (Embedded Language Flows).

ELF model overview diagram

Unlike many language models that still perform diffusion at the token level, ELF keeps the entire generation process within a continuous embedding space, only re-discretizing the representation back into tokens at the very last step.

With this design, ELF, using only 105M parameters, 45B training tokens, and 32 sampling steps, has directly outperformed a number of mainstream diffusion language models.

The most intuitive metric is that on OpenWebText, it drove the Generative Perplexity all the way down to 24.

Briefly, generative perplexity has a strong language model "check the homework" of the generated text, judging how closely it resembles real, human-written corpora.

The lower the value, the higher the generation quality: the model's output feels less "AI-ish" and more natural.
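As a rough illustration of the metric, perplexity is the exponential of the negative mean log-probability that the evaluator model assigns to each generated token. The sketch below assumes those per-token log-probabilities are already available; the numbers are made up for illustration and do not come from the paper.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical log-probabilities an evaluator model might assign to two samples.
fluent = [math.log(0.25), math.log(0.5), math.log(0.125)]
garbled = [math.log(0.01), math.log(0.02), math.log(0.005)]

ppl_fluent = perplexity(fluent)    # geometric-mean probability 0.25
ppl_garbled = perplexity(garbled)  # geometric-mean probability 0.01
print(ppl_fluent, ppl_garbled)
```

Text the evaluator finds more predictable gets the lower score, which is why a lower generative perplexity is read as more natural output.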

In comparisons with mainstream diffusion language models, ELF, using nearly 10 times fewer training tokens and fewer sampling steps, actually achieved lower generative perplexity.

Comparison chart of generative perplexity between ELF and other models

For a long time, progress in diffusion language models occurred almost entirely on the discrete DLM side.

ELF proves one thing for the first time: the continuous approach not only works, but works well.

What Exactly Did ELF Do

To understand ELF, one must first understand what diffusion language models are currently doing.

Diffusion language models mainly follow two technical routes.

The first is the discrete camp, represented by MDLM and Duo, which performs diffusion directly in the token space, processing discrete random variables at each step.
The second is the continuous camp, including Diffusion-LM, CDCD, and DiffuSeq, which maps tokens into continuous embeddings and then denoises in the continuous space.

Comparison diagram between discrete and continuous diffusion approaches

In previous research, discrete routes like MDLM, LLaDA, and Dream 7B held the upper hand. The reason is simple: language itself is discrete.

Kaiming He's team reached exactly the opposite judgment about this seemingly common-sense view:

Maybe the problem isn't that "language must be discrete"; maybe the problem is: previous approaches never let the continuous route be continuous all the way through.

Methods like Diffusion-LM denoise in the embedding space but calculate a token-level cross-entropy at every step, tying the continuous trajectory to the vocabulary at each turn.

Later models like LD4LG and Cosmos follow a latent diffusion route, where the denoising process is continuous, but they require training a separate decoder to map the latent back to tokens, essentially adding an extra module.

Based on this, ELF keeps all denoising entirely within the continuous embedding space; only at the final step t=1 does it project back to tokens.

Illustration of ELF's continuous diffusion process in embedding space

Specifically, during ELF's training, discrete tokens are first encoded into continuous embeddings. Noise is then added to form z_t, and the model is tasked with either reconstructing the clean embedding (MSE) or directly predicting the token (CE).

Diagram of ELF's training process with embedding and loss functions

During inference, the model starts from Gaussian noise z_0, denoises all the way within the continuous space, and only at the final step switches to decode mode, projecting the embeddings back into tokens.

ELF completely separates the two aspects—"continuous representation" and "discrete output"—which were previously thought to require constant alignment:

The intermediate denoising is fully entrusted to the continuous space; the final language generation is left to the very last step of discretization.

There is no hard alignment onto the vocabulary at every step, and no need to train an extra decoder. For the first time, the entire generation process truly achieves:

Continuous is continuous, discrete is discrete.

And this is precisely the key reason why ELF can outperform a host of diffusion language models using fewer sampling steps and fewer training tokens.

ELF Is Not "Diffuse First, Then Decode."

In its specific implementation, ELF also tackles three problems:

How do tokens become continuous? How does one denoise in the continuous domain? And finally, how does one convert back to tokens?

1. Turning Tokens into Continuous Embeddings

To apply continuous diffusion to language, the first step is to transform discrete tokens into a continuous representation.

In the paper, ELF first tokenizes the input into a token sequence, then maps it into a continuous embedding space. There are several choices for this mapping.

By default, ELF uses a T5 pre-trained encoder to generate bidirectional contextual embeddings. Later in the paper, they also test different schemes like jointly trained embeddings and random embeddings.

It's worth noting that this encoder is only used during the training phase and does not add extra modules during inference.
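To make the token-to-embedding step concrete, here is a minimal sketch of the simplest variant mentioned above, a fixed random embedding table. It is not the T5 contextual encoder that ELF uses by default; the vocabulary, the dimension, and the helper names `build_embedding_table` and `embed` are all hypothetical.

```python
import random

def build_embedding_table(vocab, dim, seed=0):
    """Assign each vocabulary entry a fixed random Gaussian vector."""
    rng = random.Random(seed)
    return {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in vocab}

def embed(tokens, table):
    """Map a token sequence to a sequence of continuous vectors."""
    unk = table["<unk>"]  # fallback for out-of-vocabulary tokens
    return [list(table.get(tok, unk)) for tok in tokens]

# Toy vocabulary and whitespace tokenization, for illustration only.
vocab = ["the", "cat", "sat", "<unk>"]
table = build_embedding_table(vocab, dim=4)
tokens = "the cat sat".split()
embeddings = embed(tokens, table)
print(len(embeddings), len(embeddings[0]))
```

A contextual encoder would differ in one important way: the vector for each token would depend on its neighbors, rather than being a fixed lookup.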

2. Doing Flow Matching in the Continuous Embedding Space

Once the continuous representation is obtained, ELF performs Flow Matching within the embedding space. Simply put, Flow Matching defines a continuous flow trajectory from noise to real data:

At t=0, it is Gaussian noise;
At t=1, it is the clean embedding;
All intermediate states are linear interpolations of the two, which is the rectified flow mentioned in the paper.
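The linear path described above can be written as z_t = (1 - t) * noise + t * clean. The following sketch applies it to a single toy embedding vector; the function name `interpolate` and the example values are assumptions for illustration.

```python
import random

def interpolate(noise, clean, t):
    """Return z_t = (1 - t) * noise + t * clean, elementwise."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * n + t * x for n, x in zip(noise, clean)]

rng = random.Random(0)
clean = [0.5, -1.0, 2.0]                      # toy clean embedding
noise = [rng.gauss(0.0, 1.0) for _ in clean]  # Gaussian noise sample

z0 = interpolate(noise, clean, 0.0)      # pure noise at t = 0
z_half = interpolate(noise, clean, 0.5)  # halfway along the straight path
z1 = interpolate(noise, clean, 1.0)      # clean embedding at t = 1
print(z0, z_half, z1)
```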

In traditional Flow Matching, the network typically predicts the "velocity field" v. ELF does not do this. Instead, it follows the idea proposed half a year ago in the Kaiming He team's paper "Back to Basics: Let Denoising Generative Models Denoise": directly predicting the clean embedding x, that is, x-prediction.

Diagram explaining x-prediction and velocity prediction in flow matching

The training objective is to minimize the Mean Squared Error (MSE) between the predicted embedding and the true embedding.
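On the linear path z_t = (1 - t) * noise + t * x, the two parameterizations are related by v = x - noise = (x - z_t) / (1 - t), so a predicted x can always be turned into a velocity when one is needed. The sketch below shows the MSE objective on x and that conversion for one toy vector; the function names and values are illustrative assumptions, not the paper's code.

```python
def mse(pred, target):
    """Mean squared error between two vectors of equal length."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def velocity_from_x(x_pred, z_t, t):
    """Convert an x-prediction into a velocity on the linear path."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1) to avoid dividing by zero")
    return [(x - z) / (1.0 - t) for x, z in zip(x_pred, z_t)]

clean = [0.5, -1.0, 2.0]
noise = [1.0, 0.0, -2.0]
t = 0.25
z_t = [(1.0 - t) * n + t * x for n, x in zip(noise, clean)]

imperfect = [0.5, -1.0, 1.0]              # hypothetical network output
loss = mse(imperfect, clean)              # x-prediction training loss
velocity = velocity_from_x(clean, z_t, t) # equals clean - noise for a perfect prediction
print(loss, velocity)
```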

As for why x-prediction is adopted, the paper gives two reasons:

First, it is more stable on high-dimensional representations, such as token embeddings of 768 dimensions or more;
Second, it naturally aligns with the goal of "predicting clean tokens" at the final step.

The paper also specifically notes: although it is theoretically possible to predict velocity v first and then convert to x, doing so would make it very difficult for the later denoising and decoding stages to share weights.

Experimentally, they also found that once weights are shared, v-prediction performance degrades significantly.

3. Back from Continuous Embeddings to Discrete Tokens

For generating language, the final output is still discrete tokens.

Therefore, ELF needs to project the continuous embedding back into the token space only at the final time step (t = 1). However, unlike many latent diffusion methods, ELF does not train an extra decoder for this step. Instead, it directly treats the final step as:

a one-shot continuous-to-discrete decoding.

In other words: the decoder and the preceding denoiser are the same network.

To prevent the final step from being too easy during training (since theoretically, as t→1, the input is already very close to the clean embedding), ELF adds an additional token-level corruption at the last step, constructing a perturbed input.

Subsequently, the same network outputs the clean embedding, which is then projected into token logits via a learnable unembedding matrix W.

The training objective is the standard token-level cross-entropy loss. The entire network shares the same set of parameters and additionally receives a binary mode token: denoising mode / decoding mode.
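The decoding objective can be sketched for a single position as follows: the predicted embedding is multiplied by the unembedding matrix W to get one logit per vocabulary entry, and the loss is the cross-entropy against the true token index. The tiny matrix and vectors here are made up for illustration, and the mode token and the token-level corruption are left out.

```python
import math

def logits_from_embedding(x_hat, W):
    """One logit per vocabulary entry: dot product of each row of W with x_hat."""
    return [sum(w * x for w, x in zip(row, x_hat)) for row in W]

def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

# Hypothetical unembedding matrix: 3 vocabulary entries, embedding dimension 2.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x_hat = [2.0, 0.0]  # hypothetical predicted clean embedding

logits = logits_from_embedding(x_hat, W)
loss_right = cross_entropy(logits, 0)  # target is the highest-scoring token
loss_wrong = cross_entropy(logits, 2)  # target is the lowest-scoring token
print(logits, loss_right, loss_wrong)
```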

During inference, ELF starts from Gaussian noise and denoises all the way in the continuous space, only switching to decode mode at the final step t = 1, then outputting the final tokens via argmax.
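The inference loop can be sketched as Euler integration of the flow followed by an argmax. This is a simplified illustration under several assumptions: it handles a single embedding vector rather than a sequence, the `denoiser` is a stand-in oracle instead of a trained network, and the final decode is a plain unembedding dot product rather than the shared network running in decode mode. All names are hypothetical.

```python
import random

def sample(denoiser, unembed, dim, num_steps=32, seed=0):
    """Integrate from Gaussian noise at t = 0 to t = 1, then decode by argmax."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                   # t stays below 1, so 1 - t is never zero
        x_hat = denoiser(z, t)       # x-prediction of the clean embedding
        # Euler step using the velocity implied by x_hat: (x_hat - z) / (1 - t).
        z = [zi + dt * (xi - zi) / (1.0 - t) for zi, xi in zip(z, x_hat)]
    logits = unembed(z)
    token = max(range(len(logits)), key=lambda k: logits[k])
    return token, z

# Toy setup: three vocabulary embeddings and an oracle that always predicts token 1.
table = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
target = table[1]
oracle = lambda z, t: list(target)
unembed = lambda z: [sum(w * x for w, x in zip(row, z)) for row in table]

token, final_z = sample(oracle, unembed, dim=2)
print(token, final_z)
```

With a perfect denoiser the trajectory is a straight line that lands on the target embedding, so the discretization happens only once, after the last continuous step.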

It's worth mentioning that one of the most common techniques in image generation, CFG (classifier-free guidance), has also been ported into ELF.

ELF uses self-conditioning as the conditioning signal and applies training-time CFG (one forward pass simulating two inferences, with no inference overhead), directly bringing over the scheme from the image domain.
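For background, the common inference-time form of CFG runs the model twice, with and without the conditioning signal, and extrapolates from the unconditional output toward the conditional one. The sketch below shows only that standard combination; ELF's training-time variant, which folds the effect into a single forward pass, is not reproduced here, and the function name and values are assumptions.

```python
def cfg_combine(uncond, cond, scale):
    """Standard CFG combination: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Hypothetical model outputs without and with the conditioning signal.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

no_guidance = cfg_combine(uncond, cond, 0.0)  # reduces to the unconditional output
plain_cond = cfg_combine(uncond, cond, 1.0)   # reduces to the conditional output
guided = cfg_combine(uncond, cond, 2.0)       # pushes past the conditional output
print(no_guidance, plain_cond, guided)
```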

Experimental Comparison

In the experimental section, ELF basically answers a question that has been hanging in the air for the past two years:

Can continuous diffusion language models actually compete? The answer: not only can they compete, but for the first time, they win simultaneously across three dimensions: quality, speed, and training cost.

As mentioned at the beginning, in the OpenWebText generation task, without any distillation, ELF pushed generative perplexity down to 24 using only 32 sampling steps.

Previously, mainstream discrete diffusion models often required running up to 1024 steps to approach this level.

Bar chart comparing generative perplexity and sampling steps of different models on OpenWebText

Even more striking, ELF achieved this result using only 45B training tokens.

Comparable competitors generally use 500B+ training tokens. In other words: with an order of magnitude fewer sampling steps and an order of magnitude less training data, the results are even better.

Moreover, on conditional generation tasks where many diffusion models are most prone to falling behind, ELF didn't drop the ball.

Whether on WMT14 machine translation or XSum text summarization, ELF consistently surpasses existing diffusion language models, even outperforming many autoregressive baselines.

Table showing ELF's performance on WMT14 and XSum tasks compared to other models

The summary given at the end of the paper is actually quite restrained: ELF achieves a very strong trade-off between generation quality, sampling efficiency, and training cost.

Translated into plain English: the continuous camp isn't incapable. It's just that before, nobody had really taken the "continuous" approach all the way.

Author Introductions

Finally, let's introduce the authors of this paper. The two first authors contributed equally.

Keya Hu, one of the two first authors of this paper, is a first-year PhD student in EECS at MIT. She is also one of the first cohort of PhD students Kaiming He has advised at MIT, currently co-advised by Kaiming He and Jacob Andreas.

Photo of Keya Hu

She completed her undergraduate studies in the ACM Honors Class at Shanghai Jiao Tong University. Her current research interests primarily lie at the intersection of language and vision, aiming to build agents with higher data efficiency and stronger generalization capabilities.

It's noteworthy that on Kaiming He's MIT homepage, Keya Hu is listed first among grad students, making her essentially the "senior lab member" of the group.

Screenshot showing Keya Hu listed first on Kaiming He's MIT lab page

The second first author, Linlu Qiu, is also a PhD student at MIT, studying under Yoon Kim.

Photo of Linlu Qiu

She graduated with a bachelor's degree from the University of Hong Kong and a master's from the Georgia Institute of Technology. Previously, she also worked as an AI Resident at Google.

Interestingly, this is not her first collaboration with Kaiming He. Not long ago, she and Kaiming He's team also had a paper accepted at CVPR 2026, titled "ARC Is a Vision Problem!", which reframed the ARC reasoning problem as a vision problem.

Abstract image from the ARC Is a Vision Problem paper

Another author, Hanhong Zhao, is an MIT undergraduate. He attended the High School Affiliated to Renmin University of China and was a gold medalist in the International Physics Olympiad (IPhO).

Photo of Hanhong Zhao from MIT mathematics department website

There is also author Yiyang Lu, whose background has a bit of a "child prodigy" flavor.

Photo of Yiyang Lu

He is a sophomore in the Yao Class at Tsinghua University and is currently an intern at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), with Kaiming He as his advisor. His main research directions are computer vision and deep generative models.

During high school, he was a physics competition student. He ranked first among Jiangsu contestants and ninth nationally, winning a gold medal at the 39th Chinese Physics Olympiad (CPhO) in 2022.

Previously, he was first author on a paper co-authored with Kaiming He, titled "Bidirectional Normalizing Flow: From Data to Noise and Back."

Diagram from the Bidirectional Normalizing Flow paper

Another core author, Tianhong Li, is a postdoc in Kaiming He's group.

Photo of Tianhong Li

He completed his undergraduate studies in the Yao Class at Tsinghua and received his PhD from MIT. He was the first author of the paper "Back to Basics: Let Denoising Generative Models Denoise" published half a year ago.

Furthermore, the paper's other authors include Yoon Kim and Jacob Andreas, two professors in MIT EECS focusing on language models, as well as Kaiming He himself.

Reference Link

https://arxiv.org/pdf/2605.10938