In a CNBC interview at the start of the year, Demis Hassabis offered a judgment: AGI is still missing a piece of the puzzle, and that piece might be the world model.
He stands by his prediction of "achieving AGI in 5-10 years." When he founded DeepMind in 2010, he estimated it would be a 20-year project, and progress now seems on track. However, he also admits that while scaling laws still work, the returns are diminishing. "'Diminishing returns' and 'zero returns' are two different things; we are still at the stage where the returns are very good and worth continuing to invest in." The key question is not whether scaling laws have hit a ceiling, but whether they alone can take us to AGI. Hassabis's judgment: probably not.
He uses "jagged intelligence" to describe current large models: they perform remarkably well along certain dimensions, but change how you pose the question and the flaws show. True general intelligence should not be this inconsistent. The key capabilities LLMs currently lack: continuously learning new things, truly creating original content, and proposing new scientific hypotheses.
So, the world model might be that missing piece. Its difference from LLMs lies in the fact that LLMs primarily process text and static content, but the ability to understand causal relationships in the physical world and perform long-term planning is missing. "If you want to explain previously unknown things in the world—this is exactly what scientific theories do—you must have an accurate model of how the world works."
DeepMind has multiple parallel research directions on world models. Hassabis mentioned Genie (an interactive world model), which focuses on generating interactive 3D environments from text or images; the latest Genie 3 generates interactive worlds in real time at 720p and 24fps and is used for training embodied agents. Veo (a video generation model) focuses on high-quality video generation and demonstrates a deep understanding of physics. Genie 3 is built on the physical understanding of Veo 3.
However, Hassabis's interview is from a CEO's perspective, discussing more strategic aspects and fewer technical details. What is the specific mechanism for using world models to train agents? Where are the current bottlenecks? I couldn't find good answers to these questions.
Then I came across Danijar Hafner's podcast interview (BuzzRobot channel). He is a Staff Research Scientist at Google DeepMind and the author of the Dreamer series. Dreamer is another of DeepMind's world-model research directions, with a different focus from Genie/Veo—more on that later. Hafner not only does frontier research but has also scaled models to the size of frontier video models, so his perspective combines theoretical depth with engineering pragmatism.
Speaking of which, it's quite risky for researchers in AI labs to publicly discuss internal progress. xAI researcher Sulaiman Khan Ghori did a podcast last week, discussing many internal details: the company's flat structure, daily adjustments to models on the "Macrohard" project, plans to use idle Tesla vehicles as "human simulator" agents, and scaling to a million such AI workers. The podcast went live on January 15th, and he resigned on Monday, changing his personal profile to "MACROHARD @xAI prev." The outside world speculates he was asked to leave for leaking too much information.
By contrast, Google is much more open. In the podcast, Hafner discussed a great deal of DeepMind's world-model progress, including some unpublished scaling experiment results.
World Model: Learning in Imagination
First, let's clarify the concept.
The core idea of the world model is: instead of letting a robot fall ten thousand times in the real world to learn to walk (expensive, dangerous, slow), first learn a model that can predict changes in the physical world, then train extensively in this "imagined" world. Falling ten thousand times in imagination costs almost nothing.
This differs from traditional reinforcement learning in that traditional methods let the agent interact directly with the environment through trial and error, with each trial having a cost. The world model's approach is to first learn to predict "what the environment will become if I do X," then have the agent practice extensively in this predicted world, and finally verify in the real environment.
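The loop described above can be sketched in a few lines. This is a deliberately trivial, hypothetical example (a 1-D world, a tabular transition model)—not Dreamer's architecture—showing only the shape of "collect a little real data, fit a model, plan in imagination":

```python
import itertools

# Toy 1-D world: the state is a position on a line, actions move it by +1 or
# -1, and the goal sits at position 5. All names here (step_real, WorldModel,
# imagined_return) are invented for illustration.

def step_real(state, action):
    """Ground-truth environment: every real step is the 'expensive' one."""
    return state + action

class WorldModel:
    """Learns 'what the environment becomes if I do X' from logged data."""
    def __init__(self):
        self.delta = {}  # action -> average observed change in state

    def fit(self, transitions):
        sums, counts = {}, {}
        for s, a, s2 in transitions:
            sums[a] = sums.get(a, 0) + (s2 - s)
            counts[a] = counts.get(a, 0) + 1
        self.delta = {a: sums[a] / counts[a] for a in sums}

    def step(self, state, action):
        return state + self.delta[action]

# 1) Collect a small amount of real experience.
data, s = [], 0
for i in range(20):
    a = 1 if i % 2 == 0 else -1
    s2 = step_real(s, a)
    data.append((s, a, s2))
    s = s2

# 2) Fit the world model on the logged transitions.
model = WorldModel()
model.fit(data)

# 3) 'Dream': score whole action plans inside the model at near-zero cost.
def imagined_return(model, plan, start=0, goal=5):
    s = start
    for a in plan:
        s = model.step(s, a)
    return -abs(s - goal)  # closer to the goal is better

best = max(itertools.product([-1, 1], repeat=5),
           key=lambda p: imagined_return(model, p))
print(best)  # (1, 1, 1, 1, 1): walk straight toward the goal
```

The real environment is touched only 20 times; all 32 candidate plans are evaluated inside the model for free, which is the whole point of "falling ten thousand times in imagination."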
Dreamer's positioning is different from Genie. Genie focuses on "environment generation"—generating diverse interactive 3D environments from text or image prompts, allowing users to navigate and explore inside. Dreamer focuses on "agent training"—using reinforcement learning to train agents to complete specific control tasks within an accurate world model.
The technical differences between the two are clear. Hafner pointed out in the Dreamer 4 paper that Genie 3 only supports camera actions plus a generic "interact" button, while Minecraft requires a full mouse-and-keyboard action space. Genie can generate diverse scenarios, but "still has difficulties in learning precise physics of object interactions and game mechanics." Dreamer's advantage is accurate physical prediction—it genuinely learned to break blocks, use tools, and interact with crafting tables, among other game mechanics—as well as real-time inference on a single GPU.
This is why Hafner's research is closely related to video prediction. Video prediction is essentially learning a world model. If a model can accurately predict the next frame of a video, it "understands" the operating laws of that part of the physical world to some extent. To predict how an object moves, you must know its mass, friction, what the other side looks like (because it might rotate), how objects interact with each other, and how humans interact with objects. All this information can be extracted from video prediction.
The Dreamer series has iterated to the fourth generation, with each generation solving different problems.
The first three generations focused on online learning: learning from scratch through interaction with the environment, pursuing data efficiency and final performance. Through Dreamer 2, model-based algorithms learned quickly but plateaued at lower final performance, while model-free methods needed more data but reached higher ceilings. Dreamer 3 finally achieved both speed and strength without tuning hyperparameters. They validated it on the Minecraft diamond challenge: learning from scratch, under sparse rewards, to obtain diamonds, widely considered an AI milestone.
Dreamer 4 is the exact opposite, focusing on offline learning. In Hafner's words: "Wait, we already know how to do online learning; what about offline learning?" Sometimes interacting with the environment is dangerous, and all you have is a fixed human dataset. How strong a policy can you extract from it? They again validated on the Minecraft diamond task, but this time using only human data—and with only 1/100 the data of OpenAI's offline VPT agent.
Neither is a perfect solution; they just solve specific problems in isolated experimental settings. Naturally, these will be integrated in the future.
Architecture is Unimportant, These Four Things Matter
Hafner has a counterintuitive judgment:
Almost any architecture can take us to AGI. A Transformer can reach AGI; so can an RNN. The difference is computational efficiency and fit to current hardware: an RNN trains slower, infers faster, and might need a larger model to compensate for architectural bottlenecks, but ultimately both can get there. The debate over Transformer vs. Mamba vs. SSM architectures is therefore, in Hafner's view, more about efficiency than about anything fundamental.
So what is important? Hafner listed four things:
compute, objective functions, data, and the details of RL algorithms; for example, long-horizon credit assignment needs to work better than in basic RL. Architecture is just the container that carries these.
Another related judgment:
"The question of whether LLMs can take us to AGI is already outdated." Why? Because the frontier models deployed today are no longer pure LLMs: they have image understanding, image generation, and video understanding, with video generation about to be integrated. Discussing "the limitations of LLMs" is like discussing "whether cars can fly": cars can't, but cars with wings can.
So what is still missing for AGI? Hafner pointed out several specific capability gaps.
Long-context understanding. Current models claim million-token contexts, but for video that is far from enough; video simply produces too many tokens. And even with a long context, models cannot yet truly retrieve from and reason over the whole of it. Possible directions include hybrid retrieval models, learning state representations alongside attention, and associative memories that behave like attention without having to revisit the full history. Hafner noted that many cool ideas existed before the Transformer, but the timing was wrong: "what was important then was not long-term memory or fancy addressing mechanisms, but scaling up and computational efficiency."
Reasoning beyond human capabilities. It's easy to learn reasoning from humans, but then you're locked in by human capability limits. AI systems should be able to discover reasoning methods themselves. This means extracting abstract concepts from raw high-dimensional data (video, audio, human life data, robot data) and then planning on these concepts. Hafner admitted: "I think we haven't mastered how to do this well yet."
Fundamental Limitations of In-Context Learning
This is an important but easily overlooked discussion in the podcast.
When you train a neural network, you optimize it against an objective function, and the more you train, the better it gets. In-context learning is a completely different mechanism. As Hafner put it: "You just hope the model learns to generalize in a way that looks like learning. But there is nothing in the system that makes it aggressively optimize any objective. It doesn't really strive to remember, doesn't really strive to understand patterns in the context."
We can train these capabilities into the weights by constructing clever training samples (forcing the model to solve puzzles, to remember things), but that remains a learned algorithm, which may never be as goal-directed as true optimization.
A possible direction is nested learning: letting part of the model learn the context quickly during inference, rather than discarding the context like GPT does after passing it. Hafner pointed out a fundamental problem:
"You can't optimize during inference, so no amount of pre-training can foresee what will be input during inference."He also mentioned the need for multiple learning time scales. Faster time scales train more efficiently, while slower time scales learn deeper things. He can imagine a general algorithm where you can say "I want k=5 learning time scales." There is no algorithm that truly works in this space yet, but it's an interesting direction.
One path exists already: if you have a million users, perhaps you can batch the interactions of 10,000 of them into one update, and the model will genuinely learn something deep. Today, the data generated by user interactions after GPT-4's release takes 1-2 years to influence GPT-5. Can that cycle shrink to days, or even seconds? Theoretically yes, but the challenges are huge: training large models is expensive, keeping an online-updated model safe is hard, and static models are easier to study and to patch for quirks.
These ideas—nested learning, multi-time-scale learning, continual learning—are largely inspired by neuroscience. Hafner mentioned an interesting observation from Hassabis's mentor Tomaso Poggio: in 2015, Demis thought building general intelligence was 80% neuroscience and 20% engineering; recently he updated that to 90% engineering. But Hafner believes that "since we have pushed engineering so far recently, the value of going back to neuroscience for intuition is actually increasing."
Scaling Discoveries: The Ceiling for Video Models is Still Very High
Hafner revealed some unpublished results: they scaled the world model to the scale of cutting-edge video models, and the results were good.
More importantly, his judgment:
The scaling ceiling for video models is at least an order of magnitude higher than for text models. Why? Because video contains far more information than text. Even top video models "are basically underfitting." Current video models have collapsed toward generating beautiful movie clips, but if the goal is to truly understand the physical world rather than just produce pretty footage, there is enormous room left to scale.
Hafner said that the larger the model, the sharper every capability becomes. Take inventory prediction in Minecraft: a model that is too small gets it wrong. You can collect targeted data to patch that, but another way is to make the model 8 times larger, after which it naturally becomes very good at inventory dynamics. They also ran a full YouTube pre-training experiment (crawling large datasets, filtering for quality, training on them), and only then saw truly strong generalization benefits.
This echoes Hassabis's judgment. Hassabis says the world model might be the missing piece of the AGI puzzle; Hafner, from an engineering perspective, says we have tapped only a small part of that piece's potential.
However, Hafner also noted the limitations of world models. Trained only on human data, Dreamer 4 ran into counterfactual problems: human players never try to craft a pickaxe from the wrong materials (say, a wooden-pickaxe recipe filled with diamonds), so the world model doesn't know those recipes are invalid, and the RL agent exploits the gap. It goes through the motions of crafting, the world model says "okay, here's a pickaxe," even though that recipe doesn't exist at all.
The fix is 2-3 rounds of corrective data from real environment interaction, after which the problem disappears. An important dynamic emerges here: the RL agent will find every exploitable hole in the world model, then get feedback once deployed in the real environment, forming an adversarial game. Eventually the world model becomes robust and the policy becomes stronger.
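This exploit-and-patch dynamic can be caricatured in a few lines. Everything here (the action names, the optimistic default value) is invented for illustration; it shows only how one round of real feedback closes a hole the offline model never saw:

```python
# Toy sketch of the adversarial game above (all names hypothetical). The
# world model, trained only on human data, has an optimistic hole for an
# action humans never tried; the policy finds it, real feedback patches it.

REAL_OUTCOMES = {"craft_valid": 1.0, "craft_invalid": 0.0, "wait": 0.0}

class WorldModel:
    def __init__(self):
        # Learned from human play: humans never attempted the invalid recipe.
        self.outcome = {"craft_valid": 1.0, "wait": 0.0}

    def predict(self, action):
        # Unseen action -> optimistic guess (a free pickaxe, no materials
        # spent): exactly the kind of hole an RL agent will exploit.
        return self.outcome.get(action, 2.0)

    def update(self, action, real_reward):
        self.outcome[action] = real_reward  # corrective real-world data

def best_action(model):
    return max(REAL_OUTCOMES, key=model.predict)

model = WorldModel()
exploit = best_action(model)                    # agent finds the hole
model.update(exploit, REAL_OUTCOMES[exploit])   # one round of real feedback
patched = best_action(model)                    # hole gone, honest optimum
print(exploit, "->", patched)
```

Before the correction the policy prefers the impossible recipe; after one real interaction the model's prediction matches reality and the exploit disappears—the toy version of "2-3 rounds of interaction data and the problem is gone."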
In other words,
purely offline data can never be perfect about the real world; the system must interact with the environment to learn a true causal model.
Objective Function: An Undervalued Design Space
Hafner believes the objective function is an undervalued direction for improvement.
He divides objective functions into two categories. One is
preference-based (reward, inductive bias): specified by humans, not describable by any mathematical formula, and therefore learned from human feedback. The other is information-based (prediction, reconstruction, curiosity): getting the model to understand the data itself. Both categories have great room for improvement.
For text, next-token prediction can go far, but more can be done—like predicting multiple tokens at once, which makes the model more forward-looking.
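As a sketch of what "predicting multiple tokens at once" means as a training objective, here is a toy with one count-based "head" per future offset, standing in for the extra prediction heads a neural model would learn. The function names are illustrative:

```python
import math
from collections import Counter, defaultdict

# Minimal sketch of a multi-token objective: instead of modeling only the
# next token, keep one "head" per future offset and average their losses.

def fit_offset_head(text, offset):
    """Estimate P(text[i + offset] | text[i]) by counting."""
    counts = defaultdict(Counter)
    for i in range(len(text) - offset):
        counts[text[i]][text[i + offset]] += 1
    return {c: {t: n / sum(cnt.values()) for t, n in cnt.items()}
            for c, cnt in counts.items()}

def multi_token_nll(text, heads, pos, k):
    """Average negative log-likelihood over the next k tokens at `pos`."""
    total = 0.0
    for offset in range(1, k + 1):
        p = heads[offset].get(text[pos], {}).get(text[pos + offset], 1e-9)
        total += -math.log(p)
    return total / k

text = "abcabcabc"
heads = {off: fit_offset_head(text, off) for off in (1, 2)}
# The text is periodic, so both the +1 and +2 futures are fully predictable
# and the combined loss vanishes.
print(abs(multi_token_nll(text, heads, pos=0, k=2)))
```

A model trained this way is penalized for being wrong about the token after next, not just the next one—the "more forward-looking" pressure the objective is meant to add.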
For multimodal models, it's basically a patchwork of losses: the visual encoder uses a contrastive loss, text uses next-token prediction, image generation uses diffusion, and you have to balance them all. Hafner thinks there may be a way to unify everything, "making our lives simpler and ultimately getting better performance." Different losses benefit different modalities, but he does not see this as a fundamental trade-off; if it can be abstracted, the benefits can be shared across modalities.
For agents, short-term RL (within 1000 steps) is now stable, but end-to-end optimization of long-horizon tasks is not yet feasible, with errors accumulating at each time step. Exploration objectives, goal-reaching objectives, and general robust reward models—all lack good objective function designs.
Hafner's judgment: "The only thing missing is basically the objective function. You can say we don't have data, but in fact the data is there, and collecting it by hand is not difficult. What's really missing is the idea of how to build such a system. We've done so much scaling and data engineering that we're very good at them, and we shouldn't stop. But they are no longer the hard part; we have returned to the stage of working on algorithms."
The Division of Labor Between Pre-training and Reinforcement Learning
Pre-training learns knowledge from samples; it is efficient and suited to absorbing information. Reinforcement learning learns policies from rewards and is suited to optimization.
Hafner explained why RL acquires knowledge less efficiently than pre-training: learning from reward, the model must first guess a fact and only then be told whether the guess was right. That is far less efficient than absorbing information directly from samples.
But for optimizing policies, RL is irreplaceable. The key problem is that obtaining optimal control data is nearly impossible. Human data is not optimal; if you have contractors collect data, you might have to throw away 99% of it, and "optimal" also depends on horizon length—ideally you want optimality over a very long horizon. This is the value of RL: you don't need optimal data, you just let the model find better policies through trial and error.
Humans work the same way: we learn knowledge through observation (predicting what happens next) and learn skills through trial and error (reinforcement learning). Observation can also yield rough, imprecise skills, because the mental representations we use to predict what others will do resemble the ones we use ourselves, so we can generalize to imagining ourselves doing those things.
Significance for Robots: Two Waves of Impact
Hafner believes the impact of world models on robots will come in two waves.
The first wave is representation. Representations learned by video prediction models understand the physical world far better than current VLMs. Precise object positions and physical properties (how slippery is this plate? how tightly must I hold this cup so the tea doesn't spill? if I pick the cup up by the handle, how firm a grip keeps it from slipping?)—this information, crucial for control, comes as a by-product of video prediction.
Training policies from scratch requires a lot of data and yields narrow, fragile policies that only work in specific scenarios; pre-trained VLMs are better, but their representations were not built for physical-level world understanding. Imitation learning on representations from video prediction models has already produced massively better results.
The second wave is virtual training. With enough diverse pre-training plus fine-tuning with a small amount of robot data, the world model can simulate the performance of robots in any scenario. Hafner's original words were:
"You can, in the data center, let robots work in a million kitchens, make a million kinds of meals, all trained in parallel. You don't need to really rent a million Airbnbs, build a million robots, and ship them to various cities."There are still challenges in doing this on a large scale, but Hafner believes this is the second step change in the robot field. Dreamer 4's paper shows the complete recipe: add agent token to train BC strategy, then train reward model, then RL fine-tuning.
Regarding the timeline, Hafner gave an estimate:
Robots may make good progress toward a first version of practical general-purpose robot products within three to five years. Complex long-horizon reasoning may take 5-10 years to crack, but practical general-purpose robots don't need to wait for that.
This aligns with Hassabis's judgment. Hassabis said in the interview that 2026 will bring very interesting progress in robotics, and that DeepMind is pursuing ambitious projects with Gemini Robotics. The CNBC host was skeptical: many robots today are actually "puppets" teleoperated by people in control rooms (like Tesla's Optimus). But that is exactly why the world model matters: for robots to operate truly autonomously, they need to understand the physical world.
Why LLMs Produce Hallucinations in Edge Cases
Hafner has an interesting explanation involving the relationship between agents and the environment.
Agents converge to a distribution where they can reliably achieve goals and reliably predict what will happen. The system trains more on that data and allocates its model capacity to that distribution, so it rarely fails there. But it also starts to forget everything else.
Another way to build a stronger system is to make it bigger, train with more data, and expand this niche. But there will always be places on the edge of the distribution where the model fails, generalizes poorly, and produces hallucinations.
Hafner said: "I think this is what we are seeing now on LLMs—they are quite general and quite good on most things within the distribution, but they stumble, generalize incorrectly, and produce hallucinations on the edge."
Doing some online RL will help refine the system: if it produces hallucinations, the user is dissatisfied, and it gets negative reward, then it either learns the correct answer or learns to say "I don't know," eventually landing on a very solid distribution.
Summary
Putting Hafner and Hassabis's views together, there are several cross-validated judgments:
The world model is an important direction. Hassabis says it might be the missing piece of the AGI puzzle; Hafner is working on it at the front line and revealed that scaling it to frontier size works well.
The potential of video models is far from fully tapped. The scaling headroom is at least an order of magnitude beyond text, and even top models are still underfitting.
Architecture is not the bottleneck. Transformers and RNNs can both reach AGI; what really matters is compute, objective functions, data, and algorithm details. Frontier models are no longer pure LLMs.
In-context learning has fundamental limitations. Nothing in it truly optimizes an objective; breakthroughs may require learning during inference and multiple learning time scales.
Robots will make substantial progress in 3-5 years. There is no need to wait until the long-term reasoning problem is completely solved. The world model will promote this from two directions: representation and virtual training.
Hafner closed by saying the field is too interconnected for these threads to stay separate. Given the cost of training large models, it makes sense to train once and get a model whose benefits span domains. Agents are already part of frontier models, and video generation, though still separate today, may be folded into a powerful omni model sharing the same weights within a year.
Learning to reason seems conceptually challenging and may take 5-10 years. But practical things will appear faster than we imagine.