Editor | Yucheng
An in-depth interview with Google DeepMind's pioneering, "game-changing" AI researcher Mostafa Dehghani is here!
His résumé reads like a protagonist's script: from the renowned Universal Transformer and the Vision Transformer (ViT) to the natively multimodal Gemini series, and even the widely loved Nano Banana image generation model, his core contributions run through them all.
In this conversation, Mostafa Dehghani offers a rather contrarian judgment: the biggest bottleneck for AI self-improvement is not compute or model capability, but evaluation. Figuring out how much AI has actually improved has almost become a philosophical question: if you cannot measure it, you cannot improve it.
He further emphasized that the AI self-improvement loop must be built on a "grounded" feedback mechanism: it cannot run behind closed doors and must take in real-world external signals. Formal verification works excellently for math and code, but it is hard to extend to messy real-world scenarios, so the industry needs to build analogous "tight feedback loops" for those domains.
Meanwhile, Dehghani gave a two-phase answer to the "specialization vs. generalization" question: in the short term, expert models are an efficient way to probe the boundaries of what is possible; in the long run, generalization remains the necessary path to the ultimate goal of AGI.
Speaking of multimodality, Mostafa believes that because human language tends to describe anomalies rather than norms, language is a biased signal. Native multimodality is a shortcut for AI to absorb "common sense" such as physical laws and gravity, not a mere feature add-on.
When discussing the Universal Transformer, he explained that deep recurrence and parameter reuse (later dubbed "negative sparsity") is the mirror image of Mixture of Experts (MoE): MoE adds parameters without added compute, while recurrent loops add compute without added parameters.
As for the red-hot topic of agents, he also poured some cold water. For very long-horizon tasks, even if each individual step succeeds 95% of the time, the overall success rate of a 100-step task falls below 1%. What users directly perceive is therefore near-certain failure, which poses a huge challenge to building social trust.
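The arithmetic behind this compounding failure is a one-liner; a quick check using the figures from the conversation (assuming steps succeed or fail independently):

```python
def task_success_rate(step_success: float, num_steps: int) -> float:
    """Probability of completing every step of a multi-step task,
    assuming steps succeed or fail independently."""
    return step_success ** num_steps

# A 95%-reliable agent on a 100-step task finishes well under 1% of the time.
print(f"{task_success_rate(0.95, 100):.2%}")  # prints 0.59%
```

Per-step reliability has to be pushed extremely close to 100% (or errors must be recoverable) before long-horizon tasks become dependable.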
Interestingly, Dehghani almost missed a defining opportunity. In 2017, when he received an internship invitation from the team behind the Transformer, he nearly refused, thinking: "Everyone is working on LSTMs, why should I go work with a group of people studying this random architecture? That thing will definitely become obsolete." The internship he was all but forced into ended up changing the trajectory of his life.
When developing ViT, the research team tried all sorts of elaborate designs, and they all failed. What actually worked was the simplest, most brute-force idea: cutting the image directly into 16x16 patches.
Mostafa also revealed that inside DeepMind, the most stunning moments often come when an inference engineer walks past your desk and casually remarks: "By the way, I just sped up the model by 10 times."
Finally, on the question of what children should learn, Mostafa admitted that facing his one-and-a-half-year-old daughter, he cannot give good advice either. But he firmly believes that compared with being a pure subject-matter expert, having a big-picture strategic view and sustained influence is the key to staying competitive.
Below is the full content of this podcast. Enjoy!
Higher-Level Loops in AI Mean Self-Improvement
Matt Turk: What does a "loop" in AI actually mean? One of the hottest concepts in AI research right now seems to be "loops." So I think this is an interesting entry point. The idea is that the way models improve is no longer by getting bigger, but by thinking recursively. What does this specifically mean?
Mostafa Dehghani: This is definitely one of the most active top-tier fields that almost every lab is investing in. It operates on different levels. At the micro level, it's basically the loops we use in the architecture, or loops used at inference time for tasks like "test-time compute." At a higher level, it's basically the loops we perform on the development process of these models, which we usually call "self-improvement."
To put it simply, this is actually just a continuation of a decades-long trend. Think about classic machine learning: humans had to sit down and manually design features; you had to decide what the model should actually focus on. Then deep learning and neural networks came along and said, "Let's remove this link and let the model find the representations itself." This was a big deal back then; we somehow eliminated a huge human bottleneck and human bias. Then, we not only designed architectures but also started learning architectures. We no longer cherry-picked every training signal but expanded to data-driven approaches, letting the data speak.
Self-improvement and this loop in development are just the next step in the same direction. The core concept and significance lie in the fact that you are removing the human bottleneck and bias when improving these models. Now you not only don't need humans to manually construct features, but you also don't want humans involved every time the model needs to get better. I think this is the logic on the development side. So it's not brand new, but a new chapter of the same story. I think every time we remove human judgment from this process, we usually overcome a bottleneck. This self-improvement and development loop can be said to be doing this at the highest level (improving the model itself).
If you want to go into a more detailed loop level, we can discuss methods to increase model test-time compute, and how we make the model loop through its own processing for specific problems to refine and think. The most familiar form is the Chain of Thought, letting the model think through extra tokens. You can also think of different ideas, like letting the model increase computation for a specific problem. If I have some placeholder tokens, I can use them as a "read-write tape" to re-verify the work I've done, review the scheme or process I executed across steps, and understand where I went wrong and what to do next. There is even "negative sparsity," which means repeatedly using certain parts of the model multiple times. This new loop has also been proven very useful, mainly because it allows the model to invest more computation in difficult problems. This is self-improvement during inference.
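The "repeatedly using certain parts of the model" idea can be sketched as a weight-tied loop: one shared block applied more times for harder inputs. This is a toy NumPy stand-in, not an actual model; `shared_block` and `run` are illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# One shared block; reusing it K times adds compute but zero new parameters.
W = rng.normal(scale=0.1, size=(d, d))

def shared_block(h):
    # Stand-in for one transformer layer: a residual nonlinear update.
    return h + np.tanh(h @ W)

def run(x, num_loops):
    """Apply the same weight-tied block num_loops times: 'compute without
    added parameters', the opposite of MoE's 'parameters without compute'."""
    h = x
    for _ in range(num_loops):
        h = shared_block(h)
    return h

x = rng.normal(size=d)
easy_out = run(x, num_loops=2)   # few refinement steps for a simple input
hard_out = run(x, num_loops=12)  # more test-time compute for a hard input
```

Because the block is weight-tied, looping 2 times and then 3 more is identical to looping 5 times; the extra loops spend FLOPs on the problem without adding a single parameter.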
AI Building AI Has Already Happened in the Past Few Months
Matt Turk: You just mentioned a larger concept, which used to be more like science fiction, but now seems to be rapidly becoming reality, namely "Recursive Self-Improvement" (RSI). This seems to be a topic many people are talking about, and there will be some related papers published in the coming weeks. So, what exactly is RSI as a concept?
Mostafa Dehghani: It's interesting that you call it a scenario that looks like science fiction, where the model is actually improving itself. This is indeed true, because a few years ago, if you wanted to talk about this, you could only write a forward-looking paper at a conference and talk about it at a very high level. But if we look at what is actually happening now, it has largely already happened.
Most people don't realize that this has happened in the past few months. In almost every lab, new generations of models are largely built using the previous generation of models. I think this is the case everywhere now. It's not fully automated yet, but the direction is very clear, and it's easy to imagine us entering a fully automated territory. These models will self-improve and continuously learn from the world.
This relates to other concepts, like continuous learning, although we haven't reached the state-of-the-art level yet. But if someone comes over and says, "I have an idea to let the model calculate gradients on the fly and update its weights," this would sound very normal, no longer earth-shattering news. I think what is currently missing is long-horizon and full automation, and we are speeding towards that direction. Once we achieve full automation, we can close the loop of self-improvement. By then, the problem mainly becomes providing computational resources for the models to do what they want to do. As I said before, we just got rid of the human bottleneck for improving models, and I anticipate this development will bring another huge leap.
Matt Turk: People may have seen or heard of Karpathy's "auto-research" project a few weeks ago. Is that an example of this recursive loop?
Mostafa Dehghani: Absolutely. I think that was one of the early examples of models actually making smart moves on the research side. We've long seen them do well at improving the engineering part of the development loop, but research, which you might think requires intuition, or experienced researchers who have studied these models for a long time, was something models supposedly couldn't do.
But we have seen signs that key parts of that "secret to success" in researcher intuition are entering the development loop through models. It's hard to say now whether this means we will soon replace every genius researcher with these models, maybe so, but this is definitely a sign. We were a bit skeptical; you know, a few years ago we couldn't believe this would happen so early. It's very exciting.
Matt Turk: I want to confirm again to make sure the audience understands: we are talking about "AI building AI." A few months ago, if you talked to researchers, they would say "we are already using AI to build AI," but that usually meant using AI tools and inference models to generate ideas. But here we are talking about AI automatically updating itself in a recursive way, updating weights, which could lead to a drastic acceleration of progress. Do you think this largely depends on us, and is mainly an issue of long-horizon and more computational resources, right?
Mostafa Dehghani: I think so. That's one aspect. On the other hand, I'm not saying we will be able to fully automate these models soon; there are still many problems to solve. But looking at the direction, I can see how this will happen. It's hard, but very possible.
The Biggest Bottleneck in the AI Self-Improvement Loop: Evaluation
Matt Turk: So what are the obstacles? You mentioned compute. Is evaluation one of them? Because models need to understand the quality of the answer, right or wrong.
Mostafa Dehghani: 100%. Ultimately, you can only improve what you can measure. And getting evaluation results is very difficult. In the end, this almost becomes a philosophical question, not just a technical one. If you have a very capable team, if there is a concrete evaluation standard, they can usually make huge progress on the problem. But without evaluation, it's really hard to push forward.
In fact, we haven't even defined evaluation criteria that can measure "how close we are to achieving the self-improvement loop." The lack of this measurement makes progress in this direction harder to quantify. Although there are some surrogate indicators, like evaluating each step the model takes in this direction, or evaluating the model's ability to help itself improve within a specific framework. The difficulty of building an evaluation system also lies in the fact that the infrastructure required to run extremely complex evaluations is also very complex.
Interestingly, sometimes we have to figure out how to create an environment where models can run safely. For example, inside Google, how to make it safely perform all the work that a research engineer or research scientist can do? Because we don't have confidence yet that they will always do the right thing. Measuring how far they can push and how long they can persist is very difficult. Connecting all these points into an environment where a model runs, and running them efficiently, while bringing diversity to evaluation, is definitely one of the bottlenecks for progress.
Methods for Continuous Self-Improvement: Drawing from Formal Verification, Staying "Grounded"
Matt Turk: A few weeks ago, we discussed "formal verification" with Karina Hong from Axiom Math. From your perspective, is this a promising field? Can formal verification ensure the improvement loop continues?
Mostafa Dehghani: In my view, formal verification is one of the most powerful keys to unlocking self-improvement, but it is not the only key. For mathematics and code logic, it is excellent. You can run a proof, and it either passes or fails. But if you enter other messier domains, you cannot judge whether a doctor's advice is good through formal proof.
Therefore, extending formal verification to every domain of the real world is not easy. But a very relevant question is how we learn from formal verification to build that same tight, honest feedback loop for the messy parts of the real world. That is the inspiring part: building on formal verification methods to reach domains that are not easily verifiable. You need some kind of clear, tight feedback loop to make progress.
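The pass/fail character he describes (a proof either checks or it doesn't) can be sketched as a filter loop, with a toy arithmetic check standing in for a proof checker or unit-test suite; `verifier` and `generate_candidates` are illustrative stand-ins, not any real API:

```python
def verifier(candidate: int, target: int = 36) -> bool:
    """Stand-in for a formal check: strictly pass/fail, no partial credit,
    like running a proof through a checker or code through a test suite."""
    return candidate * candidate == target

def generate_candidates():
    # Stand-in for sampling from a model; real candidates would be
    # proofs, programs, or answers.
    return range(-10, 11)

# The tight feedback loop: only verified outputs survive as training signal.
verified = [c for c in generate_candidates() if verifier(c)]
print(verified)  # [-6, 6]
```

The open problem he points to is constructing an equally crisp `verifier` for domains like medical advice, where no mechanical pass/fail check exists.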
Matt Turk: This is like the problem reinforcement learning faces. Once you deviate from math and code, you enter a very messy domain. So is "model collapse" a concern?
Mostafa Dehghani: Model collapse is definitely a risk. I would say model collapse mainly happens when the loop is completely closed. If you have no external signals, just the model talking to itself, or running in a constrained environment, then there is a high probability the model will collapse. But if you have a strong verifier, or some kind of reality reward signal that can anchor AI-generated data, it becomes very powerful. The key here is to stay "grounded," anchored to real things, so you can mostly avoid model collapse.
Matt Turk: To make sure everyone understands, can you first define what model collapse is?
Mostafa Dehghani: Simply put, it's when the data and environment a model interacts with are designed by another model. Then you become very good at that specific part, but suddenly, you lose the ability to generalize to anything else. This is a definition or case of model collapse.
Specialized Models Are Stepping Stones to Generalized Models
Matt Turk: You mentioned losing "generalization ability." In the concept of RSI, is this a worrying issue? That you either have self-reinforcing loops but they are very narrow, or you have more general models but lose the loop advantage?
Mostafa Dehghani: This is an interesting question: generalization or specialization. In the long run, you want a model that knows everything, and knows when to go deep and when to go broad. Imagine an agent, if it's a programming agent, it's extremely strong at every step of operation, a very excellent programmer. This is great, very specialized. But for many programming problems, you need some planning, understanding the status quo, gathering information, and making decisions based on context. After you define the steps, super strong specialization will kick in. Before that, being a generalist is very useful.
Generalization is the necessary path to reach the ultimate goal of AGI. But in the short term, building expert models may be the fastest way to learn "what is truly possible." In many cases, these specialized models are becoming stepping stones to generalist models. You can imagine, if I am thinking about self-improvement, I need to ensure I can succeed in a specific area (like coding). If successful, then consider how to broaden. I often say, people don't care what category their problem belongs to. If humans call something a "problem," AI should be able to solve it. This is the fundamental requirement of a generalist. So ultimately, you need generalization. The trade-off between general and specialized is more about long-term and short-term, and how to leverage the advantages of each side in the process.
Matt Turk: What do today's specialized models look like? Is it a standalone model, or a general large model trained in a specific way through reinforcement learning (RL)?
Mostafa Dehghani: Before, we were limited by computational resources. If you wanted to push a model up, we would choose specific dimensions and allocate compute to make it an expert in that field. This is the trade-off when the compute budget is limited. As compute becomes cheaper and more accessible, we may instead be limited by data.
Another trade-off appears in the post-training phase. Sometimes it's hard to make the model perform well in all domains. You try to make it good at multimodality, and find it regresses in code; you make it good at code and multimodality, and it's slightly worse than previous models in math and reasoning. This is because post-training causes a bit of "overfitting." Post-training is essentially trying to fit it to the best local optimum you have. When the problem becomes "how to find the best local optimum," since no solution is perfect for everything, you have to make choices.
For example, some companies focus heavily on code, which is easier to achieve than competitors who want to build an all-around excellent model. In the short term, this is very effective because during development you don't have to worry about all dimensions, allowing researchers and engineers to free up energy to push one thing to the limit. Specialized models are about picking a specific axis to make the model look very excellent.
AI Begins to Self-Create, But AI Researchers Haven't Lost Their Jobs Yet
Matt Turk: The point you just made is very intriguing: people like Karpathy, or people like you, could all be automated in the future. If the world's smartest minds are automated and AI begins to self-create, what will happen? Will there be a point in time where no one knows how AI works anymore?
Mostafa Dehghani: This part is very philosophical. I don't know. Let me share a thought I had a few days ago: I have a one-and-a-half-year-old daughter. Over the past few years I've had strong feelings about this, and interestingly, my predictions on the timeline have been proven wrong multiple times. Sometimes I say something will happen in 6 months, and it doesn't; sometimes I feel something is too hard, absolutely impossible to solve in 10 years, and then boom, two or three months later someone has a genius idea and solves it.
It's really hard to predict the future. Speaking of researchers like Karpathy, I'm thinking about the next generation. If my daughter asks me later: What should I learn? What major do you recommend? Which scientific branch should I delve into and become an expert? I really don't have a good answer.
What I do know is that there are a few skills that may be key to influencing the world and staying competitive. One of them is strategic vision, being able to put all parameters on the table when making decisions. In the near future, being an absolute expert in a very specific subject may no longer be that useful. I think Karpathy's talent lies not in him being a good programmer (of course he is), but in his excellent global view. By placing himself in the information flow, he can decide what the next most impactful thing is. The way he generates influence now is completely different from 5 years ago. I think he can continue to do this. What will he do in 5 years? I don't know, but I know he's smart enough to figure out how to continuously impact the world. So AI researchers haven't lost their jobs yet; hopefully, we are smart enough to cope.
Data Work May Shift to "Building Environments"
Matt Turk: This is a macro question. If AI continuously self-creates, does data still matter in that equation? Or is it all about compute?
Mostafa Dehghani: The concept of "data" will be broader than "tokens." If you view data as anything the model can get signals from—whether it's next token prediction in pre-training, or super complex environments where models interact and get signals—the value of data will not disappear.
I think data work may shift to "building environments," or ensuring these models can interact with the physical world and get feedback. This becomes: how do I provide more "grounding" opportunities for these models? They are good at self-improvement, but only if I let them touch real-world data and environments. Providing data will become: how do I give this model information it has never touched?
Let me mention a somewhat sci-fi idea: how do I let AI touch "smell"? There isn't a good way yet. But for humans, because we have all our senses, acquiring information is very easy. I'm sitting here, knowing how hard the chair is, what the room temperature is. All this sensory information converges to me, and the next word I see is based on all these inputs. Providing this sensory information to self-improving models is a hard problem. So data work will shift to making this sensory information more accessible, so that models can truly improve themselves in more effective ways.
Model Research Will Still Swing Between Pre-training and Post-training
Matt Turk: The big theme of the past year has been the joint acceleration of post-training and pre-training. Where do you expect progress to come from in the coming months?
Mostafa Dehghani: It depends on when you ask this question. Clearly, we will swing back and forth between pre-training and post-training. Ultimately, pre-training is still the foundation; you can never rescue a bad base model through post-training. But currently, the return on investment for post-training is very strong. A few months ago, I started participating in Gemini's post-training (mainly in code and agent directions). I can see how a small genius idea can make the model 10 times better in behavior at a fraction of the pre-training cost.
On the other hand, in Google DeepMind (GDM), many exciting research efforts are being put into the pre-training end—new recipes, new ideas. I think what we do in pre-training will unlock many downstream possibilities. Post-training is just a different mode of operation for me, although I've just started on this part. But I always expect an alternation between the two.
Matt Turk: Your view on pre-training seems to refute the "pre-training is dead" rhetoric from a few months ago.
Mostafa Dehghani: I think everyone has ideas on pre-training. Whether to realize that idea or not depends on complexity and expected return. Sometimes you feel some fruit is easier to pick.
I have a pre-training scheme on hand that is simple, elegant, and highly scalable. I plan to push this scheme first, then shift energy to the post-training phase. At some point, the base model itself becomes the bottleneck, and then you'll be happy to adopt that complex scheme and bring it into pre-training, and then continue to push it.
As for the saying "pre-training is dead," I feel that talking about "old" and "new" is often very subtle, because the definition of time span is very subjective. So when I say "old," I might mean something from two weeks ago. But the way we did pre-training a year or two ago has indeed seen diminishing returns. However, I can see new ideas injecting new vitality into pre-training and suddenly opening a door to strange new territories, which could completely change the capabilities of base models over time.
The "Common Enemy" of Self-Improvement and Continuous Learning Is Models with Frozen Weights
Matt Turk: So, there will definitely be a lot of exciting stuff when Gemini 4 releases. You mentioned continuous learning earlier, which is also one of the hot topics people have been discussing. Can you help us define what continuous learning is? To make this conversation educational for a broader audience. Maybe contrast it with the "self-improvement loop." Although they are two different things, please help us understand the difference.
Mostafa Dehghani: They are indeed related but different. Self-improvement is about the model getting smarter over time, improving its own capabilities, done autonomously by the model. Continuous learning is mainly about how the model stays "up-to-date." Imagine a doctor who constantly reads new research results and updates their knowledge reserve, striving to ensure knowledge doesn't become outdated.
The "common enemy" of self-improvement and continuous learning is models with frozen weights. As the world turns, if your model weights are frozen and the world moves forward, then you can neither achieve self-improvement nor continuous learning. Continuous learning focuses more on ensuring that when the world generates new knowledge, the model's knowledge cutoff doesn't stay in the past. So it's continuously updated. For example, overnight, all news and changes happening in the world will be synced. So if you ask the model a question today, that very fresh knowledge already exists in the model's weights, and it doesn't need to rely on external sources to get it.
This is hard, really, really hard. One big problem is Catastrophic Forgetting. That is, when you let the model learn new information after completing the main training phase, you suddenly find it regresses on old knowledge learned in the main training phase. This is currently a very active research area.
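Catastrophic forgetting is easy to reproduce even in the smallest possible setting. A toy sketch, assuming a one-parameter linear model trained with plain SGD (all numbers purely illustrative):

```python
import numpy as np

def train(w, xs, ys, lr=0.1, steps=200):
    """Plain SGD on a one-parameter linear model y = w * x."""
    for _ in range(steps):
        for x, y in zip(xs, ys):
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def mse(w, xs, ys):
    return float(np.mean((w * xs - ys) ** 2))

xs = np.array([1.0, 2.0, 3.0])
task_a = 2.0 * xs    # "old knowledge": y = 2x
task_b = -3.0 * xs   # "new knowledge": y = -3x

w = train(0.0, xs, task_a)           # learn task A to near-zero error
loss_a_before = mse(w, xs, task_a)
w = train(w, xs, task_b)             # naively keep training on task B
loss_a_after = mse(w, xs, task_a)    # error on task A explodes: forgetting
print(loss_a_before, loss_a_after)
```

Nothing in plain gradient descent protects the old solution: optimizing for task B simply walks the weights away from task A, which is why continuous learning needs mechanisms beyond "just keep training."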
Matt Turk: So what is the current status of continuous learning? Is it already built into existing systems, or is it not at that stage yet?
Mostafa Dehghani: This can be viewed from two aspects. On one hand, I think the research hasn't reached the level of "this is the ultimate solution, I just need to develop and push to production." Basically, every time you encounter a key new problem, you go through an exploration phase. People try different ideas, jumping from one point to another possibly completely different point. When you have confidence that this method works to some extent, you enter the "exploitation" mode, refining it and pushing it to the limit. We'll scale for this, develop infrastructure, increase speed, and achieve productionization.
I think we haven't reached that point yet. On the other hand, as I said, because we've never had a very confident continuous learning solution, investing in infrastructure and building high-speed systems in this situation is very hard. That said, I've seen very significant progress in this regard inside Google DeepMind. It's interesting because it can be very theoretical. I've seen people doing pure theoretical research get involved in this problem, having fun, and generating a lot of impact. Although the progress made is impressive, I don't think there's one idea yet that everyone universally agrees is "it, let's just do this."
The Birth of Universal Transformer: Parameter Reuse and Deep Recurrence → Test-Time Compute → Adaptive Computation
Matt Turk: Great. I want to talk about you and your background. Can you take a few minutes to tell your story? How did you start this work? What was your journey into the AI field, and how did you join Google DeepMind?
Mostafa Dehghani: I got my PhD in machine learning from the University of Amsterdam, mainly researching language models, text, and search and retrieval. As for what prompted me to really want to enter the mainstream and become one of the group striving for progress, it was because I did several internships in 2016 and 2017. Interestingly, in early 2017, I interned at Google Brain, and that experience was amazing. The team I joined was researching summarization with LSTM. Summarization was one of the most interesting problems back then. I was stunned; I thought: "This is so cool, I just want to do this for the rest of my life. This is it."
So I received an offer to return for an internship later that year. The recruiter told me there was a team that just published a paper, you might have heard of it, called Transformer, and they were recruiting interns. I remember chatting with Łukasz Kaiser. Łukasz enthusiastically described to me how to build an algorithmic machine based on Transformer. After chatting, I started messaging the recruiter saying: "I'm not sure I want to go to that team. What they're doing feels random. Everyone is doing LSTM, why would I go work with a group of people studying Transformer, this random architecture? That thing will definitely become obsolete."
In the end, the recruiter couldn't find me another team, so I joined them as an intern anyway. That changed my life. Working with this group of super smart, brilliant people, who firmly believed in their vision and direction while almost everyone else was excited about other things, was very inspiring. Later, we turned that "algorithmic machine" idea into the Universal Transformer paper, which is where the concepts of deep recurrence and parameter reuse originated. Nearly 10 years later, it still has huge influence.
Matt Turk: Tell us briefly about that. That was 2019, right? You are a co-author of that paper, and the idea in that paper fits very well with the loops and recursion we mentioned at the beginning of our conversation.
Mostafa Dehghani: "Universal Transformer" was written in 2018; I remember it was even rejected by a conference once. Later it was accepted in 2019, I don't remember exactly where, maybe ICLR, but it was rejected by NeurIPS or something before. The core intuition is that parameter reuse and letting the model process its own output again has some value. Basically, you generate something, then pass it back to the model again, giving the model a chance to process it again. I remember Łukasz had a dataset he called "algorithmic tasks" back then.
That was part of the TensorFlow-based codebase, called Tensor2Tensor. The code is still there; I can even find the merge request where I submitted the Universal Transformer code. We found that when dealing with some problems—like copying input to output, or dealing with algorithmic tasks with super long inputs—this is extremely difficult for ordinary models (like ordinary Transformers), performing very poorly, but can be perfectly solved through loops. I remember we used Meta's bAbI dataset, and the performance was also very good.
Then the idea of "test-time compute" appeared: you train with a fixed amount of compute, but at test time, you can release the model's potential, letting it invest more compute (FLOPs) for the input. We were very excited about this. Eventually, we introduced Adaptive Computation into it, which was actually inspired by Alex Graves' paper on LSTM. That was a very interesting journey. We were pursuing something that sounded avant-garde, but I guess the whole field was too focused on how to use adaptive computation to reduce the cost of simple problems back then.
But now we know that you can actually use it to increase the computational cost of difficult problems. This is actually two sides of the same coin. Because we were resource-constrained back then, we were always thinking: why spend so much compute running through all layers? If the end of a sentence is just a period, do we really need to run all 24 layers? How can we reduce computation? But now we have a different perspective: for example, for a physics problem, to run inference, we might be willing to let it run for two weeks. So how to increase computation?
It's really fun to work with these geniuses. This kind of deep recurrence and parameter reuse, or what some later called "Negative Sparsity," is a very good concept. This can link it well with Mixture of Experts (MoE). In MoE, you have "parameters without added compute." In loops, you have "compute without added parameters." You don't need extra parameters to invest extra compute on the same problem. This goes in the opposite direction of sparsity and is very effective. I think people are realizing this, and we see a lot of exciting progress in this direction.
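The MoE-vs-recurrence contrast can be made concrete with back-of-the-envelope accounting. This is a sketch under stated assumptions: top-1 routing for the MoE, roughly 2 FLOPs per parameter per token for a dense pass, and `moe`/`recurrent` as illustrative helpers, not any real API:

```python
def moe(num_experts, expert_params):
    """Top-1 MoE: parameters scale with experts, per-token compute doesn't,
    since each token is routed through only one expert."""
    params = num_experts * expert_params
    flops_per_token = 2 * expert_params  # one active expert, ~2 FLOPs/param
    return params, flops_per_token

def recurrent(block_params, num_loops):
    """Weight-tied loop: per-token compute scales with loops, parameters don't."""
    params = block_params
    flops_per_token = 2 * block_params * num_loops
    return params, flops_per_token

p = 1_000_000
print(moe(8, p))        # (8000000, 2000000): 8x the parameters, 1x the compute
print(recurrent(p, 8))  # (1000000, 16000000): 1x the parameters, 8x the compute
```

Eight experts give 8x the parameters at constant per-token compute; eight weight-tied loops give 8x the per-token compute at constant parameters, which is exactly the "opposite direction of sparsity" he describes.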
The Birth of ViT: Slice Images, Feed Them to a Transformer, Scale Up
Matt Turk: Fascinating. Another fundamentally important contribution you made in this field is in vision. How did the Vision Transformer change AI? In 2020, we saw that Transformer paper, titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." Can you walk us through what that is?
Mostafa Dehghani: That also has an interesting story; that paper is how I entered the vision and multimodal field. I had never worked on vision problems before. I got into it mainly because my desk was right next to colleagues doing vision research, and through conversations with them I became interested and thought it was fascinating. I remember at the time I was working with Aakash and others on the project externally known as PaLM, and I kept wondering: why do we have 400-billion-parameter language models, while the largest vision models (like ResNets) have only around 100 million parameters? Why doesn't scaling pay off in vision?
Mostafa Dehghani: I started exploring with colleagues: maybe there is something about the Transformer that makes it scalable; maybe we can drop convolution and try. Honestly, I don't want to claim that's the only path to scale; if a group of people had spent enough time on convolutions, maybe they could have made them equally scalable and excellent. But adopting the Transformer had another benefit: at that time, everyone in machine learning working on language was using this architecture, building infrastructure for it, and making it faster; in the short term, even hardware was sometimes designed around it.
So we started pushing. I remember we had many ideas, like "what if every pixel is a token?", but that would be too costly; the sequence would become extremely long. We went back and forth. Interestingly, we initially approached the problem from very complex angles and tried to emulate convolution to make it work. In the end, some of my colleagues in Zurich tried a simple idea: what if we just divide the image into 16x16-pixel patches, treat each patch as a token, and forget about overlapping patches, windows, and other complex designs?
Just like that: chop up the image, feed it to a Transformer, then start scaling up, training the model on lots of data, beginning with discriminative tasks. It worked. That surprised us all a bit, because what we had been imagining were fancy, complex convolution hybrids, yet what actually worked was this simple recipe: slice, feed to a Transformer, scale up. Boom! An excellent representation learning model was born.
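The patching step Dehghani describes can be sketched with plain NumPy. This is a minimal illustration, not the actual ViT code; `patchify` is a name chosen here, and the 224x224 image size matches the common ViT input only as an example.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches:
    one token per patch, with no overlap, windows, or convolutions."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14x14 patches, each a 768-dim "word"
```

Each row is then linearly projected and fed to a standard Transformer encoder, exactly as with word embeddings in language models.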
Matt Turk: To recap at the highest level, this basically means you can apply the Transformer architecture to images. In the past, these were two different families: text belonged to the Transformer world, images belonged to the CNN (Convolutional Neural Network) world. Your breakthrough proved that Transformer can also scale well to images, which basically paved the way for today's Gemini 3, because it's a natively multimodal model. Is that fair to say?
Mostafa Dehghani: Yes, exactly. Building on this, we took the next step and had video and audio adopt the Transformer too. Even if it isn't the only architecture that could achieve multimodality, it made natively training these models very simple, because a single architecture can cover all modalities during training.
What Excites Mostafa Most About Native Multimodality: Glimpsing Transfer Between Modalities
Matt Turk: This perfectly transitions to your work on the Nano Banana team and the future of image AI. You are a member of the Nano Banana team, and the product went viral after release, which must have been fun. After that, there were several releases: Nano Banana Pro in November 2025, and Nano Banana 2 released a few weeks ago, which is Gemini 3.1 Flash Image. Many people think image generation is like a translator: AI reads the prompt, translates it into painting instructions, and then paints. But as we said, Gemini is natively multimodal. So how does it work? How does the model simultaneously handle text and pixels to build an image?
Mostafa Dehghani: The reason I entered the generation field... by the way, I'm not an expert in image generation. When I first joined, I would sit in meetings where people talked about computer graphics and all sorts of old-school intuitions, and I couldn't understand a word of it. All I knew was how to train Transformers and scale them up; if that helps, I can contribute.
Working with these incredibly smart people is very interesting. The reason I'm excited is that I'm interested in "positive transfer" between modalities. When you think about native multimodality, on one hand it's about adding capabilities: the model can understand images, videos, audio, and text, and can also generate all these modalities. From a product perspective, that's great. But for me, the most exciting part is whether we can glimpse transfer between modalities.
For example, if I train a model to be good at generating images, will it become better at generating text? There's an old concept in the linguistics literature called "reporting bias." Say you go to a friend's house and see a banana-shaped sofa. When you get home, you're far more likely to mention that sofa than a normal one. You'll tell a friend: "I went over there, and their sofa is shaped like a banana, so interesting." But if the sofa were ordinary, it would be strange to bring it up at all.
This is language's reporting bias: language doesn't discuss things at the center of the distribution, the mundane things. But if you have image input, that information is simply there; it doesn't need to be "reported." So acquiring world knowledge through language alone is inefficient. I'm not saying it's impossible, just inefficient. Take gravity: letting a model watch videos is a much easier way to grasp the concept than reading every textbook.
Introducing Multimodality Is a Shortcut to Making Models World Models
Matt Turk: Is this the concept of "world models" built into image representation?
Mostafa Dehghani: Exactly. You want these models to be world models, to understand this world. Teaching models through text alone is possible, but introducing multimodality is a shortcut. And the best way to learn a modality is to learn how to generate it. Gemini has been multimodal from day one; the reason we shipped image generation in version 2.5 rather than 1 or 2 is that it simply wasn't good enough before. Eventually we found a way to introduce native generation without regressing the model's other capabilities.
This is one of the things I'm most interested in. Unfortunately, clear positive transfer is very hard to observe: the trained model is excellent, but it's hard to point to something as concrete as "I trained on images and the text perplexity dropped." My hope is that multimodal training will eventually yield positive transfer across modalities.
The image-generation experts on the team have excellent taste in visual quality. Sometimes I think a model is great, send it over, and they say no. They can point out subtle differences between two images that look identical to me. That intuition is what made Nano Banana. But I kept thinking: what if we pushed this beyond traditional image generation?
Rather than building a "text-to-image translator," build a "thinking machine" for images. For example, with interleaved text-image generation, the model can think not only in text tokens but also in pixel space: it generates some text, then an image, then more text. This can be used for storytelling, like children's books.
Another thing that excites me is incremental generation. If you ask DALL-E or Imagen, these standalone models, to generate a scene with 50 details, they might fail. You can train a model that handles 55 details, but at 60 it will break; single-shot generation always hits a ceiling. With incremental generation, the model can add details one by one. You no longer expect a perfect image on the first shot; you expect the model to plan. It thinks: "Let me start with the large objects, because if I place the small objects first, the large ones might not fit." That planning avoids the single-shot bottleneck.
Behind Nano Banana 2's Fast Generation: Lightweight Models, Distillation Research, Inference Optimization
Matt Turk: Does this help with efficiency? Especially with Nano Banana 2, which is a Flash model and generates extremely fast. Who are the unsung heroes there?
Mostafa Dehghani: First, a caveat: I worked on the initial Nano Banana and the Pro version; the latest version was delivered by the team after I moved to post-training and agents. At a high level, part of what makes the model faster and more efficient is model size (the Flash version has a smaller parameter budget). Another part is that people spent a lot of time on distillation recipes, using distillation to make complex pipelines lighter.
Surprisingly, the infrastructure work for inference serving matters just as much. We have truly genius inference engineers. Sometimes you're sitting at your desk and one of them walks over and casually says: "By the way, I just made the model 10 times faster." It's incredible. These models behave differently from ordinary language models, and great engineers can optimize the inference pipeline specifically for them.
Critical Views on the AI Field: Continuous Learning Is Underrated, and AI Progress Is Outpacing the World's Supporting Mechanisms
Matt Turk: Near the end of the conversation, let's talk about a few sharp points. What are some current practices in the AI field that are wrong?
Mostafa Dehghani: It's hard to point to just one thing, and this is only my personal opinion, but I think we underestimate how hard it is to fix "jagged intelligence." People see a model solve complex math problems yet fail to count the letters in a word, and they usually just laugh it off. But I think it points to deep, unsolved problems in how these systems represent and process knowledge. It isn't a patchable bug; it's a structural feature of how the models learn.
Matt Turk: What are some underrated ideas in AI research right now?
Mostafa Dehghani: Continuous learning. As I said, ideas often stay in the exploration phase until we have the confidence to move to exploitation, and I think now is the time to push this one to production. Current foundation models are essentially frozen in time: training ends and the weights are fixed. All our RAG pipelines, fine-tuning processes, and retrieval systems are built on the assumption that the model is frozen. That assumption is too strong; we need to think much more aggressively about how to change it.
Matt Turk: Do you think RAG will disappear over time?
Mostafa Dehghani: It won't look like it does today, but I'm not sure it will disappear completely. RAG isn't just about bringing in fresh information; it's also in-context learning. There's a difference between knowledge in the model's weights and knowledge in its context. Maybe we won't trigger RAG for everything, but for long-tail information we'll still use it.
Matt Turk: What do you think people are overconfident about?
Mostafa Dehghani: People think pushing on the technology alone is enough, that as long as the model gets smarter, everything else will fall into place. In my view, a version of AI that is great on the technical side but has blind spots elsewhere cannot create meaningful progress for the world. Governance, regulation, social trust, distribution of access, institutions' capacity to absorb the technology: none of these are solved problems, and they're arguably harder than the technical part. Right now the pace of technical progress is clearly outrunning the world's ability to build the supporting mechanisms, and the gap is widening.
Directions Mostafa Is Currently Interested In: Very Long-Horizon Tasks, the "Grounding" Problem, Defining Intelligence
Matt Turk: One last question. If you started from scratch today, what would you research?
Mostafa Dehghani: I don't want to start from scratch (laughs), that's too hard. But I can tell you about a direction I'm very excited about: full automation of very long-horizon tasks. Current agent demos have a market, but people don't talk enough about the compound-reliability problem.
Imagine an agent that needs 100 consecutive steps to complete a task. If the success rate of each step is 95% (which is already very optimistic), then the probability of completing the entire task without a single error is 0.95^100 ≈ 0.6%.
This math is brutal. Long-horizon automation requires extremely high single-step reliability and error-recovery capability, which current systems do not yet have. And people don't experience a model's average performance; they experience its failures. One stupid mistake damages trust more than 100 things done right build it.
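The arithmetic above can be checked in a few lines. This is a toy model that assumes independent, identically reliable steps; `task_success_probability` is a name chosen here for illustration.

```python
def task_success_probability(p: float, n: int) -> float:
    """End-to-end success rate for n independent steps,
    each succeeding with probability p."""
    return p ** n

# 100 steps at 95% per-step reliability: well under 1% end to end.
print(f"{task_success_probability(0.95, 100):.4f}")  # 0.0059

# Conversely, a 90% end-to-end success rate over 100 steps requires
# per-step reliability of 0.90 ** (1 / 100), roughly 99.9%.
print(f"{0.90 ** (1 / 100):.5f}")  # 0.99895
```

The second number is the real message: reliable 100-step automation demands near-perfect individual steps or, failing that, the ability to detect and recover from errors mid-task.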
Additionally, I would study the "grounding" problem: how to build stable AI systems connected to the physical world. We have to go beyond statistical patterns in text and pixels. Finally, even defining "intelligence" itself is a practically important problem. We've been chasing smarter models, but the definition of intelligence is so vague that it's hard to measure real progress. We need a systematic way to define intelligence, clarify the goal, and then go full speed ahead.
Matt Turk: Mostafa, this was a brilliant conversation. Thank you very much for your time.
Mostafa Dehghani: Thanks for having me, it was fun chatting.
Reference link:
https://www.youtube.com/watch?v=Bo19sXssYXI