Latest Interview with Jeff Dean, the Soul of Gemini and Legendary Engineer: In the Future, Everyone Will Have 50 Virtual Interns, No Need for Experts Anymore!

"His resume is basically a timeline of AI" is how many people describe Jeff Dean, the core driving force behind Gemini and Google's Chief AI Scientist. From rewriting the full stack of Google Search in the early 2000s, to reviving trillion-parameter sparse models, to co-designing TPUs hand in hand with cutting-edge machine learning research, Jeff Dean has quietly shaped almost every layer of the modern AI technology stack. He has witnessed multiple scale revolutions: from CPUs and sharded indexes to multimodal models capable of reasoning across text, video, and code.

Recently, his sharp remarks during an in-depth conversation have sparked intense discussion. Many industry insiders exclaimed, "The information density is huge." In this interview, Dean offered numerous exclusive insights and highly forward-looking judgments.

"The era of unified models has truly arrived. The key is that models are becoming increasingly powerful, eliminating the need for domain experts." He stated that the future lies in combining specialized and modular models: depending on the scenario, a system could simultaneously access and invoke modules covering 200 languages, powerful robotics capabilities, or powerful medical capabilities. "Model knowledge is installable, just like downloading software packages."

As "one of the most prolific engineers in computer history," Dean also generously shared his current methods for writing code with AI, stating, "In the future, it is very likely that everyone will own 50 virtual interns. You can organize them into groups; you only need to interface with 5 groups and let them handle the work."

Furthermore, Dean detailed Google's internal "pushing the frontier" model and the thinking behind driving team architecture improvements and model capability upgrades. Beyond this, he proposed and deconstructed several intriguing questions: Why distillation is the core driver behind every Flash model breakthrough; why energy consumption, rather than compute power, is becoming the true bottleneck; how to co-design hardware and models 2–6 years in advance; and why the next leap will come not just from larger context windows but from systems that can "appear to be processing trillions of tokens."

Below is the conversation in detail, translated and condensed for our readers without altering the original meaning.

1. Next-Generation Models: Which Old Ideas Are Worth Reviving?

Shawn Wang: Today we have Google's Chief AI Scientist, Jeff Dean. Welcome. It is truly an honor to invite you. I have watched countless speeches of yours; your career is legendary. First and foremost, congratulations on securing the "Pareto Frontier."

Jeff Dean: Thank you. The Pareto Frontier is indeed excellent; it is great to stand in this position.

Shawn Wang: Yes, I think it involves both. You need to occupy the Pareto Frontier with top-tier capabilities while balancing efficiency and offering a range of models people actually want to use. Part of this stems from your hardware work, part from model work, and certainly a lot of accumulated secret sauce. Seeing all of this integrated so seamlessly is truly impressive.

Jeff Dean: Yes, exactly. As you said, it is not a single factor but a combination of the entire technology stack from top to bottom. All of these together enable Google to build extremely capable large models, while also using software techniques to migrate the capabilities of large models into smaller, lighter models. These smaller models have lower costs and lower latency but remain highly capable within their own scale.

Alessio Fanelli: Regarding holding the lower end of the Pareto Frontier, how much pressure do you feel? Many new labs seem to be desperately pushing the performance ceiling because they need funding, and so on. But you have billions of users. I remember something you discussed early on, back in the CPU era: if every Google user used a voice model for three minutes a day, you would have to double the number of CPUs. How is this discussed internally at Google now? How do you balance "pushing the frontier" against "must deploy"?

Jeff Dean: We always hope to have frontier models, or even push the frontier, because only then can we see new capabilities that did not exist last year or six months ago. But at the same time, we know that while these top-tier models are useful, for many broader scenarios, they are too slow and too expensive. So our approach is to pursue two lines simultaneously: one is high-capability, low-cost models supporting low-latency scenarios, making it easier for everyone to use them for tasks like agent programming; the other is high-end frontier models for deep reasoning and solving complex mathematical problems. The two are not mutually exclusive; both are useful. Moreover, through distillation, a key technology, you must have a frontier model first to distill capabilities into smaller models. So it is not either/or, but complementary.

Alessio Fanelli: You and Geoffrey Hinton proposed distillation back in 2014.

Jeff Dean: Don't forget Oriol Vinyals was on that paper as well.

Alessio Fanelli: That was a long time ago. I am curious, how do you view the iteration cycle of these ideas? For instance, ideas like sparse models, how would you re-evaluate them? In the next generation of models, which old ideas are worth picking up again? You have been involved in many ideas that later had a huge impact, but at the time, it might not have been obvious.

Jeff Dean: The original starting point for distillation was that we had a very large image dataset, 300 million images. We found that if we trained specialized models for different image categories—for example, one focusing on mammals, another on indoor scenes—pre-training on broader images first, and then fine-tuning on clustered categories with augmented data, the results would be much better. But treating these 50 models as a large ensemble model was not practical for actual deployment. Thus, the idea of distillation emerged: "compress" these independent expert models into a form that can be practically deployed. This is essentially similar to what we do today, except now we no longer use an ensemble of 50 models, but first train a super-large model and then distill it into a much smaller model.

Shawn Wang: I was also wondering if distillation is related to innovations in reinforcement learning? Let me try to express this: Reinforcement learning allows models to leap forward in certain parts of the distribution but may suffer losses in other areas; it is a somewhat unbalanced technique. But perhaps distillation can "bring it back." The general expectation is: improve capabilities without regressing elsewhere. This lossless capability fusion, I feel, should be partially achievable through distillation, but I haven't fully figured it out, and there aren't many related papers yet.

Jeff Dean: I think a core advantage of distillation is this: you can take a very small model, pair it with a super-large dataset, traverse the data multiple times, and use the logit (soft probability) outputs of a super-large model to guide the small model toward behaviors it could never learn from hard labels alone. We observed that distillation allows small models to approach the performance of large models. For many people, this is the optimal balance point. Now that Gemini has gone through several generations, we can make each new Flash version match or even significantly surpass the previous Pro version. We will continue to do this because it is a very healthy direction.
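The soft-target idea Dean describes can be sketched in a few lines. Everything here is illustrative: the logits are made-up numbers and the temperature value is an assumption, but it shows how a teacher's softened distribution carries more signal than a single hard label:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 softens the distribution so that the relative
    # probabilities of the "wrong" classes (the dark knowledge) survive.
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for one example, 4 classes:
teacher_logits = np.array([6.0, 2.0, 1.0, -1.0])
student_logits = np.array([4.0, 1.0, 2.0, 0.0])

T = 4.0  # assumed temperature
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# Distillation loss: cross-entropy of the student against the teacher's
# soft targets. A hard-label loss would only see class 0 and discard the
# teacher's view of the remaining classes.
soft_loss = -np.sum(p_teacher * np.log(p_student))
print(p_teacher.round(3), soft_loss.round(3))
```
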

Shawn Wang: Dara previously asked: The earliest roadmap was Flash, Pro, Ultra. Have you always been using Ultra as the "mother model," distilling from it? Is Ultra that ultimate source?

Jeff Dean: We have many types of models; some are internal models, not released or deployed externally; some are Pro-level models, from which we can also distill Flash-level models. This capability is very important, and dynamic scaling during inference can also improve model performance.

Shawn Wang: Understood. And obviously, Flash's cost advantage gives it absolute dominance. The latest figure seems to be 50 trillion tokens; I can't remember exactly, it changes every day anyway.

Jeff Dean: Yes, hopefully market share is also moving up.

Shawn Wang: I mean from a cost perspective, Flash is so economical, it can be used in almost any scenario. It is in Gmail now, in YouTube, everywhere.

Jeff Dean: We are also using it in more and more search products, including various AI modes.

Shawn Wang: Oh my god, Flash has entered AI search mode? I didn't even think of that.

Jeff Dean: One of the major advantages of Flash models is not just lower cost, but also lower latency. Latency is actually very critical because in the future, we will have models do more complex things and generate more tokens. For example, instead of just asking it to write a loop, you ask it to write a whole software package. Being able to complete these with low latency is particularly important. Flash is one path, and our hardware platforms also support many service capabilities, such as TPUs, where inter-chip connectivity performance is extremely high, very suitable for long-context attention, sparse expert models, and similar technologies. These are crucial for scaled deployment.

Alessio Fanelli: Then, regarding distillation from Pro to Flash, will there be a tipping point, roughly lagging by one generation? I have a feeling: for many tasks, Pro is already saturated today; by the next generation, the same tasks will be saturated at Flash's price point. After two more generations, Flash will be able to do almost everything everyone needs. So when most users are satisfied with Flash, how do you convince the internal team to continue investing in pushing Pro's frontier? I am curious how you view this.

Jeff Dean: If the distribution of user needs were static, that would indeed be the case. But reality is often: the stronger the model, the higher people's expectations for it. I have experienced this myself: a year ago, I used models to write code; simple tasks were fine, complex ones were not; now we have made huge progress on complex code, so I let it do even harder things. It is not just programming; now you might ask it to analyze global renewable energy deployment and write a solar energy report. These are complex tasks no one would have asked a model to do a year ago. So you still need stronger models to expand boundaries, while also helping us find bottlenecks: where it fails, how to improve, making the next generation stronger.

2. "Include the Entire Internet in Context," Letting Models Process Trillions of Tokens

Alessio Fanelli: Do you use some exclusive benchmarks or test sets internally? Because every time what is published are those few benchmarks, rising from 97% to 99%. How do you drive the team internally: what is the goal we truly need to achieve?

Jeff Dean: Public benchmarks have their value, but their lifecycle is limited. When they first come out, they are difficult; models score only 10%–30% accuracy, and you can optimize all the way to 80%–90%. But once a benchmark reaches around 95%, the marginal returns are extremely low: either the capability has genuinely arrived, or the benchmark has leaked into the training data. So we keep a batch of non-public internal benchmarks, guaranteed absent from the training data, representing capabilities the model does not yet possess but that we hope it will have. Then we evaluate: do we need more specialized data? Architectural improvements? Model capability upgrades? How can we do better?

Shawn Wang: Can you give an example: a benchmark that directly inspired an architectural improvement? I happen to want to follow up on what you just said.

Jeff Dean: I think the long-context capability first introduced by the Gemini model, especially 1.5, came from this. That was our goal at the time.

Shawn Wang: At that time, everyone rushed in, and all the charts were green. I was thinking: how did everyone break through at the same time?

Jeff Dean: Benchmarks like Stack Benchmark were already saturated at context lengths of 1k, 2k, and 8k. What we are truly pushing is the frontier of 1 million and 2 million contexts, because that is where the real value lies: you can put thousands of pages of text or hours of video into the context and actually use them. "Needle in a haystack" searches are already saturated; we need more complex "multi-needle searches" or more realistic long-context understanding and generation tasks to measure the capabilities users truly need, not just "can you find this product ID."

Shawn Wang: Essentially it is retrieval: retrieval with machine learning. Let me push on a lower-level question: you see a benchmark and realize you need to change the architecture to solve it, but should you really change it? Sometimes that change is just an inductive bias. As Jason Wei, who once worked at Google, put it: you might win in the short term, but in the long term it may not scale, and you might even have to tear it out and redo it later.

Jeff Dean: I don't get too hung up on exactly what solution to use, but rather think clearly first: what capabilities do we actually need? We are very certain that long context is useful, but the current length is far from enough. What you really want is, when answering questions, to be able to include the entire internet in the context, right? But this cannot be achieved by simply scaling up existing solutions; the algorithmic complexity is quadratic. One million tokens is already the limit of existing solutions; you cannot achieve a billion, let alone a trillion tokens. But if you can create an effect where "the model can attend to trillions of tokens," that would be amazing, and the application scenarios would explode.

This is equivalent to being able to treat the entire internet as context, processing all pixels of YouTube videos, as well as the deep representations we can extract; not just a single video, but massive amounts of video. At the personal Gemini level, as long as you authorize it, the model can also associate all your personal status: your emails, photos, documents, flight information, etc. I think this will be very, very useful. The question is, how to make the model meaningfully process trillions of tokens through algorithmic improvements and system-level optimizations.
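The quadratic-complexity point above is easy to make concrete: doubling the context length quadruples the number of pairwise attention scores, which is why naive attention over a trillion tokens is astronomically out of reach.

```python
# Why naive attention cannot simply be scaled up: the number of pairwise
# attention scores grows quadratically with context length.
def attention_scores(n_tokens):
    return n_tokens ** 2

for n in (1_000_000, 1_000_000_000_000):
    print(f"{n:.0e} tokens -> {attention_scores(n):.1e} scores")
```

At one million tokens the score matrix already has 10^12 entries; at a trillion tokens it would have 10^24, which is why the "appear to attend to trillions of tokens" effect has to come from algorithmic and system-level tricks rather than brute force.
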

Shawn Wang: By the way, I ran the numbers once: if a person speaks non-stop for 8 hours a day, they generate at most about 100,000 tokens. That amount fits comfortably in today's context windows.
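Shawn's figure checks out as rough arithmetic. The speech rate and tokens-per-word ratio below are common ballpark assumptions, not numbers from the interview:

```python
# Back-of-envelope: tokens generated by one person speaking all day.
words_per_minute = 150   # assumed typical speaking rate
tokens_per_word = 1.3    # assumed typical tokenizer ratio for English
hours = 8

tokens_per_day = words_per_minute * 60 * hours * tokens_per_word
print(f"{tokens_per_day:,.0f} tokens")  # on the order of 100k
```
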

Jeff Dean: Correct. But if you want to understand the content of videos uploaded by everyone, that is a completely different magnitude.

Shawn Wang: And there is another classic example: once you jump out of text and enter fields with extremely high information density like proteins, the data volume explodes.

Jeff Dean: Gemini has insisted on being multimodal from the start. For many people, multimodal means text, images, video, audio—modalities familiar to humans. But I believe it is also very important for Gemini to understand non-human modalities. For example, LiDAR data from Waymo autonomous driving cars, robot sensor data, and various medical modalities: X-rays, MRIs, medical imaging, genomic information, etc. There may be hundreds of data modalities in the world; we must at least let the model know: this is a meaningful and valuable modality. Even if you do not train all LiDAR or MRI data into pre-training, adding even a small part is very useful, allowing the model to have a basic concept of such information.

Shawn Wang: Taking this opportunity, I want to ask a question I have always wanted to ask you: Is there a "King Modality," a modality that can encompass all other modalities? Take a simple example: vision can encode text at the pixel level; that DeepSeek OCR paper proved this point. And vision can also handle audio because it can be converted into spectrograms, which are essentially visual tasks. Speaking of which, is vision the King Modality?

Jeff Dean: Vision and dynamic temporal understanding are very important. Here, dynamic refers to video, not static images. There is a reason why eyes evolved independently 23 times through evolution; the ability to perceive the surrounding world is too critical, and this is exactly the capability we hope these models possess. Models need to be able to interpret what we see and pay attention to, and help us utilize this information to do things.

Shawn Wang: Speaking of dynamic understanding, I must praise one thing: Gemini is currently the only model on the market that natively supports video understanding. I often use it to watch YouTube.

Jeff Dean: Actually, many people have not truly realized the capabilities of Gemini models. I gave an example in a speech: give the model a YouTube compilation of 18 classic sports moments from the past 20 years, including Jordan's finals buzzer-beater, soccer goals, etc. You directly throw the video at it and say: "Help me make a table listing all events, occurrence times, and brief descriptions."

It can really extract information directly from the video and generate an 18-row table. Most people never think that a model can directly convert video into a structured table.

Alessio Fanelli: You just mentioned "including the entire internet in context." Google itself needed to do search ranking because humans cannot process all information on the web. The logic for large models is completely different: humans might only look at the first five or six search results, but for large models, should we give them 20 highly relevant pieces of content? How does Google think internally: how to create an AI mode that is broader and covers more than traditional human search?

Jeff Dean: Even before large models appeared, our ranking systems did this: there are massive web pages in the index, most of which are irrelevant. First, use lightweight methods to filter out a batch of relevant ones, for example, narrowing it down to 30,000 documents, then step by step use more complex algorithms and finer signals to re-rank, finally showing the user only about 10 results. The idea for large model systems will not differ much. It seems like you need to process trillions of tokens, but the actual process is: first filter out about 30,000 documents, roughly 30 million useful tokens; then carefully select 117 documents truly worth attention from them to complete the user's task.

You can imagine this system: first use lightweight models with high concurrency to filter out the initial 30,000 candidates; then use a slightly stronger model to narrow 30,000 down to 117; finally, use the strongest model to deeply understand these 117 pieces of content. Only such a system can create the effect that "the model can process trillions of tokens," just as Google Search is indeed searching the entire web, but ultimately only gives you the most relevant small portion.
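The three-stage system Dean sketches can be mocked up as a simple funnel. The scorers below are random stand-ins for real models of increasing cost; only the candidate counts (30,000, then 117) come from his description:

```python
import random

random.seed(0)

def funnel(candidates, stages):
    """stages: list of (keep_n, scorer) pairs, applied cheapest-first."""
    for keep_n, scorer in stages:
        # Score every surviving candidate, keep only the top keep_n.
        candidates = sorted(candidates, key=scorer, reverse=True)[:keep_n]
    return candidates

corpus = range(1_000_000)               # stand-in for the whole index
cheap  = lambda doc: random.random()    # lightweight filter, massively parallel
strong = lambda doc: random.random()    # stronger, more expensive re-ranker

shortlist = funnel(corpus, [(30_000, cheap), (117, strong)])
# A frontier model would then deeply read just these 117 documents.
print(len(shortlist))
```

The point of the structure is that the expensive model's cost is independent of corpus size: it only ever sees the final shortlist, yet the system as a whole behaves as if it searched everything.
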

Shawn Wang: I often tell people who don't know Google's search history that BERT was used directly in search almost as soon as it came out, and the performance improvement was very obvious. That has to be the most core system at Google.

Jeff Dean: The text representation brought by large models allowed us to jump out of the hard limit of "keywords must exactly match web pages," truly achieving topic and semantic relevance rather than literal correspondence.

Shawn Wang: I think many people haven't realized that large models have already taken over ultra-high-traffic systems like Google Search and YouTube. YouTube has a semantic ID mechanism where each token corresponds to a video, predicting videos with a codebook. At YouTube's scale, that is staggering.

Jeff Dean: Recently Grok has also been applied to explainability. Actually, even before large models were widely used in search, we had been moving away from the idea that whatever the user types must be matched exactly.

Shawn Wang: Have you sorted out the evolution journey along this path?

Jeff Dean: In 2009, I gave a speech at a web search and data mining conference, talking about the five or six generations of architectural evolution of Google Search and retrieval systems from around 1999 to 2004 or 2005. We never formally published a paper on that content. A key thing happened in 2001: we scaled the system in multiple dimensions. First, make the index larger to cover more web pages; quality naturally improves, as you can never search for pages not in the index. Second, scale service capabilities because traffic surged. We used a sharded architecture; if the index gets bigger, add shards, for example, from 30 shards to 60 shards, to control latency. If traffic gets bigger, increase replicas.

Later we did a calculation: a data center has 60 shards, each shard has 20 replicas, totaling 1,200 machines with hard drives. The memory of these machines added up was just enough to put the entire index into memory. So in 2001, we directly put the full index into memory, and performance took off. Before this, you had to be very careful because every query term had to trigger a disk seek on 60 shards; the larger the index, the lower the efficiency. But after the full memory index, even if a user only inputs three or four words, expanding them into 50 related words is no problem. You can add synonyms, searching for restaurant, restaurants, cafe, bistro all together. We finally began to understand word meanings rather than stubbornly sticking to the literal form of user input.

That was 2001, long before large models, but the idea was already there: relax strict literal matching and move closer to semantic understanding.
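The back-of-envelope reasoning behind the 2001 decision can be reconstructed roughly as follows. The per-machine memory and total index size are illustrative assumptions; only the 60-shard, 20-replica, 1,200-machine figures come from the anecdote:

```python
# Sketch of the 2001 in-memory index calculation.
shards = 60
replicas = 20
machines = shards * replicas            # 1,200 machines in the anecdote

mem_per_machine_gb = 2                  # assumed early-2000s server RAM
index_size_gb = 100                     # assumed total index size

# Each replica set holds one full copy of the index, spread over its shards,
# so the question is whether one row of 60 shards has enough combined RAM.
ram_per_copy_gb = shards * mem_per_machine_gb
fits = ram_per_copy_gb >= index_size_gb
print(machines, ram_per_copy_gb, fits)
```

Under these assumed numbers, each 60-shard row has 120 GB of combined RAM, enough to hold a 100 GB index entirely in memory, which is the shape of the calculation that made disk seeks disappear.
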

3. "Before Writing Lots of Code, Rehearse the Design Space in Your Head First"

Alessio Fanelli: What are your principles for designing systems? Especially in 2001, when the internet scale doubled or tripled every year, and now large models also jump significantly in scale and capability every year. Do you have any consistent design principles?

Jeff Dean: First, when designing a system, you must first grasp the most critical design parameters: how many queries per second must it handle? How big is the internet? How big should the index be? How much information to store per document? How to retrieve? Can it still handle it if traffic grows two or three times more? A very important design principle of mine is: design the system to be scalable by 5 to 10 times, but not more. Because once it becomes 100 times the scale, the entire design space will be completely different, and originally reasonable solutions will become obsolete. For example, moving from disk index to memory index only became feasible after traffic and machines were sufficient, suddenly opening up a whole new architecture.

I really like to rehearse the design space in my head before writing a lot of code. Going back to Google's early days, we were not only frantically expanding the index; the index update frequency was the metric that changed most dramatically. It used to be updated once a month; later we could update a single page within a minute.

Shawn Wang: This is the core competitiveness, right?

Jeff Dean: Exactly. For news-related queries, if your index is still from last month, it is completely useless.

Shawn Wang: News is a special scenario; couldn't you split it into an independent system at that time?

Jeff Dean: We did launch Google News, but when users enter news-related keywords in the main search, they must also get the latest results.

Shawn Wang: So you also need to classify pages, judging which pages should be updated frequently and at what frequency.

Jeff Dean: Behind this is a whole set of systems used to decide the update frequency and importance of pages. Some pages, although having a low probability of change, will still be re-crawled very frequently as long as the update value is extremely high.

Shawn Wang: Speaking of latency and storage, I must mention one of your classic works: "Latency Numbers Every Programmer Should Know." Is there a story behind it, or did you just jot it down casually?

Jeff Dean: It listed ten or so numbers: cache-miss overhead, branch-misprediction overhead, memory-access overhead, the time to send a packet from the US to the Netherlands, and so on.

Shawn Wang: By the way, why the Netherlands? Is it because of Chrome?

Jeff Dean: We had a data center in the Netherlands at the time. Actually, this comes back to rapid estimation. These are the most basic indicators; you can use them to make judgments: for example, if I want to do image search and generate thumbnails, should I pre-calculate thumbnails or generate them in real-time from large images? How much bandwidth is needed? How many disk seeks will occur? As long as you have these basic numbers in hand, you can do a rehearsal in your head in a few dozen seconds. When you write software using more advanced libraries, you also need to cultivate the same intuition: for example, roughly how long does it take to query data once in a certain structure.
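Dean's thumbnail question is a nice worked example of this kind of mental rehearsal. The constants below are order-of-magnitude values in the spirit of the "latency numbers" list, not exact figures:

```python
# Rehearsing the image-thumbnail design question with rough constants.
DISK_SEEK_S      = 10e-3     # ~10 ms per disk seek
DISK_READ_BPS    = 100e6     # ~100 MB/s sequential read
THUMB_BYTES      = 10e3      # precomputed thumbnail, ~10 KB
FULL_IMAGE_BYTES = 1e6       # full-size image, ~1 MB
RESIZE_S         = 5e-3      # assumed CPU cost to downscale one image

def serve_precomputed():
    # One seek plus a tiny sequential read.
    return DISK_SEEK_S + THUMB_BYTES / DISK_READ_BPS

def serve_on_the_fly():
    # One seek, a large read, then the resize work.
    return DISK_SEEK_S + FULL_IMAGE_BYTES / DISK_READ_BPS + RESIZE_S

# A results page showing 30 thumbnails:
page_pre = 30 * serve_precomputed()
page_fly = 30 * serve_on_the_fly()
print(f"precomputed: {page_pre*1e3:.0f} ms, on-the-fly: {page_fly*1e3:.0f} ms")
```

A few seconds of this kind of arithmetic settles the design question (precompute) before any code is written, which is exactly the habit Dean is describing.
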

Shawn Wang: Those are mostly simple byte-level conversions, nothing exotic. I was wondering, if you were to update that article...

Jeff Dean: I think it is very necessary to calculate the computation amount in models, whether for training or inference.

Jeff Dean: A good perspective is: how much state do you need to move from memory? On-chip SRAM, accelerator HBM, DRAM, or network transmission? Then compare the cost of data movement with the cost of one actual multiplication operation in the matrix multiplication unit. Actually, the computation cost is very, very low; depending on precision, it is roughly less than 1 pJ.

Shawn Wang: Oh, I see, you are measuring it in terms of energy consumption.

Jeff Dean: Yes, the core is energy consumption, and how to build the most energy-efficient system. On the same chip, just transmitting from one side of the SRAM to the other, energy consumption can reach 1,000 pJ. This is why accelerators must use batching. If you move a model parameter from on-chip SRAM to the multiplication unit, it costs 1,000 pJ, then you had better reuse this parameter many times. This is the significance of the batch dimension. Setting batch to 256 is okay, but if it is 1, it is very uneconomical.

Shawn Wang: Yes, exactly.

Jeff Dean: Because you spent 1,000 pJ just to do one 1 pJ multiplication.

Shawn Wang: I have never heard batching explained from an energy consumption perspective.

Jeff Dean: This is the fundamental reason why everyone uses batch. Theoretically, batch=1 has perfect latency, but the waste in energy consumption and computational efficiency is too huge.

Shawn Wang: Latency is the best.

Jeff Dean: Yes, but the cost is too high.
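The energy argument in this exchange reduces to one line of arithmetic: amortize the ~1,000 pJ parameter move over the batch of multiplies it serves. The two constants are the ones quoted in the conversation:

```python
# Energy view of batching: moving one weight from on-chip SRAM to the
# multiplier costs ~1,000 pJ, while the multiply itself costs ~1 pJ.
MOVE_PJ = 1000.0   # energy to move one parameter to the multiply unit
MAC_PJ  = 1.0      # energy for one multiply-accumulate

def energy_per_mac(batch):
    # One parameter move is reused across `batch` multiplies.
    return MOVE_PJ / batch + MAC_PJ

for batch in (1, 16, 256):
    print(batch, energy_per_mac(batch))
```

At batch 1 each useful multiply costs about 1,001 pJ; at batch 256 the amortized cost drops below 5 pJ, which is the "fundamental reason everyone uses batch" in energy terms.
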

4. TPU's God-Level Decision: Adjusting Model Architecture in Reverse

Shawn Wang: Are there any god-level moves today similar to "putting the entire index into memory" back then? For example, Groq's bet on SRAM caused a huge sensation this time around. I am wondering, did you see this point early when making TPUs? After all, to support your scale, you must have anticipated it. From these phenomena, what hardware innovations or insights have you drawn?

Jeff Dean: TPUs have a very regular structure, 2D or 3D grids, with many chips connected together, each hanging HBM.

When deploying certain models, getting data from HBM costs much more in terms of cost and latency than getting it from on-chip SRAM. So if the model is small enough, you can use model parallelism to scatter it across many chips, and throughput and latency will improve significantly. Scattering a small-to-medium model across 16 or 64 chips, if it can all fit into SRAM, the improvement will be huge. This is not unexpected, but it is indeed a good trick.
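A minimal version of the "does it fit in aggregate SRAM" check follows. The 256 MB per-chip SRAM figure and the 8B-parameter model are illustrative assumptions, not TPU specifications:

```python
# How many chips are needed before a sharded model fits entirely in SRAM?
def chips_needed_for_sram(params_billion, bytes_per_param, sram_mb_per_chip):
    model_bytes = params_billion * 1e9 * bytes_per_param
    per_chip = sram_mb_per_chip * 1e6
    chips = -(-model_bytes // per_chip)   # ceiling division
    return int(chips)

# e.g. an assumed 8B-parameter model at 1 byte/param,
# with an assumed 256 MB of SRAM per chip:
print(chips_needed_for_sram(8, 1, 256))   # ceil(8e9 / 256e6) = 32
```

Once the parameter count divided by aggregate SRAM lands at a few dozen chips, model-parallel sharding into SRAM becomes the "good trick" Dean describes: every chip serves its slice without touching HBM.
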

Alessio Fanelli: What about TPU design? How do you decide the direction of improvement? For example, is there a way to reduce 1,000 pJ to 50? Is it worth designing a new chip specifically for this? The most extreme case is someone saying to burn the model directly into an ASIC. The field changes so fast; how many things are worth solving with hardware? How is this discussed internally?

Jeff Dean: There is extensive collaboration between our TPU chip design architecture team and senior modeling experts. Because you need co-design: define what the next generation TPU should look like based on the future direction of machine learning research. People doing ML hardware know that starting to design a chip today might mean it enters the data center in two years and is used for another three to four years. You must predict what machine learning computations people will want to run in the next 2 to 6 years. Therefore, a group of people must research: which ideas will work and be more important during that period. Only then can we add useful hardware features to future generations of TPUs.

Shawn Wang: Is the chip iteration cycle two generations later?

Jeff Dean: About that. Small changes can be squeezed into the next generation; large changes must start design much earlier. We do this whenever conditions allow. Sometimes we add some exploratory functions that do not take up much chip area; if they succeed, they can be directly 10 times faster; even if they fail, it only wastes a little area, which is not a big problem. But if it is a particularly large change, we will be very cautious and conduct a lot of experiments to confirm the direction is correct.

Alessio Fanelli: Are there reverse situations: because the chip design is already set, the model architecture cannot go that way because it does not match?

Jeff Dean: Definitely. You will adjust the model architecture in reverse to make training and inference more efficient on that generation of chips. The two influence each other. For example, if a future generation of chips supports lower precision, you can even train with that precision in advance, even if the current generation does not fully support it yet.

Shawn Wang: How low can precision actually be pressed?

Jeff Dean: Many people are talking about ternary precision. I personally strongly support extremely low precision because it can save a huge amount of energy. Energy is calculated per bit transmitted; reducing the number of bits is the most direct way. The industry has already achieved many results in extremely low bit precision; paired with a set of weight scaling factors, the effect is very stable.
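A minimal sketch of low-bit quantization paired with a scale factor, the combination Dean mentions. The block size and bit width here are illustrative assumptions:

```python
import numpy as np

def quantize_block(w, bits=2):
    # bits=2 gives a symmetric range of {-1, 0, 1}: effectively ternary.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax        # per-block scale factor
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    return q.astype(float) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64)                     # one block of weights
q, s = quantize_block(w, bits=2)
w_hat = dequantize_block(q, s)
err = np.mean(np.abs(w - w_hat))
print(sorted(set(q.tolist())), round(err, 3))
```

Each weight is stored in two bits plus one shared scale per block, so the bits moved per parameter (the quantity energy is billed on) collapse, while the scale factor keeps the reconstruction in the right range.
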

Shawn Wang: Interesting, low precision but with scaled weights. I never thought of this point before.

Shawn Wang: Speaking of this, I feel the concept of precision itself is very strange in sampling scenarios. We pile up so many super-computing chips, but finally, we have to hang a random number generator in front. Now there is a trend in the industry towards energy-based models and energy-oriented processors; you have obviously thought about this, can you share your views?

Jeff Dean: Indeed, there are several interesting directions. Energy-based models are one; diffusion models that do not decode token by token in order are another. There is also speculative decoding, which is equivalent to a very small draft batch; predict 8 tokens first, effectively expanding the batch size by 8 times, and finally accept 5 to 6 of them. Spread out this way, the cost of moving weights to the multiplication unit is diluted, bringing a several-fold improvement. These are all very good tricks. And you must look at it from the perspectives of real energy consumption, latency, and throughput; only then will you find the correct direction: either serving larger models, or achieving lower cost and lower latency for the same model.
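Speculative decoding's economics can be simulated with a toy model. The 0.9 acceptance probability below is an assumption tuned to reproduce the "accept 5 to 6 of 8" figure from the answer; real systems verify draft tokens against the target model's probabilities rather than flipping a coin:

```python
import random

random.seed(1)
VOCAB = list(range(100))

def draft(k):
    # Stand-in cheap draft model: proposes k tokens.
    return [random.choice(VOCAB) for _ in range(k)]

def target_agrees(token):
    # Stand-in for target-model verification (assumed 90% agreement).
    return random.random() < 0.9

def speculative_step(k=8):
    # The target model checks all k drafts in one batched pass and
    # accepts the longest agreeing prefix.
    accepted = 0
    for tok in draft(k):
        if not target_agrees(tok):
            break                      # first rejection ends the run
        accepted += 1
    return accepted

# Average accepted tokens per big-model pass over many steps:
avg = sum(speculative_step() for _ in range(10_000)) / 10_000
print(round(avg, 2))
```

Each pass of the big model now yields around five tokens instead of one, so the 1,000 pJ-per-weight movement cost is diluted roughly fivefold, which is the "effectively expanding the batch" point in the answer.
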

Shawn Wang: This idea is theoretically very attractive, just not yet truly mainstream. But in a sense it is quite elegant; if the hardware were designed well from the bottom layer, we would not need so many tricks.

Jeff Dean: There are also more frontier directions, such as analog computing substrates instead of digital circuits. Theoretically, energy efficiency could be extremely high, but the problem is you have to interface with digital systems, and the digital-to-analog and analog-to-digital conversion parts will eat up most of the energy efficiency advantages. But even looking only at the digital direction, relying on more specialized and efficient hardware, we still have huge room for improvement in energy efficiency.

5. The Era of Unified Models Has Arrived, No Need for Experts?

Alessio Fanelli: What other interesting research directions have you seen? Or are there directions you hope other researchers will try but cannot do temporarily at Google?

Jeff Dean: Our research layout is already very broad. There are many open questions: How to make models more reliable, able to do longer, more complex things containing many sub-tasks? How to achieve models calling other models as tools, combining them to complete work far more meaningful than a single model? This part is very interesting. Also, how to make reinforcement learning effective in unverifiable domains? This is a great open question. If we can replicate the progress made in mathematics and code to other fields that are not so easy to verify, model capabilities will take another big leap.

Alessio Fanelli: Noam Brown came on the show previously and said they have proven this with deep reasoning. In a sense, your AI Mode is also an unverifiable domain. I am wondering whether there are common threads here. For example, both are doing information retrieval and returning JSON. Is retrieval the part that can be scored and verified? How do you understand this problem?

Jeff Dean: You can use other models to evaluate the results of the first model, or even do retrieval. For example, let another model judge: is the retrieved content relevant? Which are the 50 most relevant out of 2,000? Such methods are actually very effective. It can even be the same model, just changing the prompt from "retrieval system" to "critic."
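Dean's "change the prompt from retrieval system to critic" pattern can be sketched as a re-ranking loop. Here `llm` is a hypothetical callable (prompt in, completion out), not any real API, and `toy_llm` is a word-overlap stand-in so the sketch runs end to end.

```python
def rerank(llm, query, docs, top_k=50):
    """Use a model as a relevance critic: score each retrieved doc 0-10, keep the top_k.
    `llm` is a hypothetical prompt -> completion callable (an assumption, not a real API)."""
    scored = []
    for doc in docs:
        prompt = (
            "You are a relevance critic.\n"
            f"Query: {query}\nDocument: {doc}\n"
            "Rate relevance from 0 to 10. Answer with the number only."
        )
        try:
            score = float(llm(prompt).strip())
        except ValueError:
            score = 0.0                     # unparseable answer -> treat as irrelevant
        scored.append((score, doc))
    scored.sort(key=lambda p: p[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Toy stand-in "model": scores by word overlap between query and document.
def toy_llm(prompt):
    q = prompt.split("Query: ")[1].split("\n")[0].lower().split()
    d = prompt.split("Document: ")[1].split("\n")[0].lower().split()
    return str(min(10, len(set(q) & set(d))))

docs = ["tpu energy use", "banana bread recipe", "chip power efficiency trends"]
print(rerank(toy_llm, "tpu energy efficiency", docs, top_k=2))
# -> ['tpu energy use', 'chip power efficiency trends']
```

The same pattern scales to Dean's "50 most relevant out of 2,000" case; only the prompt changes, not the model.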

Shawn Wang: I always feel there is a very obvious threshold: it seems simple things are all done, and what remains is particularly difficult. Actually, everyone feels this way every year. Especially in the RLVR area, everyone is asking: what is the next stage for unverifiable problems? Then everyone says: I don't know, waiting for judgment.

Jeff Dean: The good thing about this field is that there are countless smart people thinking of creative solutions to these difficult problems. Everyone sees clearly: models are strong in some things but fail in edge cases. Proposing tricks, verifying effects, and driving progress is the core of research in this field. Think about two years ago; we struggled even with GSM8K, such elementary school math problems. Now what? Models can solve International Math Olympiad and Erdős-level problems purely through language. The leap in capabilities in a year and a half is astonishing. We have not yet fully seen the path in other fields, but some have seen the dawn, and we will go all out to replicate this leap there.

Shawn Wang: Exactly.

Alessio Fanelli: For example, YouTube thumbnail generation: this feature would be very practical; we need it badly. It is practically an AGI-level demand.

Shawn Wang: Absolutely for content creators.

Jeff Dean: I am not a YouTube creator, so I am not that sensitive to this problem, but I know many people care about it.

Shawn Wang: Indeed, everyone values it highly. After all, people really "judge a video by its cover." Returning to the Math Olympiad topic, I still find it incredible: a year ago we were working on specialized systems like AlphaProof and AlphaGeometry; this year, directly "forget it, just throw it all into Gemini." How do you view this? In the past, it was generally believed that symbolic systems and large models must be combined, but later everyone directly chose: solve everything with large models.

Jeff Dean: I think this is very reasonable. Humans do manipulate symbols, but we probably do not have a clear symbolic system in our brains; rather, it is some kind of distributed representation, essentially close to neural networks. A large number of neurons produce activation patterns under specific circumstances, allowing us to reason, plan, do chain-of-thought, and find another path if one does not work. In many ways, neural network-based models are actually simulating what happens in our brains intuitively. So for me, separating completely discrete, independent symbolic systems from another completely different thinking mechanism has never made much sense.

Shawn Wang: Interesting. It might be taken for granted for you, but a year ago I did not think so.

Jeff Dean: Look at the Math Olympiad as well: at first, problems had to be translated into Lean and solved with specialized tools; the next year, a specialized geometry model was still needed; by this year, it switched to a unified model, essentially the official production model, just given somewhat more reasoning compute.

This is actually good; it shows that the capabilities of general models have improved significantly, no longer needing specialized models. This is very similar to the wave of machine learning development from 2013 to 2016: before, every task required training a separate model; train one for recognizing road signs, train one for speech recognition. Now, the era of unified models has truly arrived. The key is how these models generalize to new tasks they have never seen, and they are becoming stronger and stronger.

Shawn Wang: And no need for domain experts. I previously interviewed someone from a related team; he said: I completely do not understand the Math Olympiad, do not know where the competition is held or what the rules are; I just train the model. Quite interesting; now, as long as you have the general skill of machine learning, plus data and compute, you can handle almost any task. This is probably the so-called "Bitter Lesson."

Jeff Dean: I believe that general models will outperform specialized models in the vast majority of cases.

6. Future Model Knowledge Installed Directly, "Like Downloading Software Packages"

Shawn Wang: I want to follow up on this point. I feel there is a catch here: a model's capacity is finite; the knowledge it can hold is bounded by the bits its parameter count allows. Everyone knows Gemini Pro has trillions of parameters, but nobody outside knows the exact number. For models like Gemma, though, many people want open, locally runnable small models, and those inevitably cannot hold all knowledge. Large models can afford to know everything, but small models, through distillation and compression, end up memorizing a lot of rarely useful facts. So can we separate knowledge from reasoning?

Jeff Dean: You indeed hope the model makes reasoning as strong as possible while possessing retrieval capabilities. Using precious parameter space to remember obscure knowledge that can be looked up is actually not the optimal use. You prefer parameters to be used for capabilities that are more general and useful in more scenarios. But at the same time, you do not want the model to be completely detached from world knowledge. For example, knowing roughly how long the Golden Gate Bridge is, having a basic concept of "how long is a bridge," such common sense is useful. It does not need to know the length of some remote small bridge in the world, but possessing a considerable scale of world knowledge is helpful; the larger the model, the more it can hold. But I do indeed believe that combining retrieval and reasoning, making the model good at multi-round retrieval, will be a key direction.

Shawn Wang: And reasoning based on intermediate retrieval results will make the model look much stronger than it actually is. For example, personal Gemini.

Jeff Dean: It is unlikely that we will take my emails to train Gemini. A more reasonable way is: use a unified model, treating retrieving my emails and my photos as tools, letting the model reason and interact based on this information, completing tasks in multiple rounds. This makes sense.

Alessio Fanelli: Do you think vertical domain models make sense? For example, many people say "we want to make the best medical large model, the best legal large model." Are these just short-term transitional solutions?

Jeff Dean: No, I think vertical models are valuable. You can start from a very strong base model and then enrich the data distribution in vertical fields like medicine or robotics. We are unlikely to stuff all robot data into Gemini training because we need to maintain a balance of capabilities. We will show it some robot data, but if you want an extremely good robot model, you need to train on top of the general model with more robot data. It might lose a little translation capability as a result, but its robotics capability will improve significantly.

When training the base Gemini, we have always been making this kind of data-ratio trade-off. We really want to add data in over 200 languages, but this will crowd out other capabilities: maybe Perl programming will not be that strong, Python can be preserved, but other niche languages or multimodal capabilities might be affected. So I believe the future is a combination of specialized models and modular models. You can simultaneously possess 200 languages, super-powered robotics modules, and super-powered medical modules, and invoke them in different scenarios. For example, when handling medical problems, use the medical module together with the base model, and the effect will be better.

Shawn Wang: Installable knowledge.

Jeff Dean: Exactly.

Shawn Wang: Like downloading software packages.

Jeff Dean: Part of installable knowledge can come from retrieval, and another part should come from pre-training, for example, pre-training with 100 billion or 1 trillion tokens of medical data.
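One hedged way to picture "installable knowledge" from pre-training is module merging: store a domain module as a delta from the base weights and add it on only when that scenario is active. A toy NumPy sketch (real systems would more likely use adapters such as LoRA, not raw full-weight deltas):

```python
import numpy as np

class ModularModel:
    """Base weights plus named, installable domain modules stored as weight deltas."""
    def __init__(self, base_weights):
        self.base = {k: v.copy() for k, v in base_weights.items()}
        self.modules = {}

    def install(self, name, domain_weights):
        # Store the module as a delta from the base -- "download a package".
        self.modules[name] = {k: domain_weights[k] - self.base[k] for k in self.base}

    def weights(self, active=()):
        # Compose base + any active modules at inference time.
        w = {k: v.copy() for k, v in self.base.items()}
        for name in active:
            for k, delta in self.modules[name].items():
                w[k] += delta
        return w

rng = np.random.default_rng(0)
base = {"layer0": rng.standard_normal((4, 4))}
medical = {"layer0": base["layer0"] + 0.1}   # pretend result of a medical fine-tune
m = ModularModel(base)
m.install("medical", medical)
assert np.allclose(m.weights(["medical"])["layer0"], medical["layer0"])
assert np.allclose(m.weights()["layer0"], base["layer0"])
```

The base model stays untouched; "installing" or "uninstalling" the medical module is just adding or dropping its delta at inference time.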

Shawn Wang: The Gemma 3 paper already has a hint of this flavor.

Alessio Fanelli: The problem is, how many billions of tokens do you actually need to catch up with the progress speed of frontier base models? If I want to make a stronger medical model, and the main model Gemini is still evolving continuously, do I need 50 billion tokens? 100 billion? If 1 trillion medical tokens are needed, that data simply does not exist.

Jeff Dean: Medical is a particularly challenging field. We do not have suitable access rights to much medical data, but many medical organizations hope to use their private data to train models. So the opportunity lies in: cooperating with large medical institutions to customize models for them; the effect is likely to be better than general models trained only on public data.

Shawn Wang: By the way, this is also a bit like the topic of language. One of your favorite examples is: put low-resource languages into the context, and the model can learn directly.

Jeff Dean: Yes, we used a language called Kalamang, which is extremely low-resource; only about 120 people in the world speak it, and it barely has a written form.

Shawn Wang: Just put it into the context; stuff the entire dataset in.

Jeff Dean: For languages like Somali and Amharic, there is some text in the world. We will not put all data into Gemini training, but the more we put in, the stronger the model's capabilities become.

Shawn Wang: I personally have a side interest in linguistics; I took a few courses in college. If I were a linguist with access to these models, I would ask fundamental questions about language itself. For example, the Sapir-Whorf hypothesis: to what extent does the language you speak shape your thinking? Some languages have concepts that other languages lack, and many concepts are duplicated across languages. There is also the well-known "Platonic representation" paper: given, say, a picture of a cup and a large amount of text containing "cup," the representations eventually converge to roughly the same place. In theory this logic also applies across languages, but there are places where it does not, and those are precisely where unique human conceptual differences show up; some concepts do not even exist in English. I find this part very interesting.

Jeff Dean: Early on, I made a model that combined text representation and image models, trained on data like ImageNet, and then fused the top-layer representations. You will find that given a new picture the model has never seen, it can often give the correct label. For example, the model has learned telescopes and binoculars but has never seen a microscope. Show it a picture of a microscope, and it can actually output the label "microscope," even though it has never seen an image with this label.
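The zero-shot effect Dean describes (labeling a never-seen class by its position in a joint text-image embedding space, reminiscent of work like DeViSE) can be sketched with hand-made toy vectors. The vectors and labels here are invented for illustration; the point is only the nearest-label-by-cosine mechanic.

```python
import numpy as np

# Hypothetical joint embedding space: word vectors for labels, hand-made so that
# "microscope" lies near the other optical instruments.
label_vecs = {
    "telescope":  np.array([0.9, 0.1, 0.0]),
    "binoculars": np.array([0.8, 0.2, 0.0]),
    "microscope": np.array([0.7, 0.3, 0.1]),   # never used as a training label
    "banana":     np.array([0.0, 0.1, 0.9]),
}

def classify(image_embedding, labels):
    """Zero-shot: pick the label whose text vector is closest (cosine) to the image."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(labels, key=lambda name: cos(image_embedding, label_vecs[name]))

# An image model trained only on telescopes/binoculars/bananas still maps a
# microscope photo into the optical-instrument region of the shared space:
microscope_photo = np.array([0.72, 0.28, 0.12])
print(classify(microscope_photo, label_vecs))  # -> microscope
```

Because the text side of the space already contains "microscope," the image side can name a class it never saw a labeled example of.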

Shawn Wang: That is so cool.

7. Started Pondering at Age 8: Using Compute to Build Large Neural Networks

Shawn Wang: Given your vision, we have talked about hardware, models, research; what kind of questions do you most hope to be asked?

Jeff Dean: There is one thing I find quite interesting. My undergraduate thesis in 1990 was on parallel training of neural networks. At that time, I felt that neural networks were the correct direction of abstraction, but compute power was far from enough. The 32-core parallel computer in the department could only run slightly interesting models, far from solving real problems. It was not until 2008 and 2009 that Moore's Law brought enough compute power, plus larger datasets, that neural networks truly began to solve real problems people cared about: speech, vision, and finally language.

When I started doing neural networks at Google in late 2011, I firmly believed: we must use large-scale parallel computing to scale up the size of neural networks. I even picked up some ideas from my undergraduate thesis, including model parallelism and data parallelism, and made comparisons. It can be said that I have been pondering these things since I was 8 years old, only the names were different back then.

Shawn Wang: Is that paper public? Can we find it?

Jeff Dean: Yes, it can be found online. Over the past 15 years, integrating these technologies together and going all out for scaling has been very critical. This includes progress at the hardware level, such as promoting the research and development of specialized chips like TPUs, and also at the software level, doing higher-level abstractions to allow people to more easily hand over ideas to computers to implement.

Shawn Wang: Did you agree with this viewpoint at the time? Or do you have a different review now?

Jeff Dean: Are you talking about the "brain market" mechanism for compute quotas?

Shawn Wang: Yes, compute quotas. David previously served as VP of Engineering at OpenAI and later went to Google. His core viewpoint is: OpenAI dared to go all in, betting everything on one thing; while Google is more "democratized," everyone has their own quota. If you believe scaling is important, then this is a key decision at the company level.

Jeff Dean: I partially agree. In fact, I even wrote a one-page memo at that time, saying that fragmenting our resources was very stupid. At that time, Google Research and the Brain team were making large language models, other departments were doing multimodal, and DeepMind was also making models like Chinchilla and Flamingo. The result was that not only was our compute power split, but our best talent and energy were also split. I said at the time, this is too silly; why not merge them together, concentrate forces to make a unified large model that is multimodal and all-capable from the start? This is the origin of the Gemini project.

Shawn Wang: Your one-page memo succeeded; very nice. Was the name decided at that time? Everyone knows you named Gemini.

Jeff Dean: I named it. There was another candidate name at the time, but I felt that two teams coming together, in a sense, is like twins. And NASA also had the Gemini program, a very critical step before the Apollo moon landing. So this name is very suitable, representing twins joining hands.

8. One of the Most Prolific Engineers in History Writing Code: Leading 50 "AI Interns"

Alessio Fanelli: Great. I know time is running out; finally, I am curious: how do you use AI to write code now? You can be said to be one of the most prolific engineers in computer history. I read an article about your collaboration with Sanjay; you said: find someone whose thinking meshes with yours to pair program, and the two of you combined will be a complementary force. I was wondering, how do you view code agents? How would you shape a code assistant compatible with your thinking? How would you rate current tools? Where is the future direction?

Jeff Dean: First, coding tools are much stronger than a year or two ago; now you can really hand over more complex tasks to them. The way human engineers interact with coding models in turn determines how the model cooperates with you. You can have it write complete tests, or have it help you brainstorm performance-optimization ideas. The way you interact with it will determine its output style, the granularity of problem-solving, and whether it acts more autonomously or aligns with you more frequently. No single style is universal. For some problems, you need high-frequency interaction; for others, you just say "help me implement this" and that is it.

In the future, there will be more independent software agents to do various things for you. The difficulty lies in designing suitable human-machine interaction modes and interfaces, deciding when it should interrupt you: "I need more guidance" or "I am done, what is the next step." We do not have the ultimate answer to this part yet; interaction modes will change as models become stronger. You can imagine: you bring 50 interns; how would you manage them? If they are very capable, you might really want 50.

Shawn Wang: But the management cost is also very high.

Jeff Dean: Correct. But in the future, it is very likely that everyone will own 50 virtual interns. So how should you arrange them? You will definitely organize them into groups; you do not need to manage 50 people, just interface with 5 groups and let them handle the work respectively. What it will eventually evolve into, I am not entirely sure.

Alessio Fanelli: What about collaboration between people? The benefit of AI-assisted programming is that it can bring new ideas. But if a large number of code agents are writing code in parallel, it will be very difficult for others to intervene because they have to catch up with a huge amount of context. Are you worried that people in the team will become more isolated?

Jeff Dean: It is possible. But on the other hand, traditional teams without AI assistance, with 50 people working, naturally have a hierarchical organizational structure, with little interaction between groups. But if it is 5 people, each managing 50 virtual agents, the communication bandwidth between these 5 people might actually be higher than the traditional model of 5 group leaders coordinating 50 people.

Alessio Fanelli: Has your own work rhythm changed? Do you spend more time aligning architecture and design goals with people?

Jeff Dean: I think a very interesting point is: before, when teaching others to write software, they would say to write clear requirement documents, but everyone actually did not take it seriously. But now, if you want agents to help you write code, you must define requirements extremely clearly; this will directly determine output quality. If you do not say it needs to handle a certain edge case or emphasize performance requirements, it might not do it. People will become better and better at describing goals clearly and unambiguously; this is actually not a bad thing; regardless of whether one is an engineer, it is a useful skill.

Shawn Wang: I joke that giving instructions to models now is no different from communicating with senior executives; it is like writing internal memos, word by word. And I think multimodality is very important. Google's Antigravity team started with very strong multimodal capabilities, including video understanding. That is the highest-bandwidth "prompt" you can give a model; it is very powerful.

Alessio Fanelli: How do you usually organize those experiences in your head? For example, your super-strong performance optimization intuition; everyone says you can see where efficiency can be improved at a glance. So if these experiences were written into general documents and then let models retrieve and learn, would it be very valuable? Like edge cases, it is a good example. People doing systems have specific edge scenarios in their heads, but now they have to repeat them every time. Do you think people will spend more time writing documents and refining general experiences?

Jeff Dean: I do believe that well-written software engineering guides will be very useful. They can serve as input for models and also as references for other developers, making clearer what the underlying system should implement when writing prompts. They do not need to be customized for every scenario; general guides placed in a code agent's context are already very helpful. For example, for distributed systems, you can list what types of failures to consider and what the standard remedies are, such as Paxos-based replication, or hedged requests (send the same request to two replicas and accept whichever returns first, tolerating one failure). Summarizing 20 such distributed-system design techniques can go a long way toward improving a code agent's ability to generate reliable, robust distributed systems.
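One technique Dean lists, sending the same request to two replicas and tolerating failure as long as one returns (commonly called a hedged request), can be sketched with threads and toy callables standing in for real RPCs:

```python
import threading, queue, time

def hedged_request(replicas, request, hedge_after=0.05):
    """Send `request` to the first replica; if no reply within `hedge_after` seconds,
    fire the same request at a second replica and return whichever answers first."""
    results = queue.Queue()

    def call(replica):
        try:
            results.put(replica(request))
        except Exception:
            pass                            # a failed replica simply never answers

    threading.Thread(target=call, args=(replicas[0],), daemon=True).start()
    try:
        return results.get(timeout=hedge_after)
    except queue.Empty:
        # Primary is slow or dead: hedge with the second replica.
        threading.Thread(target=call, args=(replicas[1],), daemon=True).start()
        return results.get(timeout=5.0)     # tolerate one slow/dead replica

# Toy replicas: one is stuck, one is fast.
slow = lambda req: (time.sleep(10), "slow")[1]
fast = lambda req: f"ok:{req}"
print(hedged_request([slow, fast], "lookup:42"))  # -> ok:lookup:42
```

The hedge delay trades extra load (duplicate requests) for tail latency: the second request only fires when the first looks slow, which is exactly the kind of rule worth writing down once in a design guide rather than rediscovering per project.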

9. Latency Can Break Through 10,000 Tokens/s; Humans Do Not Need to Read Code Anymore

Shawn Wang: I was just thinking: when will Gemini be able to build Spanner itself (Google's globally distributed relational database, which appears to sidestep the CAP theorem's impossibility trade-off)?

Alessio Fanelli: Maybe it has already seen all the code. This is a good example. The CAP theorem is an accepted truth that cannot be broken, but in the end, you still made something that seems to break it.

Shawn Wang: I am curious, does the model count as "breaking" it in some sense? Would you say you broke the CAP theorem? Under specific assumptions, such as precise clock synchronization.

Alessio Fanelli: Sometimes you do not have to stick to so-called truths. But models sometimes believe what you tell them too much.

Jeff Dean: Returning to the issue of prompts and iteration. I have always wanted to do a comparative experiment: one way is to use three fast but ordinary model calls, adding human alignment in between; humans look at the result once, then give a new prompt; the other way is to spend a long time writing a super-long, super-detailed prompt and directly throw it to a super-strong model to finish in one go. I want to see the difference in effect between these two ways. Often the effect is not good not because the model is incapable, but because the requirement description is incomplete; the model simply cannot guess what you want.

Shawn Wang: The task is underspecified; the model can generate 10 results, only one of which is what you want. In that case, multi-round interaction with a lightweight, fast model is sufficient.

Jeff Dean: I attach great importance to latency. Low-latency interaction experience is much more comfortable than systems that are 10 or 20 times slower. In the future, we will see overall latency of models, software, and hardware be 20 or 50 times lower than now; this is crucial for systems requiring a large amount of interaction.

Shawn Wang: Now there are two extremes: one side is extremely fast, the other side is extreme deep thinking like DeepThink.

Jeff Dean: If cost and latency are not considered, everyone would always use DeepThink. If underlying hardware and systems improve latency by another 20 times and costs come down, there is no reason not to use it.

Shawn Wang: The Pareto curve will keep going up, constantly expanding. Let's ask some predictions. Do you have any small tests you have been paying attention to, or things you feel are not good enough now but will be realized soon?

Jeff Dean: I will mention two predictions that do not quite fit this category. First, personalized models that know you and can access all your authorized personal data will bring huge value improvements compared to general models. Being able to associate all my emails, photos, videos watched, and all information will be very useful. Second, increasingly specialized hardware will make model latency lower, capabilities stronger, and costs more affordable; this point will also be very critical.

Shawn Wang: You mentioned low latency; everyone generally measures it in tokens per second. Now it is about 100 tokens/s; do you think it can reach 1,000? Is 10,000 meaningful?

Jeff Dean: Absolutely. Because there is chain-of-thought reasoning. You can perform more rounds of deduction in parallel, generate more code, and then use chain-of-thought to verify correctness. 10,000 tokens/s will be very powerful.

Shawn Wang: At 10,000 tokens/s, humans do not need to read code anymore; just let the model generate it.

Jeff Dean: It may not ultimately output 10,000 tokens of code; maybe only 1,000 tokens of code, but behind it there is a 9,000-token reasoning process; the quality of such code will be much higher.

Alessio Fanelli: Just like "if I had more time, I would have written a shorter letter." Jeff, today was great; thank you for your time.

Jeff Dean: Very happy; thanks for the invitation.

Reference Link: https://youtu.be/F_1oDPWxpFQ

Disclaimer: This article is organized by InfoQ and does not represent the platform's views. Unauthorized reproduction is prohibited.
