Z Tech | In Conversation with Zihan Wang: Leaving DeepSeek, and the Reverse Thinking That Defined My Journey

01 Introduction

Many first came to know Zihan Wang through Twitter.

At the time, following the successive releases of DeepSeek R1 and V3, Western tech communities became widely aware of this Chinese company for the first time and turned their attention to the researchers on its front lines. This young researcher's Twitter account suddenly attracted substantial attention. He still remembers how stunned Western practitioners were by DeepSeek, as if witnessing a "mysterious force from the East," and how many curious rumors circulated; even photos of CEO Liang Wenfeng were misidentified and remain uncorrected to this day.

Initially, he wanted to do one simple thing: set the record straight by explaining how DeepSeek conducts research, how the team works, and which technical details had been overlooked, offering a perspective closer to the front lines before the information became distorted. Coincidentally, while we were preparing this interview yesterday, DeepSeek V4 was released, and Wang's early experiences at DeepSeek add firsthand detail about this mysterious company.

But what defines Wang more than this somewhat accidental rise to fame is an earlier and more stable technical path—his continuous exploration of Agent systems.

The timing of his entry into Renmin University of China to begin computer research happened to coincide with a "pre-paradigm" stage: GPT-2 had already validated the potential of generative architectures, but the mainstream focus of academia and industry remained on non-generative paradigms represented by BERT—deepening work around classification, information retrieval, representation learning, and task decomposition. It was from that stage that he steadily advanced along a clear yet understated technical path: starting from recommendation systems and information retrieval algorithms, extending to a reinforcement learning exchange program at Berkeley, and collaborative research on the MINT Agent benchmark with UIUC; subsequently entering DeepSeek, conducting in-depth exploration around Expert Specialization in MoE (Mixture of Experts) models, and in his subsequent doctoral stage, further probing the underlying mechanisms of Agent reinforcement learning, continuously questioning its capability boundaries and implementation paths.

Unlike many researchers who entered this field starting from large model capabilities, his starting point was more humble: Can an AI system, like humans, learn and improve autonomously without continuous external guidance?

Under this question, he introduced Markov Decision Processes (MDP) to abstract the Agent's decision-making loop: state, action, transition, and reward together constitute a self-consistent system. But his interest extended beyond traditional reinforcement learning's focus on "policy optimization" to a more challenging theme—building Agents that truly understand the world, completing rehearsal and simulation of the future internally before action occurs.

This also became the starting point for all his subsequent work. As a second-year direct doctoral student, he has published over ten papers at top AI conferences including NeurIPS, ICLR, CVPR, and EMNLP, with over 1,600 Google Scholar citations, and has received honors including the NeurIPS LAW Outstanding Paper and ICCV SP4V Best Paper. Whether it was the earliest exploration of Agentic scaling law, or the subsequently advanced RAGEN 1/2, VAGEN, MindCube frameworks, the core points to the same question: how to transform Agent decision-making from "response to input" to "judgment based on world evolution."

Image: RAGEN 1, provided by the interviewee.

In this conversation, we attempt to return to the starting points of these questions: from his earliest research experiences, through his frontline practice at DeepSeek, to his current systematic thinking about Agents, to reconstruct how his personal research and exploration unfolded step by step. The following is the dialogue transcript between Z Potentials and Zihan Wang. Enjoy!

Z Highlights:

• Later I gradually discovered that many seemingly sophisticated ideas might just be packaging, and sometimes when reproducing experiments, they wouldn't run at all. I began developing discernment, able to see which works look glamorous with complex formulas but are actually unsound. I developed a reverse thinking: since some seemingly sophisticated fields might not be so, might some seemingly engineering-oriented fields actually not be that simple either, requiring significant effort to produce a paper?

• I was particularly struck at the time: how could there be a company with such high researcher density? Places I had been to before might have only 10 dedicated researchers out of 200 people, but at DeepSeek, nearly everyone among those 200 was doing research-related work to some degree. Even if not dedicated researchers, they would share the latest large model developments and major tech company updates in group chats daily. Even HR would forward relevant news. The atmosphere was completely different.

• Another thing that deeply impressed me: at DeepSeek there was a senior colleague working on infrastructure. The first time I submitted code, this colleague reviewed it line by line, finding optimization opportunities in every single line. For example, avoiding tensor recloning through in-place operations. I thought it was so amazing.

• Someone once asked me: what exactly is an Agent? I think whether something counts as an Agent depends on what physical or digital environment it is placed in. Give it a fully open computer environment, and it's OpenClaw; give it a restricted computer environment, and it's Claude Code or Codex; give it only a chat interface, and it's GPT. The degree of environmental openness determines the Agent's intelligence index from 0 to 1.

• Many task setups give you a sum of money to make the task as perfect as possible. But what's more important: a person or Agent with true resource adaptability can deliver $10,000 worth of results given $10,000, and $1 million worth of results given $1 million. What we hope to build is this highly resource-constraint-adaptive Agent.

02 From Renmin University IR to Berkeley RL: "No Connections? Break Through via Office Hours"

ZP: Welcome, Zihan. Let's start with your early research experiences. During your initial time at Renmin University, what opportunities led you to the AI field? Were there any particular stories at that time?

Wang Zihan: I entered AI relatively early, starting undergraduate studies in 2020 and formally beginning AI-related research in early 2021. This was thanks to Renmin University's training model: no major division in the first semester, with all science students taking classes together and high freedom in course selection. The school also offered courses in artificial intelligence and statistics. During that period, I actually leaned more toward statistics, as domestic consensus then held that undergraduates should build strong mathematical foundations, studying more mathematics and statistics.

But I didn't want to follow the statistics path exclusively, so I proactively contacted professors at the School of Artificial Intelligence and joined a research group. At that time, GPT-3 already existed, but there was far less research on text generation models than on non-generative models like BERT. In the group, I mainly worked on recommendation systems and search algorithms, using relatively basic DPR and RAG for QA tasks. Frankly, that research was tedious: the models lacked generative capabilities and required a lot of manual tuning. For example, QA required extracting answer spans from the original text, and conditional QA additionally required extracting conditional features and mapping conditions to answers one-to-one. Although the methods were traditional and manual, I could already sense AI's significance: real-world applications of our models were gradually shifting toward natural language, which was already much broader than the traditional structured-data direction next door, where people were working on SVMs.

ZP: When you first entered the AI field, were your topic selections or research directions basically arranged by your advisor in the group?

Wang Zihan: I chose my advisor because of their good reputation in RUC's School of AI and their students' good outcomes; at first the choice rested mostly on reputation and intuition. My direction also changed later. I started with information retrieval (IR). After completing that project, I began thinking about going abroad and applied for an exchange at Berkeley in my junior year.

The direction changed several times after that. Looking back at my undergraduate stage, the most interesting part was still that IR research experience. We had a paper submitted to CIKM; the core question was: can we use generative models for information retrieval? At the time we tried having GPT generate tokens corresponding to documents one by one, with each document corresponding to a token sequence. For recommendations or searches, we had the model generate this token sequence, and whichever document it matched, we returned that document. The difficulty here was somewhat similar to early GPT hallucinations—ask it to cite literature, and it would fabricate non-existent entries. To solve this problem, we proposed constrained decoding, limiting the model to a document library, forcing it to decode only within token sequences in the library, ensuring generated results precisely pointed to articles within the library.
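
The constrained decoding described above can be illustrated with a minimal sketch. It assumes each document is identified by a token sequence stored in a prefix tree (trie), and that at each step the decoder may only pick tokens that continue some identifier in the library. The names `build_trie`, `constrained_decode`, `toy_scores`, and the `END` marker are hypothetical, and the scoring function is a stand-in for a real language model; this is not the implementation from the paper.

```python
from typing import Callable, Dict, List, Sequence

END = -1  # hypothetical marker meaning "the identifier is complete"


def build_trie(doc_ids: Dict[str, Sequence[int]]) -> dict:
    """Build a prefix tree over the token sequences that identify documents."""
    root: dict = {}
    for name, tokens in doc_ids.items():
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = name  # the leaf records which document this sequence names
    return root


def constrained_decode(trie: dict, score_fn: Callable[[List[int], int], float]) -> str:
    """Greedy decoding where only tokens allowed by the trie can be chosen."""
    node, prefix = trie, []
    while True:
        allowed = list(node.keys())  # tokens that keep us inside the library
        best = max(allowed, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return node[END]
        prefix.append(best)
        node = node[best]


def toy_scores(prefix: List[int], tok: int) -> float:
    """Stand-in for model logits; token 4 is preferred but names no document."""
    preference = {4: 3.0, 5: 2.0, 2: 1.5, 9: 1.0, 7: 0.5, 3: 0.1}
    return preference.get(tok, 0.0)


library = {"doc_a": [5, 2, 9], "doc_b": [5, 7], "doc_c": [3, 1]}
result = constrained_decode(build_trie(library), toy_scores)
print(result)  # doc_a
```

In this toy run the unconstrained favorite, token 4, never appears in any identifier, so the decoder cannot "hallucinate" it; every output is guaranteed to name a document in the library.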

ZP: What did you gain throughout the overseas academic exchange process?

Wang Zihan: The Berkeley experience made me extremely independent. Among seniors I knew, only one had been to Berkeley, and that was for mathematics, completely unrelated to me, with no experience to reference. When I first arrived, I was unfamiliar with the place and people, and didn't even think I could find a professor to do research with.

Without existing connections, I found my way in through classes. Taking courses let me use professors' office hours to talk with them directly and gave me chances to follow along and learn. I took Sergey Levine's reinforcement learning class, asking questions proactively after every lecture. The course final project left a deep impression on me; that's when I started using OpenAI Gym. I found RL particularly interesting, which is also why I later returned to RL after various detours. In my view, the difference between RL and SFT is that RL gives models the possibility of self-evolution, as in the step from AlphaGo to AlphaZero. The final project allowed us to choose our own topic. I had noticed OpenAI's VPT (Video Pre-training) work, which lets models learn dynamics models by watching videos, much as humans learn to play by watching gaming livestreams. I implemented a lower-budget version in a simplified 2D Minecraft-like environment, with decent results, and got full marks in that class.

At the time I was still in the exploration stage and was quite satisfied with that full-mark project. But I also realized I couldn't stay just at the course level. I saw classmates turn course projects into papers and successfully get them accepted, which I found inspiring.

I asked Sergey about doing research with him, and he referred me to a senior, but after talking we found our research interests didn't match very well. After that, I tried contacting several other groups, both within Berkeley and outside, and worked seriously with some of them for quite a while, but for a long time I didn't actually bring a project to completion.

At first I thought research was a sacred endeavor, requiring studying grand concepts or sophisticated ideas. But later I gradually discovered that many seemingly sophisticated ideas might just be packaging, and sometimes when reproducing experiments, they wouldn't run at all. I began developing discernment, able to see which works look glamorous with complex formulas but are actually unsound. I also no longer held a reverent attitude toward research like in my freshman and sophomore years, viewing others' work more from an observer's perspective.

This mindset persisted until applying for summer research. I developed a reverse thinking: since some seemingly sophisticated fields might not be so, might some seemingly engineering-oriented fields actually not be that simple either, requiring significant effort to produce a paper?

At that time I found mentors at UIUC: Heng Ji and my mentor Xingyao, who now works at All-Hands AI, a startup building coding Agents. We discussed whether to build a benchmark together. Many people think benchmarks are simple and not "sexy," but my earlier reflection had taught me that seemingly simple things also demand extreme rigor, such as building a taxonomy, defining capability dimensions, and writing a large number of rigorous test cases. Only then did I understand that building a benchmark is itself not easy.

I reached out to him in March 2023, and he proposed that we work on an Agent benchmark together.

ZP: At that time, what did people understand an Agent to be?

Wang Zihan: ChatGPT appeared at the end of 2022, and for the first time many people realized AI could converse fluently, but few took the next step and asked: besides chatting, can AI actively operate tools in the real world? Can the tokens it generates be turned into real actions, and can it read the environment's feedback after they execute? Habits of thought were very strong at the time. QA was still generally done by extracting features with BERT, and breaking out of that habit was actually quite challenging.

We had just started planning the Agent benchmark when Meta published Toolformer in February 2023, which was one of the most cutting-edge Agent-related works at the time. It defined five tools including calendars and calculators, letting Agents complete simple math problems and other tests. Although it proposed basic tool-use thinking, it didn't form a systematic benchmark.

So we thought: since everyone sees Agent potential, what's the next step? We realized that in the process of Agents interacting with the world, there are two types of core resources that are crucial: one is tools, and the other is humans.

At the time ChatGPT was also advancing tool capabilities, so we conceived a tool plus human feedback Agent architecture, somewhat similar to the later TauBench approach: letting Agents be able to call a series of tools and continuously optimize decisions combined with human feedback. The essences of these two types of feedback are fundamentally different:

• Feedback from tools is verifiable objective facts, such as query and calculation results, which Agents should directly use as real bases;

• Feedback from humans is noisier: users may complain, express themselves unclearly, or leave the Agent needing to ask follow-up questions to clarify their intent.

Image: MINT benchmark framework, provided by the interviewee.

Based on this, we built a benchmark test integrating tools plus Agent plus simulated user. This work was completed after the summer research ended, released around September 2023. Since then, I began systematically and deeply researching Agent-related directions.
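
To make the "tools plus Agent plus simulated user" setup concrete, here is a minimal sketch of one evaluation episode. It is an illustration under stated assumptions, not the MINT implementation: the `Feedback` record, `run_episode`, the toy agent, the `multiply` tool, and the simulated user are all hypothetical. The point it shows is the distinction drawn above: tool output is recorded as verifiable, while user replies are recorded as noisy hints.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Feedback:
    source: str       # "tool" or "user"
    content: str
    verifiable: bool  # tool results are objective; user replies are noisy


def run_episode(agent_step: Callable, tools: Dict[str, Callable],
                simulated_user: Callable[[str], str],
                max_turns: int = 5) -> Tuple[Optional[str], List[Feedback]]:
    """Alternate between the agent, tools, and a simulated user for a few turns."""
    history: List[Feedback] = []
    for _ in range(max_turns):
        kind, payload = agent_step(history)
        if kind == "tool":
            name, arg = payload
            history.append(Feedback("tool", str(tools[name](arg)), True))
        else:  # the agent proposes a final answer
            reply = simulated_user(payload)
            if reply == "correct":
                return payload, history
            history.append(Feedback("user", reply, False))
    return None, history  # turn budget exhausted


def toy_agent(history: List[Feedback]):
    """Guess first, fall back to a tool after vague user pushback."""
    tool_results = [f.content for f in history if f.source == "tool"]
    if tool_results:
        return "answer", tool_results[-1]  # trust the verifiable result
    if any(f.source == "user" for f in history):
        return "tool", ("multiply", (6, 7))
    return "answer", "40"


tools = {"multiply": lambda pair: pair[0] * pair[1]}


def user(answer: str) -> str:
    return "correct" if answer == "42" else "that does not look right"


answer, history = run_episode(toy_agent, tools, user)
print(answer, [f.source for f in history])
```

The toy agent answers wrongly, receives an unspecific complaint from the user, calls the tool, and then answers with the tool's result, so the history contains one user turn followed by one tool turn.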

ZP: Given model capabilities at that time, Agents were still too difficult. Tool-calling was relatively weak and there was no decent reasoning, so frameworks, including multi-Agent ones, were basically hard to put into practice.

Wang Zihan: In particular, there were no really suitable tasks for Agents at that time, and overall capabilities couldn't support complex scenarios. In the end, about all that could be done was RAG and code-related tasks: letting models write code themselves, run it through validators, and then iterate based on the returned results. Looking back now, the most mainstream scenarios for pure-text Agents are still these two: search Agents and code Agents.

ZP: From 2024 to now, do you observe that benchmarks have become saturated?

Wang Zihan: Conditions were actually very limited at that stage. Even graduate-level QA benchmarks like GPQA didn't exist yet; we mainly used HotpotQA, TheoremQA, and the code benchmarks HumanEval and MBPP. From today's perspective, the tasks in those datasets are now handled quite maturely by Agents. The changes over these two-plus years have indeed been enormous.

03 Wang Zihan's DeepSeek Experience: A 200-Person Company Where Code Is Reviewed Line by Line and Even HR Discusses Model Progress

ZP: After this, you entered DeepSeek after your junior year ended. What kind of beginning was this for you? What story led you to DeepSeek?

Wang Zihan: After returning from UIUC summer research, I started applying for PhD programs. Fortunately, I received an offer from Manling's research group at Northwestern University; I had talked with her before, and our directions and styles matched very well. After that, I formally applied and confirmed my destination.

After confirming the PhD, I had a semester that was something like a gap period. My mindset was very relaxed then; the direction was set and I no longer carried all that uncertainty, so I happily sent out resumes.

At that time I only applied to two companies: one was DeepSeek, the other was a startup. Both gave offers, and I finally chose DeepSeek. The process was actually quite smooth; I didn't mass-apply. I thought I'd try casually, and if it didn't work out, I'd play and relax for the second half of my senior year. But in the end, the interviews went relatively smoothly.

DeepSeek felt different to me; they weren't conducting rote memorization interviews. Instead, they asked very targeted questions combining my research experience and the company's technical direction. Later I discovered many DeepSeek colleagues have this style. This company highly customizes interviews, showing they are very dedicated to every candidate, at least checking your resume, your research, and what you're doing beforehand. This feeling was similar to my PhD interview at the time: they care about you as a person, hoping you can land a specific research project after joining, rather than randomly assigning miscellaneous tasks and being done with it. It was precisely this point that moved me, so I joined.

ZP: At that stage, DeepSeek was still a place that wasn't so closed off. Now they basically don't recruit short-term interns. Were there many people at that time? What was the scale?

Wang Zihan: The company had about 200 people at the time. I was particularly struck: how could there be a company with such high researcher density?

Places I had stayed before might have only 10 dedicated researchers out of 200 people, but at DeepSeek, nearly everyone among those 200 was doing research-related work to some degree. Even if not dedicated researchers, they would share the latest large model progress and major company updates in groups daily. Even HR would forward relevant news. The atmosphere was completely different.

ZP: What did you mainly do at DeepSeek? Did you do your own research, or mainly participate in mainstream model training and inference?

Wang Zihan: I did both, mainly two projects: one was V2 development, and the other was expert specialization tuning.

V2 was new-model R&D; every employee took part, and everyone used the model daily at the time. I focused on observing the logic and fluency of the model's outputs; when problems arose, I traced the causes and gave feedback. This work leaned toward engineering. I approached it mostly as a learner; after all, the company was full of senior, very strong people, so whatever I learned was pure gain.

The iteration from V1 to V2 was a process of many ideas colliding. The core results outsiders saw might only be the MLA architecture and the finer-grained expert division, but internally it involved architecture optimization, post-training tuning, data collection, and many other stages. Every day you could encounter all kinds of innovative ideas, which was a very good learning opportunity. By discussing model design logic with colleagues, I also built up a lot of model R&D intuition, such as which metrics to watch and how specific code can affect model performance.

Another thing that impressed me deeply: at the time there was a senior colleague doing infrastructure at DeepSeek. The first time I submitted code, the senior reviewed it line by line, finding optimization space in every single line. For example, avoiding tensor recloning through in-place operations. I thought it was so amazing.

The project I was responsible for was more exploratory. At the time the company was gradually migrating to MoE (Mixture of Experts), and the core need was to solve the problem of specialized fine-tuning for MoE models. Back then, nearly all fine-tuning work in the industry used LoRA and its variants, whose core idea is to reduce the number of trained parameters through matrix decomposition so that not all parameters need to be adjusted. This approach works, but when we applied it to MoE models we found room for improvement.

MoE models already have an explicit expert structure. LoRA gets by with few parameters because those few parameters leverage the task-relevant local parameters in the model; in essence, it too is searching for a parameter decomposition that is effective for the task. MoE's expert structure happens to provide exactly this kind of explicit decomposition. In early pilot studies we found that the fine-grained MoE DeepSeek insisted on showed far stronger expert differentiation than the "choose one of eight" expert structures used in some papers at the time: different tasks activated completely different experts. That gave me an idea: since the core of fine-tuning is updating parameters, can we directly locate the experts most relevant to a task and fine-tune only them? This idea eventually became our ESFT paper (published at EMNLP 2024).
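
The idea of locating the experts most relevant to a task can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from the ESFT paper: it assumes the router's choices on a sample of task tokens are available as (expert id, gate weight) pairs, scores each expert by its average gate weight, and keeps the smallest set of experts whose combined share reaches a chosen threshold. The function names, the toy routing data, and the threshold value are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Routing = List[Tuple[int, float]]  # (expert_id, gate_weight) pairs for one token


def expert_relevance(token_routings: Iterable[Routing]) -> Dict[int, float]:
    """Average gate weight each expert receives over the sampled task tokens."""
    totals: Dict[int, float] = defaultdict(float)
    n_tokens = 0
    for routing in token_routings:
        n_tokens += 1
        for expert_id, gate in routing:
            totals[expert_id] += gate
    return {e: s / n_tokens for e, s in totals.items()}


def select_experts(relevance: Dict[int, float], threshold: float) -> List[int]:
    """Smallest set of top-scoring experts whose share reaches the threshold."""
    total = sum(relevance.values())
    chosen: List[int] = []
    covered = 0.0
    ranked = sorted(relevance.items(), key=lambda kv: kv[1], reverse=True)
    for expert_id, score in ranked:
        chosen.append(expert_id)
        covered += score / total
        if covered >= threshold:
            break
    return chosen


# Toy top-2 routing decisions for four tokens of one task (made-up numbers).
task_tokens = [
    [(3, 0.7), (1, 0.3)],
    [(3, 0.6), (5, 0.4)],
    [(1, 0.5), (3, 0.5)],
    [(3, 0.8), (7, 0.2)],
]
relevance = expert_relevance(task_tokens)
trainable = select_experts(relevance, threshold=0.8)
print(trainable)  # [3, 1]
```

Only the experts in `trainable` would receive gradient updates; all other experts stay frozen, which is where the memory savings and the protection of unrelated tasks would come from in such a scheme.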

In fact, I was looking for a solution to a concrete need. That's when I deeply realized that when the need is clear, writing a paper around it is very efficient. All the papers I later wrote quickly followed this logic: find a core need nobody is paying attention to yet, then deliver a targeted solution. This is completely different from spending large amounts of time polishing details and sculpting form.

Looking at the work itself, updating parameters by fine-tuning only the relevant experts has two core advantages. One is saving memory; the other is reducing noise from irrelevant experts in the MoE model, which improves the training signal-to-noise ratio. Forcing irrelevant experts to fit the current task would cause the model's performance on other tasks to plummet. Our method can fine-tune on a new task while the model's performance on the original tasks barely declines, essentially because it does not interfere with irrelevant experts, which also keeps the model from overfitting to a single task.

ZP: So MoE was a direction DeepSeek determined very early. How did they determine it? After all, besides the MoE mixture-of-experts architecture, early models like Qwen, GLM, and Llama were all dense models; only GPT-4 adopted MoE architecture. Why could DeepSeek judge so early that MoE was the future development direction?

Wang Zihan: I think the core is "truth emerges from experiments"—DeepSeek's internal experiments were conducted with extreme rigor. I learned an important concept there: merely believing in a direction yourself is not enough; you must also leave ample space for debate and verification of opposite viewpoints. Even if the team subjectively already approves a conclusion very much, they still conduct large amounts of ablation experiments, assuming the opposite viewpoint holds, to verify its feasibility and find potential problems.

When I was working on the ESFT (Expert Specialization Fine-Tuning) paper myself, I experienced this deeply. Even when I was already very certain my method was feasible, my mentor would still constantly ask me: if this method does not work, where would the problem lie? Afterward I ran a large number of ablation experiments, repeatedly verifying the method's effectiveness before finally writing it up as a paper. The core experiments actually took only one month, but the ablations and the rigorous polishing of the paper took much longer.

DeepSeek is like this, treating every technical direction with extreme rigor, comprehensively testing various components and characteristics. Only after repeated verification and confirming practical feasibility will they determine the direction. I think it was precisely this rigorous experimental attitude that let them judge so early that MoE was the core direction of the future.

ZP: In my impression, DeepSeek was also relatively early in proposing the fine-grained MoE concept, with sparsity ratio reaching 1:32, sparser than eight-choose-one or four-choose-one architectures. This design might belong to different MoE architecture thinking, or might be an engineering-driven choice. After the V2 project, did your related MoE research results ultimately apply to the model's final solution? Or are they still at the research stage?

Wang Zihan: This brings up post-training work, which involved two directions. The first is similar to today's Thinking Machines Lab: starting from a large model, customize small models for customers and provide training optimization and deployment services through APIs. At the time, OpenAI, ByteDance, and other companies had already launched similar fine-tuning features: they provide the base model, and users can train on top of it to get a customized model without needing to understand the underlying architecture. But by the time DeepSeek V3 launched, the company was more focused on improving model capabilities, so commercializing this customization direction was deprioritized.

Image provided by the interviewee.

The second direction is more exploratory, and it is not about letting downstream users customize and train models. We had already achieved the advantage of not hurting original-task performance when fine-tuning on new tasks; what we wanted to explore further was whether we could sort tasks into groups by their nature, so that tasks within a group require relatively similar capabilities, and for each group fine-tune only the experts it favors most. That way, training on any task would mitigate the "seesaw effect," where, for example, training on task A causes task B's performance to decline, forcing all tasks to be retrained repeatedly. This direction was already clear at the time, but because I had started school at Northwestern University and couldn't keep working full-time at DeepSeek, I couldn't continue pushing this research.

ZP: Did you ever think about delaying enrollment for half a year to continue working at the company? For example, waiting until the V3 project ended?

Wang Zihan: At the time I did consider both choices of staying or leaving. The reason I ultimately chose to pursue a PhD in the US was largely because the research directions in Manling's group in the US were completely inaccessible to me domestically at the time, including VLA, robotics technology, and various multimodal content.

I found the multimodal field very attractive at the time because, among the research groups I could reach domestically, very few focused on multimodal research. This was really a choice of direction; I inherently like exploring new fields, and during my undergraduate studies I changed research directions many times for various reasons. At one point I even worked on personalizing LLM personality; although it didn't produce a paper, that exploration taught me a lot. So my choice to pursue a PhD was essentially driven by research direction.

ZP: If I recall correctly, there was also a small episode—after R1 and V3 launched, you received high attention on Twitter. What exactly was the situation during that period?

Wang Zihan: My strongest impression from that period was how shocked Western industry professionals were once they learned about DeepSeek. I can hardly find the right words for it; it was roughly as if they had witnessed a mysterious force from the East. Many rumors I had never heard of also appeared, and even now many of the photos of Boss Liang that people post on Twitter are still wrong and have never been corrected.

At the time I had a lot I wanted to share: I wanted to show people what working at DeepSeek was really like, and what I felt were the company's ideals and core values. At first I even wanted to help promote the company, because when I joined it had only about 10,000 Twitter followers; later its influence rose so much that it no longer needed my promotion at all.

Actually, I have liked posting videos on Bilibili since I was very young. When I have a strong desire to express something, inspiration tends to flow, including ideas and amusing memes. These memes amuse me and give others a knowing smile, and after the smile they also prompt some thinking about the issues behind them. During that time on Twitter, I talked most about open source. Although the industry as a whole is gradually moving toward closed source, putting up a little resistance on behalf of open source still felt meaningful.

ZP: One impression DeepSeek gives me is that it has very strong capabilities at the infrastructure level, also emphasizing synergy between infrastructure and algorithm. When they write papers, they also relatively carefully expand on operators and scheduling and other implementation-level content. In such an environment, did you receive any influence?

Wang Zihan: The most typical example is the one I just mentioned: the first time I submitted code, my mentor reviewed it line by line and found room for optimization in every single line. In fact, even in DeepSeek's open-sourced V2 code, the inference part differed from other MoE models on the market at the time by only 10 to 20 lines, but every one of those lines was meticulously designed.

Even without understanding the company's internal situation, just looking at the open-source version, the quality was also very excellent—computational efficiency far exceeded other models on the market at the time.

This comes down to detail-level infrastructure optimization: how the computational graph computes gradients, how gradients backpropagate, how to achieve optimal communication, how to save resources by creating fewer tensors, and so on. I think the core of this culture is a resource-budget consciousness: how to make optimal decisions under limited resources. When I joined, the company's resources were actually very ample; 200 people with 10,000 GPUs was completely unimaginable to me as an undergraduate. But I later realized that for training a very large model, even 10,000 GPUs is not enough, which underscores the importance of infrastructure optimization and efficient resource utilization.

ZP: Very coincidentally, the day before our article was published, DeepSeek released V4. What do you think of this new release?

Wang Zihan: I have nothing special to say about the model and technical routes; I think they have always been on the right path. But I really like a sentence in the V4 release announcement: "Not tempted by praise, not frightened by slander, follow the Way and act, rectify yourself properly." For any researcher, persisting in doing what you think is correct, keeping steady forward progress, solidly verifying every hypothesis, and minimizing the impact of external noise—this is the fastest direction forward!

04 Agent Systems: Environmental Openness Determines the Ceiling of Intelligence, Rather Than Computing Power or Data Scale

ZP: You wanted to build Agent systems very early. The first project you did after joining Northwestern for your PhD—what problem did you want to solve, and how is progress?

Wang Zihan: My core motivation for working on Agents is the hope that Agents can learn autonomously, without humans deliberately teaching them. This comes from my upbringing: my parents always guided me to learn on my own, which also made me lean toward RL-style thinking. I have always believed that the final form of RL will look significantly different from today's "experience plus gradient descent" pattern, with the core being that models improve themselves autonomously, which is what people later came to call self-evolving.

The first related research I did was on an Agentic scaling law. At the time we abstracted Agents as Markov Decision Processes (MDPs) with states and actions. The core idea was that judging whether an Agent understands the world cannot rest on the policy alone (given state s, output action a); the Agent must be able to do "cloze tests" on any link of the MDP, which probes its world-modeling ability: for example, predicting the next state from the current state and action, or inferring the action from a state and its successor state. This is also the core logic of our laboratory's current work, such as VAGEN (Vision Agent, NeurIPS 2025), which is essentially an implementation of this cloze-test idea.

Initially I tried to design a unified cloze-test framework but did not succeed. I then adjusted my approach, deciding to proceed step by step and get the policy right first. After starting my PhD, I found that the Verl framework could be used to build Agents, so I did a simple proof of concept (PoC), and RAGEN was born from it. The first version of RAGEN had little engineering optimization; its efficiency was inferior to SGLang at the time, which made me realize how much engineering optimization matters. The main task afterward was tackling that difficulty.

RAGEN's first version was released on January 27 last year. Coincidentally, the first anniversary of RAGEN on January 27 this year was also the 10th anniversary of the publication of DeepMind's AlphaGo paper. Over the past year I went through many research failures and also distilled a new set of research theses. I am now repositioning my work around them and starting new explorations. The first-generation RAGEN was also the core work of my first semester at Northwestern.
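As a rough illustration of the "cloze test" idea, the sketch below masks one slot of a single MDP transition (s, a, s') at a time. The toy 1-D environment, the `Transition` class, and the query names are assumptions made for this example; they are not taken from VAGEN or any of the lab's code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transition:
    state: int       # position on a 1-D line (toy environment, assumed)
    action: int      # -1 = move left, +1 = move right
    next_state: int


def step(state: int, action: int) -> int:
    """Toy deterministic dynamics: move one cell, clipped to the range [0, 4]."""
    return max(0, min(4, state + action))


def make_cloze_queries(t: Transition) -> List[Tuple[str, dict, int]]:
    """Hide one slot of (s, a, s') at a time; the hidden value is the answer."""
    return [
        # Policy: which action was taken in this state? (logged action, not necessarily optimal)
        ("policy", {"state": t.state}, t.action),
        # Forward dynamics: what state follows from (s, a)?
        ("forward", {"state": t.state, "action": t.action}, t.next_state),
        # Inverse dynamics: which action links s to s'?
        ("inverse", {"state": t.state, "next_state": t.next_state}, t.action),
    ]


transition = Transition(state=2, action=+1, next_state=step(2, +1))
queries = make_cloze_queries(transition)
for name, visible, answer in queries:
    print(name, visible, "->", answer)
```

An Agent that only answers the first query has a policy; one that also answers the forward and inverse queries is showing some grasp of how the environment responds to its actions.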

Image provided by the interviewee.

ZP: RAGEN's second generation mainly focuses on reasoning failure cases and RL failure modes. It also shifted from infrastructure-defining research to an observation-driven paper. In this paper, what were your main observations? What methods do you think can address what you observed?

Wang Zihan: We reviewed the thousands of experiments we logged on Weights and Biases last year and found that, across the different areas of reinforcement learning, multi-turn Agentic RL is far harder to advance than reasoning.

In reasoning domains such as mathematics and coding, the model's reasoning length grows with training, which intuitively reflects the model gradually learning to think deeply. But in multi-turn Agent RL, after testing more than 20 tasks, we could never reproduce this; instead, reasoning length kept declining. We believe length is only the surface; we need to understand more deeply what it actually reflects about the model's reasoning ability and decision logic.

ZP: Is the cause of this phenomenon related to the environments you defined? Are the environments you work in software engineering or code, or more like small games?

Wang Zihan: Our experimental environments lean toward out-of-distribution (OOD) scenarios, that is, environments the Agent is not familiar with. Code and mathematics tasks are usually trained heavily during pre-training and post-training, so the decline in reasoning length is milder when doing Agent RL on them, but such conventional tasks cover only part of what Agents are actually used for. Beyond them there are many real usage scenarios, such as GUI Agents (i.e., clicking through web pages) and games (like Sokoban), and these are all tasks Agents are unfamiliar with.

More challenging still, training cannot cover every benchmark, so OOD tasks inevitably show up at test time. In the SPA paper, our laboratory used state perplexity as an indicator for detecting OOD environments and found that perplexity on Sokoban was close to or above 200, far higher than on WebShop, mathematics, coding, and other tasks.
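For readers unfamiliar with the metric, here is a minimal sketch of perplexity computed from per-token log-probabilities. It uses the standard definition (the exponential of the negative mean log-probability); how the SPA paper tokenizes or serializes environment states is not described in this interview, so the inputs below are made-up numbers.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probs a model assigns to the tokens of a state description.
familiar_state = [math.log(0.5)] * 8      # every token predicted with p = 0.5
unfamiliar_state = [math.log(0.005)] * 8  # every token predicted with p = 0.005

print(round(perplexity(familiar_state), 3))    # about 2
print(round(perplexity(unfamiliar_state), 3))  # about 200
```

A perplexity near 200 therefore corresponds to the model giving each state token an average probability of roughly 1 in 200, which is one way to read "the Agent has not seen this kind of environment".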

Image provided by the interviewee.

Our goal is to deploy Agents in the real world, and real-world OOD scenarios are where problems are most likely, so they need focused study. Moreover, "reasoning length decline" is not limited to OOD tasks; in in-distribution tasks it can also occur when noise in the Agent's reasoning produces accidentally correct answers, which then shortens reasoning chains.

ZP: Does this phenomenon of "accidental correct answers then shorter reasoning chains" manifest consistently across different task types?

Wang Zihan: The differences are very obvious. Programming and mathematics tasks have extremely strong causal chains: if the process is correct, the result is correct. But in Sokoban, WebShop, and other Agent tasks, the task may be completed even with wrong steps, and state transitions in these tasks are mostly stochastic. I worked on GUI Agents during an internship and found that training long-horizon multimodal Agents is very difficult. For example, getting an Agent to book flights by clicking through web pages remains an unsolved problem. We observed that while model performance improves, reasoning becomes increasingly fragile, and from this we abstracted the "template collapse" phenomenon: models tend to output formulaic talk that does not change with the prompt.

So how should formulaic talk be defined? Essentially, it is reasoning chains that do not change when the question changes: whatever prompt is input, the model tends to repeat the same expressions. Once I realized this, I went looking for a theoretical framework to explain it. I went back to the foundations of information theory and read early papers, and finally realized that for input X and reasoning Z, the total diversity of reasoning, H(Z), consists of two parts. The first is "multiple solutions to the same question": the diversity of reasoning chain Z given input X, namely the conditional entropy H(Z|X). The second is "different solutions to different questions": whether the distribution of reasoning Z differs across inputs X, namely the mutual information I(X;Z). The identity H(Z) = H(Z|X) + I(X;Z) is the product of decades of information theory, and no one had tried using it to explain reasoning collapse in LLM Agents.
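The decomposition H(Z) = H(Z|X) + I(X;Z) can be checked numerically on a toy joint distribution. The sketch below is only an illustration: the prompts, the reasoning "templates", and the probabilities are invented, and real reasoning chains would need an estimator rather than an exact table.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Joint = Dict[Tuple[str, str], float]  # p(x, z): prompt x, reasoning chain z


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def decompose(joint: Joint) -> Tuple[float, float, float]:
    """Return (H(Z), H(Z|X), I(X;Z)) for a joint table p(x, z)."""
    p_x: Dict[str, float] = defaultdict(float)
    p_z: Dict[str, float] = defaultdict(float)
    for (x, z), p in joint.items():
        p_x[x] += p
        p_z[z] += p
    h_z = entropy(p_z.values())
    # H(Z|X) = sum over x of p(x) * H(Z | X = x)
    h_z_given_x = 0.0
    for x, px in p_x.items():
        conditional = [p / px for (xi, _), p in joint.items() if xi == x]
        h_z_given_x += px * entropy(conditional)
    return h_z, h_z_given_x, h_z - h_z_given_x


# Reasoning depends on the prompt: each question gets its own plan.
input_dependent = {("q1", "plan_a"): 0.5, ("q2", "plan_b"): 0.5}
# Reasoning ignores the prompt: the same two templates are used for every question.
collapsed = {("q1", "t1"): 0.25, ("q1", "t2"): 0.25,
             ("q2", "t1"): 0.25, ("q2", "t2"): 0.25}

print(decompose(input_dependent))  # H(Z) = 1 bit, all of it mutual information
print(decompose(collapsed))        # H(Z) = 1 bit, none of it mutual information
```

Both tables have the same total entropy H(Z), yet only the first carries information about the prompt. Measuring entropy alone therefore cannot tell diverse-but-generic reasoning apart from input-specific reasoning.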

Image provided by the interviewee.

In experiments, however, we observed that as training progresses, the mutual information between reasoning and input drops to almost nothing. We tried various ways to raise reasoning entropy, but they backfired: the model's outputs for different prompts became less and less distinguishable.

ZP: What attempts did you make in the RAGEN V1 stage regarding this kind of problem?

Wang Zihan: We tried prompt filtering: after trajectory rollout completes, the system checks whether rewards differ across the samples drawn for the same input. If all rewards for a prompt are identical, we consider that prompt unable to produce a training signal and drop it. It is like grading essays: if you write five essays and all get the same score, there is no contrast and nothing to learn from.

This was not our original idea; similar ideas such as DAPO emerged in industry around the same time. DAPO looks very promising, but it did not work on our Agent tasks, mainly because it only drops prompts whose scores are exactly identical across samples, whereas rewards in Agent tasks are often not binary (0/1): the reward schemes are complex and Agent sampling is highly random. So we adjusted our approach.

In RAGEN V1, we made a simple heuristic attempt and found that this might be related to reward variance, so we used reward variance to gauge how much a task is worth learning from. Larger reward variance means the Agent's current policy earns unstable rewards on the task, so we keep those samples; otherwise we drop them. V1 keeps a fixed top 25% or 50% of samples by variance. In the V2 stage we further explored why prompts become indistinguishable and found that the lower the reward variance of the training samples, the faster the mutual information between the reasoning process and the input declines.
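The exact-equality filter described above can be sketched in a few lines. This is a simplified illustration rather than DAPO's or RAGEN's actual implementation; the function name and the reward values are hypothetical.

```python
from typing import Dict, List


def drop_uniform_prompts(rewards: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Keep only prompts whose sampled rewards are not all identical."""
    return {prompt: r for prompt, r in rewards.items() if len(set(r)) > 1}


# Binary rewards: always-solved and never-solved prompts are removed.
binary = {"easy": [1, 1, 1, 1], "hard": [0, 0, 0, 0], "edge": [0, 1, 1, 0]}
print(sorted(drop_uniform_prompts(binary)))  # ['edge']

# Dense, noisy rewards: nothing is ever exactly identical, so nothing is removed.
dense = {"easy": [0.98, 0.97, 0.99, 0.98], "edge": [0.1, 0.9, 0.4, 0.7]}
print(sorted(drop_uniform_prompts(dense)))   # ['easy', 'edge']
```

The second case shows the limitation mentioned in the interview: with non-binary rewards, a prompt that carries almost no signal still passes an exact-equality check.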
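A minimal sketch of the fixed-fraction heuristic: rank prompts by the variance of their sampled rewards and keep the top share. The function name, the use of population variance, and the rounding rule are assumptions for illustration, not the RAGEN V1 code.

```python
import math
from statistics import pvariance
from typing import Dict, List


def keep_top_variance(rewards: Dict[str, List[float]], keep_fraction: float) -> List[str]:
    """Return the prompts with the highest reward variance, keeping a fixed fraction."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(rewards, key=lambda p: pvariance(rewards[p]), reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical rewards from four rollouts per prompt.
variance_batch = {
    "easy":   [0.98, 0.97, 0.99, 0.98],  # almost always solved: tiny variance
    "spread": [0.1, 0.9, 0.4, 0.7],      # unstable outcomes
    "rare":   [0.0, 0.0, 0.0, 0.2],      # almost never rewarded
    "flip":   [0, 1, 0, 1],              # coin-flip outcomes: largest variance
}
print(keep_top_variance(variance_batch, 0.5))   # ['flip', 'spread']
print(keep_top_variance(variance_batch, 0.25))  # ['flip']
```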

ZP: So what is affecting mutual information?

Wang Zihan: After some exploration, we found that mutual information is affected mainly by two types of noise: the regularization terms that algorithms introduce internally to maintain stability, and the random environmental noise in the rollout process itself.

The first is noise from regularization terms: when reward variance is extremely low, the advantage function approaches zero, and gradient updates are dominated by the regularization terms (KL divergence, entropy, and so on), which push the model toward outputting a single stable reasoning chain. The second is noise from stochastic environments: even completely different reasoning can, because of noise, lead to the same result, so the model concludes that different reasoning yields the same return and it might as well output one simple reasoning chain consistently, which ultimately makes reasoning chains uniform.

ZP: Do infrastructure-level bugs also fall within your definition of noise?

Wang Zihan: Last summer I read recent papers on tokenization mismatch and on FP16 versus BF16 (precision conversion between training and inference causing inconsistency) in RL for large language models, and found that over the past year there were all kinds of infrastructure problems in the underlying RL frameworks, yet training still succeeded, which shows the signal is strong enough.

Since noise at these various levels is hard to eliminate completely, we shifted our strategy from "eliminating noise" to "controlling signal": dropping the parts with weak signal and no learning value. This led to SNR-aware filtering (signal-to-noise-ratio-aware filtering), an adaptive training scheme. Its core is to evaluate the signal-to-noise ratio of samples in real time during trajectory rollout and update parameters only on samples with strong signal and incremental learning value, which both avoids noise interference and saves GPU resources and time. Specifically, we sort prompts by reward variance and, in the manner of the Top-P algorithm, keep the top-ranked prompts up to a cumulative-contribution threshold. So far this method has improved performance on many tasks: synthetic and real, single-turn and multi-turn, vision and text.
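To make the first noise source concrete, here is a toy calculation, assuming a simple mean-baseline advantage (reward minus the group mean). The `signal_share` measure and the regularizer weight are invented for this illustration and are only a rough proxy; they are not a real gradient computation or the paper's analysis.

```python
from statistics import mean
from typing import List


def advantages(rewards: List[float]) -> List[float]:
    """Mean-baseline advantage: each reward minus the average over the group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def signal_share(rewards: List[float], reg_weight: float) -> float:
    """Toy proxy: share of the update magnitude that comes from the reward term."""
    signal = mean(abs(a) for a in advantages(rewards))
    return signal / (signal + reg_weight)


low_variance = [0.50, 0.51, 0.49, 0.50]   # rewards barely differ across samples
high_variance = [0.0, 1.0, 0.0, 1.0]      # rewards clearly separate samples

print(round(signal_share(low_variance, reg_weight=0.01), 2))   # regularizer dominates
print(round(signal_share(high_variance, reg_weight=0.01), 2))  # reward term dominates
```

With nearly identical rewards, the advantages shrink toward zero and the fixed-size regularization term accounts for most of the update, which matches the mechanism described above.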

Image provided by the interviewee.

Compared with solutions like DAPO, which can only drop "no signal" samples, our SNR-aware filtering, built on the information-theoretic view of RL, gives engineers a knob (the Top-P threshold) for adjusting the "rejection zone" per task. For tasks with strong signal, reject fewer samples and learn more; for tasks with weak signal, reject more samples so that what is learned is high quality. As for the knob itself, compared with Top-K filtering, which always keeps a fixed top K prompts, Top-P can adaptively pick out the higher-signal samples at each training stage, giving higher training efficiency and better sample quality.

ZP: Rollout takes up most of the compute, and after filtering some samples are still discarded. Does that mean the compute spent on them is wasted?

Wang Zihan: Saving compute time is not the core value. On whether filtering needs more samples to converge, we ran a comparison: with the same number of samples drawn, models trained with filtering perform significantly better than those without, which shows that updates from low signal-to-noise samples are not just unhelpful but actively interfere.

At the time, RAGEN was under submission to NeurIPS, and the reviewers raised many questions. On top of that, my Agent RL experiments during the internship were not going as expected. Every day I would return to my desk and see that, under identical experimental settings, several different, nearly random result curves would appear. That deep sense of confusion left me very discouraged for a while. Fortunately, we eventually found a way to explain the instability in RL training, along with methods to make RL training more controllable.

ZP: To summarize, there are several reasons a prompt may show low variance: it may just happen to have one correct result, the task may be so difficult that the model always gets it wrong, or the task may be so easy that the model gets it right every time. Essentially, this shows that the prompt may not suit the current stage of training, so filtering it out is the more sensible choice; forcing it to have high variance through post-processing has no substantive meaning. Finally, what did you observe: do prompts with relatively large variance fall at the boundary of the model's capability? How do you define these cases?

Wang Zihan: Indeed, prompts with large variance sit right at the boundary of the model's capability, where performance is sometimes good and sometimes bad. Such samples give the best return on training cost, but this does not yet fully reveal the essence of real learning. In reality, tasks that are occasionally done correctly but mostly done wrong have the most learning value. The core problem is that the current RL paradigm relies on gradient descent, which distorts the learning process and makes it hard to distinguish sound logic from fluke results.

Ideally, learning happens on tasks with clean gradients and a high signal-to-noise ratio. Our research also shows that the larger the reward variance, the less likely the gradient signal is to be buried by noise. Even so, I have high expectations for a paradigm shift in RL this year; perhaps everyone will return to prompt research. I have recently become quite obsessed with it myself; it feels like a return to simplicity. These days, prompt optimization often works even better than gradient descent.

ZP: Coming back to RL, including Agentic RL and RL for mathematics: do you think this scaling route might stall? Is the field as a whole still in a high-growth stage? Is scaling no longer enough, so that a new paradigm is needed, or is scaling itself sufficient?

Wang Zihan: On scaling, I think the key question is what to scale. The industry today mostly scales compute, while some people value data more. Someone once asked me: what exactly is an Agent? I think whether something counts as an Agent depends on the physical or digital environment it is placed in. Give it a fully open computer environment, and it's OpenClaw; give it a restricted computer environment, and it's Claude Code or Codex; give it only a chat interface, and it's GPT. The degree of environmental openness determines where the Agent's intelligence falls on a scale from 0 to 1. Coming back to your question about the scaling law of Agent RL: I think the core is still what kind of environment you can provide.
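The Top-P knob can be sketched by analogy with nucleus sampling: sort prompts by reward variance and keep the smallest top-ranked set whose share of the total variance reaches the threshold. Treating "share of total variance" as the cumulative contribution is an assumption made here for illustration; the interview does not spell out the exact definition, and the function name and data are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def top_p_filter(rewards: Dict[str, List[float]], top_p: float) -> List[str]:
    """Keep the highest-variance prompts until their cumulative variance share >= top_p."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    scores = {prompt: pvariance(r) for prompt, r in rewards.items()}
    total = sum(scores.values())
    if total == 0:
        return []  # no prompt in this batch carries any reward contrast
    kept: List[str] = []
    cumulative = 0.0
    for prompt in sorted(scores, key=scores.get, reverse=True):
        kept.append(prompt)
        cumulative += scores[prompt] / total
        if cumulative >= top_p:
            break
    return kept


# Signal concentrated in a few prompts: most of the batch is rejected.
concentrated = {
    "easy":   [0.98, 0.97, 0.99, 0.98],
    "spread": [0.1, 0.9, 0.4, 0.7],
    "rare":   [0.0, 0.0, 0.0, 0.2],
    "flip":   [0, 1, 0, 1],
}
print(top_p_filter(concentrated, top_p=0.9))  # ['flip', 'spread']

# Signal spread evenly: the same threshold keeps every prompt.
even = {name: [0, 1, 0, 1] for name in ("p1", "p2", "p3", "p4")}
print(len(top_p_filter(even, top_p=0.9)))     # 4
```

With the same threshold, the number of retained prompts changes with how the signal is distributed in the batch, which is the adaptivity a fixed Top-K cutoff lacks.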

05 The Next Phase of Agents: Resource Adaptation—Delivering $10,000 Results with $10,000, and $1 Million Results with $1 Million

ZP: Besides scaling environment, what other aspects of the model itself do you think need improvement? For example, long context, generalization capabilities. Do you think generalization is inevitably achievable, or essentially impossible?

Wang Zihan: In my conversations with GPT, I find it imitates me faster and faster, which shows how much everyone values memory capabilities. I think what is truly hard to crack now is tasks close to real decision-making in human society. The real world lacks RL training environments and opportunities for trial and error; being able to collect even a small amount of offline data is already hard.

Of course, we are also trying to build environments, working with other researchers on scenarios close to reality. We are cooperating with teams from Yale, MIT, and NUS on the company O2 AI (o2tech.ai), developing Agent harnesses that can plug deeply into vertical enterprise environments and, on that basis, building a "resource-adaptive" full-stack Agent system (Infrastructure / Benchmark / Service / Research). We build Agents for electronics manufacturing and recycling supply chains that can interact directly with real-time enterprise data, understand enterprise resources (such as inventory, time, resources, personnel), and accordingly guide enterprise decisions, such as what to do when warehouses are full or when inventory needs clearing. Interaction grounded in real business logic has great practical value; I think it is a key link that future Agent development cannot bypass.

Image provided by the interviewee

Agents are gradually moving from an "execution role" to a "decision-making role" in human society, and building Agents with decision-making capabilities will become increasingly important. Why must we let Agents, rather than traditional models, manage these complex enterprise affairs in the future? First, Agents can make decisions that require richer context. When humans judge whether a decision is reasonable, they do not merely calculate returns from historical data; they must also consider policy changes, the intentions of business partners, and many other unstructured variables, which are areas where traditional models struggle. Therefore, we must rely on Agents.

In reality, there are not many opportunities for trial and error, which makes building sandbox environments unavoidable. This is why we are currently developing resource management Agents. Our research focuses specifically on how Agents should behave under different budget constraints. Many task setups give you a sum of money and ask you to accomplish the task as effectively as possible. But more importantly, a truly resource-adaptive person or Agent can deliver ten thousand dollars' worth of results when given ten thousand dollars, and one million dollars' worth of results when given one million dollars. What we hope to build is this kind of highly resource-adaptive Agent. In reality, every department starts with unequal funding and resources and faces random constraints. How to make Agents use resources intelligently under such constraints is a question well worth exploring, yet it currently has almost no corresponding benchmark. This is why companies like O2 AI, which build environments and Agent systems using real enterprise data, better align with actual human decision-making needs.

Token generation itself is a form of resource consumption. Nowadays, many code-based Agents—even when simply asked to say "hello"—might consume 10,000 or 20,000 tokens, which is quite unreasonable. In response, many researchers are currently exploring how to optimize inference costs.

But I believe current research has not yet touched upon the more essential proposition: Budget is not about spending less; the core is efficient matching of input-output ratios. The true challenge is delivering results commensurate with the money invested. Most current work on efficiency and budget constraints suffers from deviation—many approaches pursue "the less the better," whereas the true direction should be efficiently converting existing resources into target returns. This represents a completely different, and more realistic, optimization approach for application scenarios.

ZP: In the future, would you tend to stay in academia or industry? What is your perspective on the logic behind both?

Wang Zihan: Regardless of where I am, I want to do research. Doing research itself is joyful—it is the process of discovering new problems and determining which problems matter more—so wherever I am, I will persist in doing this.

ZP: If you were to rank the three most important problems in the current LLM/Agent field, which would you choose?

Wang Zihan: The first is resource management. As previously mentioned, when we want Agents to participate in high-impact decision-making, resource management becomes their foundation for survival. In actual Agent deployment, entering any new environment (such as enterprise ERP) requires learning that environment's resource management logic.

This naturally extends to the second problem: world model. The industry has many definitions of world models; our laboratory focuses more on the Agent's own world model—specifically, whether it can autonomously judge what impact an action will produce. Current mainstream RL algorithms still struggle to give Agents systematic access to this kind of explicit predictive capability. Budget itself is also a world model; you must predict how much overhead and hidden costs an action will incur.

World Model Sudoku grid meme, created by Wang Zihan

Another direction that excites me greatly is deep value estimation by Agents. O2 AI develops vertical enterprise decision-making Agents that require not only general decision-management capabilities but also rely on vertical domain knowledge to accurately assess residual values of electronic components: the same batch of materials has completely different residual values depending on market cycles, inventory status, disassembly paths, and sales channels. This vertical value estimation capability might even prove transferable to gaming, trading markets, and other scenarios in the future. Pricing is an excellent entry point because it is verifiable—using massive transaction closing prices as anchors, Agents learn to predict closing prices and extract judgment logic. Although market fluctuations introduce noise, RL itself is a process that balances strategy learning with denoising; the more judgment paradigms accumulated through continuous learning, the faster Agents evolve when facing new scenarios.

ZP: Does this mean achieving true real-time competitive-level AI requires co-design of algorithm, infra, and the entire I/O?

Wang Zihan: Indeed, full-stack level coordination is required—a challenge of universal significance. This kind of real-time responsiveness is a capability that humans possess but Agents currently lack.

Beyond this, continual learning is another crucial topic this year. We need to ask why humans learn things faster and faster; with AI assistance in particular, picking up a new field is becoming increasingly quick.

How do we give Agents this ability to learn faster and faster? The core lies in allowing Agents to internalize and transfer accumulated experiences to entirely new tasks through long-term processing of diverse tasks. Take myself as an example: recently I have been researching video generation. Although I previously only worked on video understanding rather than generation, my learning speed for this new field is much faster than before. This speed increase is essentially a manifestation of continual learning capability. To give Agents this ability requires a diverse test bed for them to keep learning. My current thinking is to let Agents actually play those games. If there truly exists an Agent capable of beating all games in the world, in this process, it must have learned something very meta.

ZP: I just realized a key issue. Currently, the most mature Agent environments, such as code and mathematics, have verifiable rewards and can be closed-looped through chain-of-thought; gaming environments feature strong interaction and low trial-and-error costs. But once we reach real scenarios like enterprise decision-making and budget management, training environments are extremely scarce, and trial-and-error requires real money and incurs real costs—much like the dilemma in robotics: real data is too difficult to obtain, forcing reliance on simulation, yet gaps exist between simulation and reality. Do you think building higher-fidelity simulators is valuable for high-risk, high-cost Agent tasks?

Wang Zihan: I tend to examine this from an algorithm evolution perspective. Humans themselves possess few-shot learning capabilities. While building high-realism environments is certainly important, the real world is the perfect experimental field. Moreover, simulation environments are not zero-cost—cheap simulations diverge greatly from the real world, as evidenced by the robotics field. This forces us to solve the sample efficiency problem; current RL frameworks still have tremendous room for improvement. I previously used the Thinking Machines API—initially given a few hundred dollars in credits, they were completely exhausted before I finished even one round. RL runs 500 steps, with each step potentially generating millions of tokens, costing one to two dollars, which is extremely high.
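A back-of-the-envelope check of these figures, reading "one to two dollars" as the cost of a single RL step (an interpretation, since the interview does not say so explicitly). The credit amount is a hypothetical stand-in for "a few hundred dollars".

```python
steps_per_run = 500
cost_per_step_usd = (1.0, 2.0)   # assumed range per step, from the interview
credits_usd = 300.0              # hypothetical value for "a few hundred dollars"

low_total, high_total = (steps_per_run * c for c in cost_per_step_usd)
print(f"one run: ${low_total:,.0f} to ${high_total:,.0f}")

# How many steps the credits cover at the expensive and the cheap end of the range.
steps_covered = [int(credits_usd // c) for c in reversed(cost_per_step_usd)]
print(f"steps covered by credits: {steps_covered[0]} to {steps_covered[1]}")
```

Under these assumptions a single 500-step run costs more than the whole credit allowance, which is consistent with the credits running out before one round finished.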

API cost visualization

Methods that are hundreds or thousands of times more efficient than existing RL will certainly emerge in the future, allowing Agents to learn continuously and efficiently. We are still far from that ultimate Agent. Now, regarding environments versus algorithms: Environment design is essentially a trade-off—low-complexity environments cannot support Agents generalizing to real high-cost scenarios, while high-complexity environments require higher costs. Therefore, the breakthrough must lie in the evolution of Agent learning speed, with the core being reasoning—reasoning allows it to learn faster and faster, grasping more essential commonalities between different tasks.

Note: Wang Zihan is a Computer Science PhD student at Northwestern, with main research interests in Agent RL. He graduated from the Gaoling School of Artificial Intelligence at Renmin University of China in 2024, participated in DeepSeek-V2 research, and has research experience at Microsoft and NVIDIA. To date, he has published over 20 papers, with results appearing at ICLR, NeurIPS, EMNLP, CVPR, and other conferences, accumulating over 1,600 citations. He has received honors including ICCV 2025 SP4V Best Paper and NeurIPS 2025 LAW Outstanding Paper. He led and participated in developing multiple Agent training and evaluation frameworks including RAGEN, VAGEN, and MindCube, accumulating over 10,000 GitHub Stars. His work has received attention from Stanford HAI, MIT Tech Review, Forbes, Financial Times, and other outlets. His personal technical account on X has over 20,000 followers, with representative threads accumulating over one million views.

Please note that this interview content has been carefully edited and approved by Wang Zihan. We welcome readers to interact through comments and share your views on this interview. Z Potentials will continue to provide more interviews with frontline technical explorers in artificial intelligence, global markets, robotics, and other fields.
