Lin Junyang Speaks Out for the First Time After Leaving Alibaba: Reviewing Qwen's Detours, Pointing to AI's New Path

Mengchen, from Aofei Temple | QbitAI

Lin Junyang has broken his silence for the first time since leaving Alibaba's Qwen team.

Rather than addressing the departure controversy or announcing his next move, he published a long essay exploring the transition from "Reflections on the Reasoning Model Era" to "Reflections on the Agent Era."

Lin Junyang's post

While the entire essay is about technology and AI's future direction, it cannot hide his reflections on Qwen's technical choices.

He candidly admits in the essay: "We did not get everything right."

We did not get everything right

The Qwen team once had an ambitious vision: merging the "thinking" and "instruct" modes into a single model.

Qwen3 was "one of the clearest public attempts" in this direction, introducing a hybrid thinking mode.

But Lin Junyang remains unsatisfied today. He feels that ultimately, "thinking" became verbose and hesitant, while "instruct" became less crisp, less reliable, and more expensive.

In his view, true successful merging is not stuffing two personas into one checkpoint, but giving the model a continuous spectrum of reasoning effort.

Continuous reasoning spectrum

Looking ahead, he offers this judgment: the Reasoning Thinking era of the past two years has completed its mission.

OpenAI's o1 and DeepSeek-R1 proved reasoning capabilities can be trained and reproduced, teaching the industry a crucial insight:

If you want to do reinforcement learning on language models at scale, you need deterministic, stable, and scalable feedback signals.

Since the first half of 2025, almost everyone has been studying the same things: how to make models spend more inference time, how to train stronger rewards, how to expose or control these extra reasoning efforts.

Now the crucial question is: what's next?

Lin's answer is Agentic Thinking—agent-style thinking that constantly revises plans through interaction with the environment.

He lists the key differences between Agentic Thinking and Reasoning Thinking:

Determining when to stop thinking and start acting. Reasoning models finish after outputting an answer, while agents must constantly switch between thinking and acting.

Choosing which tools to call and in what order. This is not simple function calling, but a dynamic planning problem.

Digesting noisy and partial observations from the environment. The real world won't give you perfect feedback.

Revising plans after failure, rather than starting over from scratch.

Maintaining coherence across multiple dialogue turns and multiple tool calls.

He summarizes in one sentence:

From "thinking longer" to "thinking to act."

In Lin Junyang's view, future competitiveness will come not just from better models, but from better environment design, stronger harness engineering, and orchestration between multiple agents.

From training models, to training agents, to training systems.

(The following is a translation of Lin Junyang's original text.)

From "Reasoning Thinking" to "Agentic Thinking"

The past two years have redefined how we evaluate models and what we expect from them.

OpenAI's o1 showed that "thinking" could become a first-class capability—something you specifically train for and expose to users.

DeepSeek-R1 proved that reasoning-oriented post-training could be reproduced and scaled outside the original lab.

OpenAI described o1 as a reasoning product line trained with RL to "think before answering," while DeepSeek positioned R1 as an open-source reasoning model competitive with o1.

DeepSeek-R1 vs OpenAI o1 comparison

That phase was important.

But in the first half of 2025, the industry devoted most of its energy to Reasoning Thinking: how to make models spend more compute during inference, how to train with stronger rewards, how to expose or control these extra reasoning efforts.

Now the question is: what's next?

I believe the answer is Agentic Thinking—thinking for the sake of action, thinking through interaction with the environment, and continuously updating plans based on feedback from the real world.

1. What the Rise of o1 and R1 Really Taught Us

The first wave of reasoning models taught us:

If we want to scale reinforcement learning on language models, we need deterministic, stable, and scalable feedback signals.

Mathematics, code, logic, and other verifiable domains become crucial because reward signals in these scenarios are far stronger than general preference supervision.
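A verifiable reward in these domains can be as simple as an exact-match check on a math answer or a pass/fail run of attached unit tests. A minimal illustrative sketch (the function names and the binary reward scheme are my own, not from the essay):

```python
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, reference: str) -> float:
    """Binary reward: 1.0 iff the final answer matches the reference exactly."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(generated_code: str, test_snippet: str, timeout: float = 5.0) -> float:
    """Binary reward: 1.0 iff the generated code passes the attached tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_snippet)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

print(math_reward("42", " 42 "))  # → 1.0
print(code_reward("def add(a, b):\n    return a + b",
                  "assert add(2, 3) == 5"))  # → 1.0
```

Unlike a learned preference model, these signals are deterministic and cheap to scale, which is exactly the property the paragraph above identifies.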

They allow RL to optimize for correctness rather than merely for "seeming reasonable."

Infrastructure also becomes crucial. Once models are trained to reason over longer trajectories, RL is no longer a lightweight add-on to supervised fine-tuning (SFT); it becomes a systems engineering problem.

You need large-scale trajectory sampling (rollout), high-throughput verification, stable policy updates, and efficient sampling.

The rise of reasoning models is a story of modeling, but equally a story of infrastructure.

OpenAI described o1 as a reasoning product line trained with RL; DeepSeek-R1 subsequently validated this direction further, showing how much specialized algorithmic and infrastructure work reasoning-oriented RL requires.

The first major shift: from scaling pre-training to scaling inference-oriented post-training.

2. The Real Problem Was Never Just "Merging Thinking and Instruction"

In early 2025, many of us on the Qwen team had an ambitious blueprint in mind.

The ideal system should unify thinking mode and instruction mode, and support adjustable reasoning intensity, similar to low/medium/high reasoning-effort settings.

Ideally, it should automatically infer appropriate reasoning effort from prompts and context—letting the model decide when to answer directly, when to think a bit longer, and when to invest heavy computation on truly difficult problems.
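One way to picture this is a router that maps each prompt to a point on the effort spectrum. A toy heuristic sketch, where the difficulty signals, thresholds, and token budgets are all invented for illustration (a real system would learn this policy rather than hand-code it):

```python
# Toy router: estimate how much reasoning effort a prompt deserves.
# Signal words, thresholds, and budgets are invented for illustration.
HARD_SIGNALS = ("prove", "optimize", "debug", "derive", "step by step")

def reasoning_budget(prompt: str) -> int:
    """Map a prompt to a thinking-token budget on a coarse effort spectrum."""
    p = prompt.lower()
    score = sum(word in p for word in HARD_SIGNALS) + len(p) / 500
    if score < 1:
        return 0       # answer directly, no visible thinking
    if score < 2:
        return 1024    # think a little
    return 8192        # invest heavy computation on truly hard problems

print(reasoning_budget("What is the capital of France?"))  # → 0
print(reasoning_budget("Prove that sqrt(2) is irrational, step by step."))  # → 8192
```

The point of the sketch is the output type: a budget on a spectrum, not a boolean thinking switch.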

Conceptually, this direction was correct. Qwen3 was one of the clearest public attempts in this direction.

It introduced a "hybrid thinking mode," supporting both thinking and non-thinking behaviors within one model family, emphasizing controllable thinking budgets, and designing a four-stage post-training pipeline—explicitly including a "mode fusion" step after long CoT cold-start and reasoning RL.
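In Qwen3's public interface, the mode is exposed as an `enable_thinking` template flag plus soft switches (`/think`, `/no_think`) inside the prompt. The following is a simplified stand-in for that dispatch logic, not the actual Qwen chat-template code:

```python
def resolve_thinking_mode(user_message: str, enable_thinking: bool = True):
    """Decide whether this turn runs in thinking mode.

    Mirrors the public Qwen3 convention: an `enable_thinking` default that
    soft switches `/think` and `/no_think` in the prompt can override.
    Simplified stand-in for illustration only.
    """
    mode = enable_thinking
    if "/no_think" in user_message:
        mode = False
    elif "/think" in user_message:
        mode = True
    cleaned = user_message.replace("/no_think", "").replace("/think", "").strip()
    return cleaned, mode

print(resolve_thinking_mode("Explain quicksort /no_think"))
# → ('Explain quicksort', False)
```

The hard part, as the essay argues next, is not this switch; it is making one checkpoint behave well on both sides of it.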

Alibaba's Qwen3: Open-weight LLMs with hybrid thinking

But merging is easier said than done; the real difficulty lies in data.

When people talk about merging thinking and instruction, they often first think about model-side compatibility: can one checkpoint support both modes, can one dialogue template switch between them, can one inference serving architecture expose the right switches.

The deeper problem is that the two modes have fundamentally different data distributions and behavioral objectives.

In trying to balance model merging with improving the quality and diversity of post-training data, we did not get everything right.

During iterations, we also closely monitored how users actually employed thinking and instruction modes. A strong instruction model is typically rewarded for being concise and direct, having canonical formats, and low latency—especially for repetitive, high-volume enterprise tasks like rewriting, annotation, templated support, structured extraction, and operational QA.

A strong thinking model is rewarded for spending more tokens on hard problems, maintaining coherent intermediate reasoning structures, exploring alternative paths, and preserving enough internal computation to genuinely improve final accuracy.

These two behavioral patterns pull in opposite directions.

If merged data is not carefully curated, the result is usually mediocre on both sides: "thinking" behavior becomes noisy, bloated, or indecisive, while "instruction" behavior becomes less crisp, less reliable, and more expensive than business users actually need.

In practice, keeping them separate remains more attractive.

In the second half of 2025, following Qwen3's initial hybrid framework, version 2507 released separate Instruct and Thinking updates, including separate 30B and 235B versions.

In commercial deployments, many customers still need high-throughput, low-cost, highly controllable instruction behavior for batch operations. For these scenarios, the benefits of merging are not obvious. Separating product lines allows teams to focus more intensively on solving data and training problems for each respective mode.

Other labs chose the opposite route.

Anthropic publicly advocated for an integrated model philosophy: Claude 3.7 Sonnet was launched as a hybrid reasoning model where users could choose between normal response and extended thinking, with API users able to set thinking budgets. Anthropic made clear they believe reasoning should be an integrated capability, not a separate model.

GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, programming, and agent capabilities.

DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid reasoning.

The key question is whether such merging is organic.

If thinking and instruction are merely stuffed into the same checkpoint but still operate like two awkwardly spliced personas, the product experience remains unnatural.

Truly successful merging requires a fluid continuous spectrum of reasoning effort. The model should be able to express multiple levels of reasoning intensity, ideally making adaptive choices.

GPT-style reasoning-effort control points in this direction: a strategy for compute allocation, not an either-or switch.

3. Why Anthropic's Direction Is a Useful Correction

Anthropic's public messaging around Claude 3.7 and Claude 4 has been restrained.

They emphasize integrated reasoning, user-controllable thinking budgets, real-world tasks, programming quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 built on this further, allowing reasoning and tool use to alternate, while Anthropic emphasizes programming, long-running tasks, and agent workflows as primary goals.

Producing longer reasoning trajectories does not automatically make a model smarter.

In many cases, excessive visible reasoning is precisely a signal of inefficient compute allocation. If a model tries to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress information, or failing to take action.

Anthropic's trajectory implies a more disciplined perspective: thinking should be shaped by target workloads.

If the goal is programming, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the goal is agent workflows, then thinking should improve execution quality over long horizons, rather than producing flashy intermediate text.

This emphasis on targeted utility points to the bigger picture:

We are moving from an era of training models to an era of training agents.

We explicitly wrote this in the Qwen3 blog—"We are moving from an era focused on training models to one where training agents is central"—and linked future RL progress to environmental feedback for long-horizon reasoning.

Qwen3 blog excerpt

An agent is a system capable of formulating plans, deciding when to act, using tools, perceiving environmental feedback, revising strategy, and running continuously over long periods. Its defining characteristic is closed-loop interaction with the world.

4. What "Agentic Thinking" Actually Means

Agentic thinking represents a different optimization objective.

Reasoning thinking is typically measured by the quality of internal reasoning before the final answer: can the model solve theorems, write proofs, generate correct code, pass benchmarks. Agentic thinking asks: can the model make continuous progress through interaction with the environment?

The core question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking must handle several things that pure reasoning models can largely avoid:

Deciding when to stop thinking and take action.

Choosing which tools to call and in what order.

Digesting noisy or partial observations from the environment.

Revising plans after failure.

Maintaining coherence across multiple interactions and multiple tool calls.

An agentic-thinking model is one that reasons through action.
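The loop this implies can be sketched in a few lines. Everything here (the trajectory format, tool names, stop condition) is illustrative, not a real agent framework:

```python
# Minimal think-act-observe loop. The policy and tools are stubs;
# a real agent would call a model and real tool servers.
def run_agent(task, policy, tools, max_steps=10):
    history = [("task", task)]
    for _ in range(max_steps):
        action = policy(history)              # think: choose the next step
        if action["type"] == "finish":        # decide when to stop and answer
            return action["answer"], history
        tool = tools[action["tool"]]
        observation = tool(action["input"])   # act, then perceive feedback
        history.append((action["tool"], observation))  # revise with new info
    return None, history                      # step budget exhausted

# Stub policy: search once, then answer with whatever it observed.
def stub_policy(history):
    if len(history) == 1:
        return {"type": "tool", "tool": "search", "input": history[0][1]}
    return {"type": "finish", "answer": history[-1][1]}

tools = {"search": lambda q: f"result for: {q}"}
answer, trace = run_agent("qwen3 release date", stub_policy, tools)
print(answer)  # → result for: qwen3 release date
```

Each of the five difficulties listed above lives in one line of this loop: the finish check, the tool choice, the observation, the history update, and the coherence of `history` across turns.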

5. Why Agent RL Infrastructure Is Harder

Once the optimization target shifts from solving benchmark problems to solving interactive tasks, the RL tech stack must change too; classic reasoning RL infrastructure is no longer sufficient.

In reasoning RL, you can usually treat sampled trajectories as largely self-contained sequences with relatively clean evaluators.

In agent RL, the policy is embedded in a larger harness framework: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and scheduling frameworks.

The environment is no longer a static validator; it becomes part of the training system itself.

This creates a new systems requirement: training and inference must be decoupled more thoroughly.

Without this decoupling, sampling throughput collapses.

Imagine a coding agent that needs to execute generated code on a live testing framework: the inference side stalls waiting for execution feedback, the training side "starves" for lack of completed trajectories, and the entire pipeline's GPU utilization falls far below what you'd expect for classic reasoning RL.

Add tool latency, partial observability, and stateful environments, and these inefficiencies are amplified. The result is that experiments become painfully slow long before you reach target capability levels.
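The decoupling described above can be pictured as an asynchronous producer-consumer split: rollout workers push finished trajectories into a buffer while the trainer consumes them at its own pace. A toy sketch using threads (the buffer size, worker counts, and fake environment latency are made up):

```python
import queue
import threading
import time

trajectory_buffer = queue.Queue(maxsize=64)

def rollout_worker(worker_id, n_episodes):
    """Inference side: interacts with a (slow) environment, emits trajectories."""
    for ep in range(n_episodes):
        time.sleep(0.01)  # stand-in for tool latency / code execution
        trajectory_buffer.put({"worker": worker_id, "episode": ep, "reward": 1.0})

def trainer(n_updates, results):
    """Training side: consumes whatever trajectories are ready."""
    for _ in range(n_updates):
        traj = trajectory_buffer.get()  # never pins the learner to one slow env
        results.append(traj["reward"])

results = []
workers = [threading.Thread(target=rollout_worker, args=(i, 4)) for i in range(4)]
learner = threading.Thread(target=trainer, args=(16, results))
for t in workers + [learner]:
    t.start()
for t in workers + [learner]:
    t.join()
print(len(results))  # → 16
```

Without this split, the learner would wait synchronously on the slowest environment; with it, slow rollouts only thin the buffer rather than stalling updates outright.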

The environment itself becomes a first-class research object.

In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, authenticity, coverage, difficulty, state diversity, feedback richness, cheat-resistance, and scalability of trajectory generation.

Building environments is becoming a genuine startup category, not a side project. If agents are trained to run in production-like environments, then the environment is part of the core competency stack.
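A minimal interface for such an environment, with two of the qualities above made explicit, might look like this Gym-style sketch (the task, class name, and reward scheme are invented, not from any specific framework):

```python
import random

class TicketTriageEnv:
    """Toy production-like environment: route support tickets to a queue.

    Illustrates two qualities from the list above: seeded determinism
    (stability) and feedback that is informative but partial (the agent
    sees ticket text, never the hidden ground-truth label directly).
    """
    QUEUES = ("billing", "bugs", "other")

    def __init__(self, seed=0):
        self.rng = random.Random(seed)  # seeded → reproducible trajectories

    def reset(self):
        self.label = self.rng.choice(self.QUEUES)
        return f"customer ticket mentioning {self.label}"  # partial view only

    def step(self, action):
        reward = 1.0 if action == self.label else 0.0
        done = True
        return reward, done

env = TicketTriageEnv(seed=42)
obs = env.reset()
action = next(q for q in env.QUEUES if q in obs)  # trivial keyword policy
reward, done = env.step(action)
print(reward)  # → 1.0
```

Every other property on the list (coverage, difficulty, cheat-resistance, trajectory throughput) is a design axis on exactly this `reset`/`step` surface.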

6. The Next Frontier Is More Accessible Thinking

My expectation is that agentic thinking will become the mainstream form of thinking.

I think it may eventually replace most of the old-style "static monologue" reasoning thinking—those overly long, isolated internal reasoning trajectories that try to compensate for lack of interaction by outputting more and more text.

Even for very difficult math or programming tasks, a truly advanced system should have the right to search, simulate, execute, check, verify, and revise. The goal is to solve problems robustly and efficiently.

The biggest challenge in training such systems is reward hacking.

Once models gain meaningful tool access, reward hacking becomes far more dangerous.

A model that can search might learn to directly search for answers during RL training. A coding agent might exploit future information in code repositories, abuse logs, or discover shortcuts that void tasks. An environment with hidden leakage might make a policy appear superhuman when it's actually just being trained to cheat.
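One concrete mitigation is to audit trajectories for leakage before they reach the reward: for example, zero out any rollout whose tool observations already contain the reference answer. A simplified filter sketch (the trajectory format and function names are invented):

```python
def is_leaked(trajectory, reference_answer):
    """Flag a rollout whose tool observations already contain the answer.

    A crude guard against one reward-hacking mode: the policy "solving"
    a task by retrieving the reference answer through its search tool.
    """
    return any(step["type"] == "tool_result" and reference_answer in step["content"]
               for step in trajectory)

def audited_reward(trajectory, final_answer, reference_answer):
    if is_leaked(trajectory, reference_answer):
        return 0.0  # drop or zero out contaminated rollouts
    return 1.0 if final_answer.strip() == reference_answer else 0.0

clean = [{"type": "tool_result", "content": "docs about sorting"}]
dirty = [{"type": "tool_result", "content": "the answer is 42, says the forum"}]
print(audited_reward(clean, "42", "42"))  # → 1.0
print(audited_reward(dirty, "42", "42"))  # → 0.0
```

Substring matching is obviously too crude for production; the point is that the evaluator must inspect the whole trajectory, not just the final answer.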

This is where the agent era becomes far more subtle than the reasoning era.

Better tools make models more useful, but also expand the attack surface for spurious optimization.

We should expect that the next batch of serious research bottlenecks will come from environment design, evaluator robustness, cheat-proofing protocols, and more principled interfaces between policies and the world. Nevertheless, the direction is clear. Tool-empowered thinking is more useful than isolated thinking, and more likely to genuinely improve productivity.

Agentic thinking also implies the rise of harness engineering. Core intelligence will increasingly come from how multiple agents are organized:

An orchestrator responsible for planning and task distribution, multiple specialist agents acting like domain experts, and sub-agents executing narrower tasks—helping control context, avoid information pollution, and maintain isolation between different levels of reasoning.
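That organization can be sketched as a tiny dispatch layer: the orchestrator splits the task, and each specialist runs with its own isolated context. The agent names and routing rule below are illustrative, not any real framework:

```python
# Toy orchestrator: plan → route subtasks to specialists → merge results.
# Specialist names and the routing rule are invented for illustration.
SPECIALISTS = {
    "code":   lambda subtask, ctx: f"[code agent] handled: {subtask}",
    "search": lambda subtask, ctx: f"[search agent] handled: {subtask}",
}

def orchestrate(task, plan_fn):
    results = []
    for subtask, specialist in plan_fn(task):
        context = {"subtask": subtask}  # fresh, isolated context per sub-agent
        results.append(SPECIALISTS[specialist](subtask, context))
    return "\n".join(results)           # orchestrator merges the outputs

def naive_planner(task):
    # Trivial plan: look things up first, then write the code.
    return [(f"research: {task}", "search"), (f"implement: {task}", "code")]

print(orchestrate("add retry logic to the client", naive_planner))
```

The isolation lives in the per-subtask `context`: no specialist sees another's intermediate output unless the orchestrator deliberately passes it along, which is what keeps context pollution contained.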

The future direction is: from training models to training agents, from training agents to training systems.

Conclusion

The first phase of the reasoning wave established something important:

When feedback signals are reliable and the infrastructure can support it, RL on top of language models can produce qualitative leaps in cognitive capability.

The deeper shift is from reasoning thinking to agentic thinking:

From thinking longer to thinking to act. The core object of training has changed—it has become the model plus environment system, or more specifically, the agent and the harness framework around it.

This changes which research elements matter most:

Model architecture and training data certainly still matter, but environment design, trajectory sampling infrastructure, evaluator robustness, and coordination interfaces between multiple agents are equally crucial.

This also changes the definition of "good thinking":

The most useful trajectories are those that can sustain effective action under real-world constraints—not the longest or flashiest ones.

This also changes the source of competitive advantage:

In the reasoning era, advantage came from better RL algorithms, stronger feedback signals, and more scalable training pipelines.

In the agent era, advantage will come from better environments, tighter training-inference coupling, stronger harness engineering, and the ability to close the loop between model decisions and the consequences of those decisions.

Original URL: https://x.com/JustinLin610/status/2037116325210829168?s=20

— End —

