Not All Tokens Are Equal! Google Proposes True Deep Thinking: Long Chain of Thought ≠ Deep Reasoning


Source | Quantum Bit

Author | Wen Le

It turns out AI also pads word counts.

Does a longer chain of thought in large models mean stronger reasoning capabilities? Google says No—

Token count and reasoning quality show no positive correlation, because not all tokens are equal: some are mere filler, and only deep-thinking tokens are truly useful.

A new study discards the "word count" theory and introduces a new standard for measuring model reasoning quality, the Deep Thinking Ratio (DTR), specifically designed to catch whether a model is truly thinking or just padding.


Based on DTR, the study also proposes the Think@n strategy, enabling reasoning models like GPT-OSS and DeepSeek-R1 to maintain accuracy while cutting computing costs by half.

A Long Chain of Thought Does Not Equal Good Reasoning

For a long time, a common belief has been that the longer the chain of thought, the better.

The logic behind this is straightforward: more reasoning steps equal more thorough thinking, which equals more accurate answers.

Consequently, many developers have started piling on computing power to pursue longer reasoning trajectories.

Google's research team tested eight model variants, including GPT-OSS, DeepSeek-R1, and Qwen3, across four datasets: AIME 2024, AIME 2025, HMMT 2025, and GPQA-Diamond.

The results showed that the average correlation coefficient between token length and accuracy was -0.54: a negative correlation.


In other words, in some cases, the longer the chain of thought, the more likely the reasoning is to go off track, even getting stuck in logical loops or over-reasoning.

So the question arises: If length cannot be relied upon, how do we judge whether a model is truly thinking?

Google's perspective this time is quite interesting; instead of looking at surface output, they directly monitor the model's internal workings at every layer.

Research found that tokens generated by models can actually be divided into two categories:

  • Function words: Such as "and", "is", and "of". The model settles on these quickly in its shallow layers; they are filler that requires no deep thought.
  • Deep-thinking words: Such as "the calculation result is 10" or "Option A". The model keeps revising these in its deeper layers, with their prediction distributions continuing to shift, a sign that it is genuinely working through the problem.

The team used JSD (Jensen-Shannon Divergence) to measure the difference in prediction distributions across layers. If a token's prediction only stabilizes in deeper networks, it is judged as a deep thinking token.
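That criterion can be sketched roughly as follows, assuming we already have one next-token probability distribution per layer (e.g. from a logit-lens style readout). The `threshold` value and the choice to inspect only the deeper half of the layers are illustrative assumptions, not the paper's exact recipe:

```python
import math

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two distributions."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    ps, qs = sum(p), sum(q)
    p = [x / ps for x in p]
    q = [x / qs for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_deep_thinking_token(layer_dists, threshold=0.1):
    """Flag a token as deep-thinking if its next-token distribution keeps
    shifting between consecutive deep layers (high JSD), rather than
    settling early in the shallow layers."""
    deep = layer_dists[len(layer_dists) // 2:]   # inspect the deeper half
    shifts = [jsd(a, b) for a, b in zip(deep, deep[1:])]
    return max(shifts, default=0.0) > threshold
```

A token whose distribution is identical at every layer is classified as shallow, while one that is still being revised in the final layers crosses the JSD threshold.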


Building on this, they proposed the Deep Thinking Ratio (DTR), which is the proportion of deep thinking tokens in the complete generated sequence.

The higher this ratio, the more the model focuses on core reasoning without consuming computing power on meaningless content.
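Over a generated sequence, the ratio itself is just the fraction of tokens flagged as deep-thinking; a minimal sketch, assuming the per-token flags come from the layer-wise JSD test:

```python
def deep_thinking_ratio(token_flags):
    """DTR: fraction of generated tokens flagged as deep-thinking.
    `token_flags` holds one boolean per token, True if that token's
    prediction distribution only stabilized in the deeper layers."""
    if not token_flags:
        return 0.0
    return sum(token_flags) / len(token_flags)
```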

True Deep Thinking Reduces Costs and Increases Efficiency

Across four reasoning test sets, the correlation coefficient between DTR and reasoning accuracy reached 0.82.

Compared to the -0.54 for token length, DTR better reflects reasoning quality.


Google also launched the Think@n strategy based on DTR, which can identify low-quality nonsense early in the reasoning process and concentrate computing resources on samples that truly have depth.

Specifically, it samples multiple reasoning instances for each question, quickly estimates each one's DTR from a short prefix of only 50 tokens, keeps the top 50% of high-quality samples, and then derives the answer by majority voting.

In this way, low-DTR, low-quality samples have their generation terminated early in the reasoning process, directly cutting out meaningless token consumption.
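The pipeline described above can be sketched as follows. `generate_prefix`, `estimate_dtr`, and `complete_and_answer` are hypothetical stand-ins for the model-serving calls, not an API published by the paper:

```python
from collections import Counter

def think_at_n(question, n, generate_prefix, estimate_dtr, complete_and_answer):
    """Think@n sketch: sample n reasoning attempts, score each by the DTR
    of a short (~50-token) prefix, finish only the top half, and take a
    majority vote over their answers."""
    prefixes = [generate_prefix(question, max_tokens=50) for _ in range(n)]
    # Rank prefixes by estimated DTR; the bottom half is terminated early,
    # saving the tokens their full reasoning traces would have consumed.
    ranked = sorted(prefixes, key=estimate_dtr, reverse=True)
    survivors = ranked[: max(n // 2, 1)]
    answers = [complete_and_answer(question, p) for p in survivors]
    return Counter(answers).most_common(1)[0][0]
```

Because only half of the sampled trajectories are ever completed, token consumption drops roughly in proportion, which matches the cost halving reported below.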

In tests across multiple mainstream models, Think@n's reasoning accuracy was equal to or slightly higher than that of traditional strategies.

For example, GPT-OSS-120B-medium achieved an accuracy of 94.7% on the AIME 2025 dataset, higher than the traditional strategy's 92.7%.

It also reduced computing costs by nearly half, with reasoning token consumption dropping from 355.6k to 181.9k, achieving unchanged performance with halved costs.


The first author of this study, Wei-Lin Chen, is a Computer Science PhD from the University of Virginia, focusing on LLM reasoning measurement and evaluator validity, and previously served as a student researcher at Google.


Co-first author Liqian Peng is an alumnus of the University of Science and Technology of China and currently serves as a Research Engineer at Google.


Supervising author Yu Meng is an Assistant Professor of Computer Science at the University of Virginia, with research directions spanning training paradigms, data and reasoning efficiency, and representation foundations, and has previously collaborated with top NLP scholar Danqi Chen.


It seems large model reasoning can no longer rely on padding word counts; only true deep thinking can reduce costs and increase efficiency.

Paper address: https://arxiv.org/abs/2602.13517

