Token-Level, Precision Length Control: 3B Model Beats GPT-5.4 and Claude

Reported by Synced Review

Editor: LRST

【Synced Review】LenVM elevates length modeling to the token level, opening a new dimension for scalable value pretraining. The 3B open-source model achieves precise length control, comprehensively surpassing top closed-source models like GPT-5.4 and Claude-Opus-4-6. Under the same token budget, reasoning accuracy improves tenfold (63% vs 6%). This value pretraining scales without saturation along three axes: model size, data volume, and number of samples.

Tokens are the fundamental computational unit of modern autoregressive models. Each one implies a forward pass, KV cache occupation, latency accumulation, and energy consumption. With the rise of long chain-of-thought (long-CoT) and agentic workflows, generation length matters in two ways: it is a core driver of inference cost, and it affects output quality, since more tokens give the model more room to think while too many are simply wasted.

Existing length control methods are all too coarse. Sequence-level penalties during training give the model no awareness of "how much is left" during generation. Prompt-based instructions are essentially "pleading" with the model to comply without any hard constraint. Pre-decoding length predictors make a one-time judgment and cannot dynamically adjust afterward. Their common limitation is that they operate at the sequence level, while decoding itself happens token by token—no existing framework models remaining length at this granularity.

More fundamentally, value functions are a long-established tool in reinforcement learning for modeling "future return." Yet length has never been treated as a value function: it has lacked both a supporting training paradigm and a verified scaling path.

Researchers from institutions including UC Santa Barbara and Apple propose the Length Value Model (LenVM), which addresses two questions at once:

① How to perform token-level length modeling?

It transforms generation length modeling into a value estimation problem in reinforcement learning: assigning a fixed negative reward to each generated token and discounting cumulatively yields a bounded, monotonic proxy signal for "remaining generation length." Thus, the model has a clear quantitative estimate of "how far is left" at every decoding step.

② How to achieve scalable value pretraining?

This construction naturally brings four properties highly favorable for large-scale pretraining: annotation-free, dense signals, unbiased, and scalable.

This means LenVM's training is essentially a self-supervised process—requiring no additional human annotation or reward models, continuously improving with just "feeding data," much like pretraining a language model.

Paper: https://arxiv.org/abs/2604.27039

Code: https://github.com/eric-ai-lab/Length-Value-Model

Project Page: https://length-value-model.github.io/

Demo: https://length-value-model.github.io/demo/index.html

Detailed Technical Solution

Core Idea: Turning Remaining Length into a Value Function

The core idea of LenVM is simple and elegant: treat generation length as a cost. By assigning a fixed negative reward to each token, the remaining length naturally becomes a value function prediction problem.

Specifically, each non-terminal decoding step t receives a fixed negative reward:

r_t = -(1 - γ)

The corresponding discounted return is:

G_t = Σ_{k=t}^{L-1} γ^{k-t} · r_k = -(1 - γ^{L-t})

Where L is the total sequence length, and γ∈(0,1) is the discount factor. This return has three key properties:

Bounded: G_t ∈ (-1, 0). No matter how long the sequence, the target value always stays within a fixed range.
Monotonic: The closer to termination, the closer G_t is to 0; the more tokens remain, the closer it is to -1. The magnitude of the value directly encodes how far there is to go.
Bellman Consistent: Satisfies G_t = r_t + γ·G_{t+1}, fully conforming to the standard value function framework.

This defines a token-level TD residual δ_t = r_t + γ·V(s_{t+1}) - V(s_t), where V is LenVM's value estimate at state s. It directly measures how the current token changes the expected remaining generation length, a signal that did not exist before.
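The three properties can be checked numerically. The following is a minimal sketch, not the authors' code; it assumes a per-token reward of -(1 - γ), which is the scaling that keeps the return inside (-1, 0), and the function name `length_returns` is hypothetical.

```python
from typing import List


def length_returns(num_tokens: int, gamma: float) -> List[float]:
    """Return G_t for t = 0..num_tokens-1 via the backward Bellman recursion."""
    reward = -(1.0 - gamma)  # assumed per-token reward (keeps returns in (-1, 0))
    returns = [0.0] * num_tokens
    future = 0.0  # the return after the final token is zero
    for t in reversed(range(num_tokens)):
        future = reward + gamma * future  # G_t = r_t + gamma * G_{t+1}
        returns[t] = future
    return returns


gamma = 0.99
num_tokens = 5
returns = length_returns(num_tokens, gamma)

# Closed form of the same quantity: G_t = -(1 - gamma^(L - t))
closed_form = [-(1.0 - gamma ** (num_tokens - t)) for t in range(num_tokens)]

for t, (g, c) in enumerate(zip(returns, closed_form)):
    print(f"t={t}  recursive={g:.6f}  closed_form={c:.6f}")
```

The recursion and the closed form agree, every value stays strictly between -1 and 0, and the values rise toward 0 as the sequence approaches its end.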

Why not directly predict the raw number of tokens?

Generation length can range from a few tokens to 32k, featuring a vast dynamic range that is difficult to regress directly. The discounted return transformation maps the highly variable raw length to a fixed range (-1, 0) while maintaining strict monotonicity. The discount factor γ is a resolution adjustment knob: larger γ offers higher resolution early in generation, while smaller γ is finer near termination.
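The role of γ as a resolution knob can be illustrated with a small sketch. It assumes the return takes the form -(1 - γ^n) for n remaining tokens (so a value can be inverted back to a token count); the helper names are hypothetical and the γ values are chosen only for illustration.

```python
import math


def remaining_to_value(remaining: float, gamma: float) -> float:
    """Map a remaining-token count n to the assumed return -(1 - gamma^n)."""
    return -(1.0 - gamma ** remaining)


def value_to_remaining(value: float, gamma: float) -> float:
    """Invert the mapping: n = log(1 + value) / log(gamma)."""
    return math.log1p(value) / math.log(gamma)


def value_gap(n_short: int, n_long: int, gamma: float) -> float:
    """How far apart two remaining lengths are in value space."""
    return abs(remaining_to_value(n_long, gamma) - remaining_to_value(n_short, gamma))


# Round trip: a value decodes back to the remaining length it encodes.
recovered = value_to_remaining(remaining_to_value(500, 0.999), 0.999)

# Near termination (10 vs 20 tokens left), a small gamma separates the cases better.
near_gap_small_gamma = value_gap(10, 20, 0.9)
near_gap_large_gamma = value_gap(10, 20, 0.9999)

# Far from termination (10k vs 20k tokens left), a large gamma separates them better.
far_gap_small_gamma = value_gap(10_000, 20_000, 0.9)
far_gap_large_gamma = value_gap(10_000, 20_000, 0.9999)

print(recovered, near_gap_small_gamma, near_gap_large_gamma)
print(far_gap_small_gamma, far_gap_large_gamma)
```

With γ = 0.9 the two long horizons collapse to almost the same value, while γ = 0.9999 keeps them apart; the reverse holds for the short horizons.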

Scalable Value Pretraining: Annotation-Free, Three-Axis Scaling

This is the core advantage that distinguishes LenVM from all existing length control methods and arguably the most noteworthy aspect of this work.

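The scale ceiling of traditional value models is locked by annotation cost and quality. LenVM completely bypasses these bottlenecks. The training objective is a token-level mean squared error:

Loss(θ) = E_t[ (V_θ(s_t) - G_t)² ]

Here, Monte Carlo regression is performed at every token position, using the actually observed discounted remaining length G_t as the target for the value prediction V_θ(s_t). Supervisory signals are generated automatically from sampled completions and have four key properties:

Annotation-free: Targets come from the lengths of sampled completions, with no human labels or reward models.
Dense: Every token position supplies a training signal.
Unbiased: Targets are observed Monte Carlo returns rather than bootstrapped estimates.
Scalable: More prompts and more samples per prompt yield more supervision at no labeling cost.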

Experiments validated LenVM's scaling along three axes simultaneously:

Model Size (0.5B → 32B): Larger models consistently yield lower validation loss.
Number of Training Prompts (10k → 100k): Broader data coverage continuously improves length modeling quality.
Samples per Prompt (n=1 → n=16): More completion trajectories provide stronger supervision.

Loss decreases monotonically across all three axes, indicating that LenVM as a value pretraining objective is well-posed: there is no data saturation, and more computational resources directly translate to stronger length modeling capabilities.
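The token-level regression objective from this section can be sketched in a few lines. This is an illustration under stated assumptions, not the released training code: targets are taken to be -(1 - γ^(L - t)), predictions are plain lists standing in for a value head's outputs, and the names `mc_targets` and `token_level_mse` are hypothetical.

```python
from typing import List


def mc_targets(num_tokens: int, gamma: float) -> List[float]:
    """Monte Carlo targets for one sampled completion; only its length is needed."""
    return [-(1.0 - gamma ** (num_tokens - t)) for t in range(num_tokens)]


def token_level_mse(predictions: List[List[float]], gamma: float) -> float:
    """Mean squared error over every token of every sampled completion.

    predictions[i][t] stands in for the value head's output at token t
    of completion i.
    """
    squared_error = 0.0
    count = 0
    for values in predictions:
        targets = mc_targets(len(values), gamma)
        for value, target in zip(values, targets):
            squared_error += (value - target) ** 2
            count += 1
    return squared_error / count


gamma = 0.99
# Two sampled completions of lengths 3 and 8.
perfect = [mc_targets(3, gamma), mc_targets(8, gamma)]
constant = [[-0.5] * 3, [-0.5] * 8]

perfect_loss = token_level_mse(perfect, gamma)
constant_loss = token_level_mse(constant, gamma)
print(perfect_loss, constant_loss)
```

Note that the targets are derived from completion length alone, which is why no annotation or reward model enters the loop.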

Three Inference-Time Applications and Experimental Results

How good is the token-level length signal learned by LenVM? The author team validated it through three inference-stage applications. Critically, none modify the base generation model.

Application 1: Precise Length Control

At each decoding step, LenVM predicts the next state's value for each candidate token and selects tokens accordingly: "Equal To" selects the token whose predicted value is closest to the target discounted return; "At Most" selects the token with the largest value (closest to 0) to guide early termination; "At Least" selects the token with the smallest value (closest to -1) to guide continued generation. This provides a genuine token-level hard constraint, not a coarse-grained "plea."
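The three selection rules can be sketched as follows. The candidate set, its predicted values, and the function names are hypothetical; the source does not say how candidates are chosen, so a small hand-written dictionary stands in for them, and the "Equal To" target assumes the return form -(1 - γ^n) for n remaining tokens.

```python
from typing import Dict, Optional


def target_value(remaining_budget: int, gamma: float) -> float:
    """Assumed discounted return for a desired number of remaining tokens."""
    return -(1.0 - gamma ** remaining_budget)


def select_token(candidate_values: Dict[str, float], mode: str,
                 target: Optional[float] = None) -> str:
    """Pick one candidate token from its predicted next-state value."""
    if mode == "equal_to":
        if target is None:
            raise ValueError("equal_to needs a target value")
        # Closest predicted value to the target discounted return.
        return min(candidate_values, key=lambda tok: abs(candidate_values[tok] - target))
    if mode == "at_most":
        # Largest value (closest to 0) pushes toward early termination.
        return max(candidate_values, key=lambda tok: candidate_values[tok])
    if mode == "at_least":
        # Smallest value (closest to -1) pushes toward continued generation.
        return min(candidate_values, key=lambda tok: candidate_values[tok])
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical predicted values for three candidate tokens.
candidates = {"So": -0.30, "Wait": -0.80, "Therefore": -0.10}

equal_choice = select_token(candidates, "equal_to", target_value(36, gamma=0.99))
at_most_choice = select_token(candidates, "at_most")
at_least_choice = select_token(candidates, "at_least")
print(equal_choice, at_most_choice, at_least_choice)
```

In a real decoder the selection would presumably be restricted to tokens the base model already considers plausible, so that length control does not override fluency.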

On the LIFEBench benchmark (covering QA, summarization, reasoning, and creative writing, with 180 items each in Chinese and English), Qwen2.5-3B paired with a 1.5B LenVM raises its length score from 25.6 to 62.6 and cuts length deviation from 83% to 56%. This significantly outperforms closed-source models such as GPT-5.4 (37.4), Claude-Opus-4-6 (35.5), and Gemini-3.1-Pro (49.3). The Qwen2.5-7B + LenVM combination improves further, reaching a score of 64.8 with a deviation of just 44%.

However powerful a closed-source model is, coarse prompt-based control has an inherent ceiling. LenVM instead provides precise constraints that take effect at every decoding step.

Application 2: Continuous Performance-Efficiency Trade-off

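Using exponential tilting, LenVM softly reweights the base model's token distribution:

p_β(x | s) ∝ p(x | s) · exp(β · V(s, x))

Here p is the base model's next-token distribution, V(s, x) is LenVM's predicted value after appending candidate token x, and β is the tilting strength. When β > 0, tokens expected to lead to shorter completions receive higher probability; when β = 0, it reverts to the base model. This acts as a continuous knob for smoothly trading off between reasoning quality and token consumption.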
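A minimal sketch of exponential tilting follows. It assumes the tilted distribution is proportional to the base probability times exp(β · value), with values in (-1, 0) where values closer to 0 mean shorter completions; the token probabilities and values are made up for illustration.

```python
import math
from typing import Dict


def tilt_distribution(base_probs: Dict[str, float], values: Dict[str, float],
                      beta: float) -> Dict[str, float]:
    """Reweight base probabilities by exp(beta * value) and renormalize."""
    weights = {tok: p * math.exp(beta * values[tok]) for tok, p in base_probs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


# Hypothetical next-token probabilities and predicted length values.
base_probs = {"Wait": 0.5, "So": 0.3, "Therefore": 0.2}
values = {"Wait": -0.8, "So": -0.3, "Therefore": -0.1}

unchanged = tilt_distribution(base_probs, values, beta=0.0)
shorter = tilt_distribution(base_probs, values, beta=5.0)

for tok in base_probs:
    print(f"{tok:10s} base={base_probs[tok]:.3f}  beta=5: {shorter[tok]:.3f}")
```

Unlike the hard selection rules, tilting keeps every token possible and only shifts probability mass, which is what makes the trade-off continuous in β.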

On GSM8K, with a token budget of 200: the hard truncation baseline achieves a Pass@1 of roughly 6%, while LenVM-guided decoding achieves a Pass@1 of roughly 63%—a tenfold difference. This result reveals an important fact: the base model inherently possesses the capability to solve problems using shorter paths but usually fails to select them. LenVM "excavates" these paths through fine-grained reweighting. On MATH500 and MathVista (visual math reasoning), LenVM also consistently outperforms the hard truncation baseline, smoothly tracing out a Pareto frontier as β varies.

Application 3: Generation Length Prediction

LenVM can predict the total generation length right from the prompt boundary (before the first response token is generated). This has direct value for inference system batch grouping, KV cache pre-allocation, and request priority scheduling—information currently only available after decoding completes. The 32B model achieves a Mean Relative Error (MRE) as low as 9.8% in the math domain, 14.9% in the code domain, and 17.1% in the instruction-following domain, with performance improving consistently with model scale.
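As a rough illustration of how a prompt-boundary value could be turned into a length estimate and scored, the sketch below inverts the assumed return form -(1 - γ^n) and computes a mean relative error using the common definition mean(|predicted - actual| / actual). The paper's exact metric definition is not given here, and all numbers are invented.

```python
import math
from typing import List


def predict_total_length(prompt_value: float, gamma: float) -> float:
    """Decode a value at the prompt boundary into an expected token count."""
    return math.log1p(prompt_value) / math.log(gamma)


def mean_relative_error(predicted: List[float], actual: List[float]) -> float:
    """Average of |predicted - actual| / actual over all requests."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


gamma = 0.999
actual_lengths = [100.0, 500.0, 2000.0]

# Hypothetical value-head outputs, built here from slightly wrong lengths.
prompt_values = [-(1.0 - gamma ** n) for n in (110, 450, 1900)]

predicted_lengths = [predict_total_length(v, gamma) for v in prompt_values]
mre = mean_relative_error(predicted_lengths, actual_lengths)
print([round(n) for n in predicted_lengths], round(mre, 4))
```

A serving system could use such estimates before decoding starts, for example to group requests of similar expected length into the same batch.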

Bonus Insight: Which Tokens "Extend" or "Conclude" Reasoning?

LenVM's token-level TD residual also provides a previously non-existent observational lens.

Reasoning-extending tokens (negative TD residual, δ_t < 0, so the value moves toward -1) such as wait, but, ah, think, and consider often correspond to reasoning pivots and reflections. Notably, "ah" frequently appears in "Aha Moments" like "Ah! I see the mistake."

Reasoning-concluding tokens (positive TD residual, δ_t > 0, so the value moves toward 0) such as therefore, clearly, perfect, and closing markers like ✅ 🎉 correspond to answer confirmation and generation termination. LenVM is not just a control signal; it is also a new window into observing how models reason.
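The sign test behind this analysis can be sketched as follows. It assumes a per-token reward of -(1 - γ) and the standard TD residual r + γ·V(next) - V(current); the value trajectory is hypothetical and exists only to show the two signs.

```python
from typing import List


def td_residuals(values: List[float], gamma: float) -> List[float]:
    """values[t] is the value before token t; the last entry is the value after the final token."""
    reward = -(1.0 - gamma)  # assumed fixed per-token reward
    return [reward + gamma * values[t + 1] - values[t] for t in range(len(values) - 1)]


def label(delta: float) -> str:
    """Negative residual: more length expected than before; positive: less."""
    return "extends" if delta < 0 else "concludes"


gamma = 0.99
tokens = ["wait", "therefore", "<eos>"]
# Hypothetical value estimates: "wait" pushes the value down, "therefore" pulls it up.
values = [-0.50, -0.70, -0.20, 0.0]

deltas = td_residuals(values, gamma)
labels = [label(d) for d in deltas]
for tok, d, lab in zip(tokens, deltas, labels):
    print(f"{tok:10s} delta={d:+.3f}  {lab}")

# Sanity check: on a trajectory that matches its own returns, residuals vanish.
exact = [-(1.0 - gamma ** (4 - t)) for t in range(5)]
exact_deltas = td_residuals(exact, gamma)
```

A residual near zero means the token changed nothing about the expected remaining length, so only tokens with large residuals are informative for this kind of analysis.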

Conclusion

LenVM's contributions can be understood on two levels.

For Length Modeling: It advances control granularity from the sequence level to the token level, giving each decoding step a clear "awareness of remaining length." This breaks the common ceiling of existing methods: prompt control, training penalties, and pre-decoding predictors cannot provide a dynamic, per-token constraint signal. A 3B open-source model beats GPT-5.4 and Claude-Opus-4-6 at precise length control not because of a well-crafted prompt, but because it has a genuine token-level length signal for the first time.

For Scaling: With length treated as a value function, the training objective is inherently annotation-free, signal-dense, and scalable along three axes, and its scaling behavior closely resembles that of language model pretraining. This establishes generation length as a new dimension for scalable value pretraining: no extra annotation is needed, and more computation and data translate directly into better length modeling.

Simultaneously, LenVM provides a length-specific value baseline for future RL training: it can serve as a dense advantage signal in PPO or improve credit assignment through potential-based reward shaping without altering the task objective.
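The reward-shaping idea can be illustrated with the classic potential-based form, in which the shaped reward is r + γ·Φ(next) - Φ(current). The sketch below uses a hypothetical sequence of length values as the potential Φ; it is not taken from the paper and only shows why the task objective is preserved.

```python
from typing import List


def shaped_rewards(task_rewards: List[float], potentials: List[float],
                   gamma: float) -> List[float]:
    """potentials[t] is the potential before step t; it has one more entry than task_rewards."""
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(task_rewards)]


def discounted_sum(rewards: List[float], gamma: float) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


gamma = 0.99
task_rewards = [0.0, 0.0, 0.0, 1.0]  # sparse task reward at the end
potentials = [-0.60, -0.45, -0.30, -0.12, 0.0]  # hypothetical length values, 0 at termination

shaped = shaped_rewards(task_rewards, potentials, gamma)
difference = discounted_sum(shaped, gamma) - discounted_sum(task_rewards, gamma)

# The shaping terms telescope to gamma^T * potential_end - potential_start.
expected = gamma ** len(task_rewards) * potentials[-1] - potentials[0]
print(shaped, difference, expected)
```

Because the difference depends only on the first and last potentials, every trajectory from the same prompt is shifted by the same amount when the terminal potential is zero, while each individual step now carries a dense signal.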

Generation length shouldn't just be a number tallied after the fact. It should be a signal that the model perceives and weighs at every decoding step—just as it perceives semantics and syntax. LenVM makes this possible for the first time.

References:

https://arxiv.org/abs/2604.27039
