Think More Accurately, and for Longer! The New Reinforcement Learning Algorithm FIPO Arrives

As models like OpenAI o1 demonstrate outstanding complex reasoning capabilities, Reinforcement Learning with Verifiable Rewards (RLVR) has gradually become a key technical path for enhancing large model performance. However, how exactly does reinforcement learning alter the model's internal "black box"? What bottlenecks do existing algorithms face?

Recently, the Qwen Pilot team from Alibaba's Tongyi Laboratory released a series of four technical blog posts, deeply analyzing the underlying mechanisms and potential limitations of reinforcement learning in large models.


Based on these insights, the team officially launched the new algorithm FIPO (Future-KL Influenced Policy Optimization). By ingeniously introducing the Future-KL mechanism, FIPO effectively solves the stubborn problem of "reasoning length stagnation" in pure RL training. Trained head-to-head from a clean 32B base model, without any SFT data warm-up or value-model assistance, FIPO demonstrated a leading convergence ceiling: on authoritative math evaluations it surpassed o1-mini and the similarly sized deepseek-zero-MATH, setting a new benchmark for medium-scale pure-RL training in the open-source community.

The Ship of Theseus:

98% of the output remains unchanged;
the essence of RL is "sparse but critical" path guidance

After fine-tuning with reinforcement learning (RL), is the internal reasoning mechanism of a large model completely "rewritten"? To answer this Ship-of-Theseus-like dilemma, the team conducted a deep dissection of model behavior at the token level.

The research results overturned a common industry assumption: in the vast majority of generation steps, the RL model's behavior is highly consistent with the base model, with over 98% of the token distributions remaining virtually unchanged.

[Figure: (a)(b) token-level JS divergence between the base and RL models; (c) reasoning-trajectory visualization]

As shown in figures (a) and (b) above, the divergence of the model output distribution (JS Divergence) approaches zero for the vast majority of positions in the sequence, with pulse-like surges occurring only in extremely few positions.

This means that RLVR (Reinforcement Learning with Verifiable Rewards) does not create entirely new global capabilities out of thin air for the model; its optimization mechanism presents a characteristic of being "extremely sparse but critical." As shown in the trajectory visualization in figure (c), the role of RL is more like a precise "lane-changing switch": it implements micro-adjustment interventions only at a few key logical decision points (RL edits), guiding the model onto a correct reasoning trajectory that the base model originally possessed but found difficult to maintain throughout the entire process.
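To make this measurement concrete, here is a minimal, self-contained sketch (using toy next-token distributions, not the team's actual models) of computing per-position JS divergence and flagging the sparse "RL edit" positions:

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two next-token distributions."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy sequence: at most positions the base and RL policies agree almost
# exactly; only one "RL edit" position shows a pulse-like divergence.
base = [[0.70, 0.20, 0.10], [0.70, 0.20, 0.10], [0.60, 0.30, 0.10]]
rl   = [[0.70, 0.20, 0.10], [0.70, 0.20, 0.10], [0.10, 0.80, 0.10]]

divs  = [js_divergence(b, r) for b, r in zip(base, rl)]
edits = [i for i, d in enumerate(divs) if d > 0.01]  # sparse "edit" positions
print(edits)  # prints: [2]
```

In a real analysis `base` and `rl` would be the softmax outputs of the two checkpoints over the full vocabulary at each position of a shared trajectory; the sparsity claim is that `edits` stays tiny relative to sequence length.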

Locating the Key Points (Direction):

Breaking through amplitude blindness,
precisely locking onto the reasoning core with "directionality"

Since the key tokens triggering the "butterfly effect" are so sparse, how can researchers precisely locate them among tens of thousands of outputs?

In further exploration, the team found that traditional evaluation metrics commonly used in the industry (such as Entropy and KL Divergence) have natural observational blind spots: they can only measure "how much change" occurred (amplitude) in the model, but cannot reveal the specific morphology of the change.

[Figure: (a) distribution overlap under traditional metrics; (b) histogram; (c) token-replacement experiment]

As shown in figure (a) and histogram (b) above, under the observation of traditional metrics, the output distributions of the base model and the RL model overlap highly, still resembling "looking for a needle in a haystack." However, when the team abandoned pure amplitude metrics and introduced a new dimension called Symbolic Log-Probability Difference (Δ log P), the internal logic of RL became instantly clear.

Δ log P can precisely capture the directionality of optimization: it clearly quantifies whether the RL algorithm is "encouraging" (positive) or "suppressing" (negative) the generation of a specific token. The token-replacement experiment in figure (c) above provides the most convincing proof: the key decision points screened out by Δ log P can restore complete RL reasoning performance at an extremely low replacement ratio, far exceeding the accuracy of KL divergence and entropy.

Grasping this law of directionality, the team not only pinpointed the core hubs that most strongly steer the reasoning trajectory, but also opened up a new engineering path: at test time, directly amplifying these key decisions along the Δ log P direction significantly improves the model's accuracy on mathematical problems without any additional training.
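A minimal sketch of the metric (toy probabilities; the function name `delta_log_p` and the values are illustrative, not the paper's code) shows how the signed per-token difference both ranks candidate decision points and reads off the direction of the intervention:

```python
import math

def delta_log_p(base_probs, rl_probs, token_ids):
    """Signed log-probability difference for each generated token:
    positive -> RL encourages the token, negative -> RL suppresses it."""
    return [math.log(r[t]) - math.log(b[t])
            for b, r, t in zip(base_probs, rl_probs, token_ids)]

# Toy trajectory of three generated tokens (ids into a 3-word vocab).
base_probs = [[0.70, 0.20, 0.10], [0.70, 0.20, 0.10], [0.60, 0.30, 0.10]]
rl_probs   = [[0.70, 0.20, 0.10], [0.70, 0.20, 0.10], [0.10, 0.80, 0.10]]
tokens     = [0, 0, 1]

dlp = delta_log_p(base_probs, rl_probs, tokens)

# Rank positions by |delta log P| to find the candidate key decision point,
# then read the sign: here position 2 is strongly "encouraged" by RL.
key = max(range(len(dlp)), key=lambda i: abs(dlp[i]))
```

The test-time trick described above would then amplify the logits of such positions in the Δ log P direction; how strongly to amplify is a tuning choice not specified here.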

Revealing the Hidden Dangers of Long Reasoning:

Frequent "Oops Moments" and the limitations of global reward mechanisms

After mastering how to locate the key optimization directions, the team turned their attention to the specific behaviors during the model's long reasoning process and made an alarming discovery. The industry generally expects large models to correct previous errors through self-reflection (the so-called "Aha Moment"). However, when deeply analyzing massive amounts of long chain-of-thought data, the team observed a completely opposite and highly destructive phenomenon—the "Self-Misleading" (Oops Moment). Its typical manifestation is: the model has already successfully derived the correct intermediate steps or final answer, but unexpectedly triggers an extra "self-reflection" sequence, forcibly overturning the correct conclusion and leading to an erroneous final output.

This phenomenon is common in complex mathematical reasoning. As shown in the typical case below (Table 1): at step 108, the model had accurately calculated the target result (3507). But immediately after, it generated a redundant self-doubt ("Wait, let me double check..."), was then completely led astray by its own newly constructed erroneous logic, and finally arrived at an absurd wrong answer (15).

[Table 1: a typical "Oops Moment" case]

Is this situation an occasional isolated case or a universally existing systematic defect? Through strict statistics on massive zero-base reinforcement learning verification data, the team revealed a harsh reality:

Throughout the entire training cycle, positive "Aha Moments" are extremely rare, accounting for only about 1%. In sharp contrast, the occurrence rate of destructive "Oops Moments" remains steadily high at nearly 3%, with a frequency almost three times that of "Aha Moments."

[Figure: frequency statistics of "Aha Moments" vs. "Oops Moments" during training]

Why does the model frequently undergo this "reverse optimization"?

The team pointed out that the root cause lies in the defect of Coarse-Grained Credit Assignment existing in current mainstream reinforcement learning algorithms (such as standard GRPO). The traditional global reward mechanism adopts uniform advantage distribution; as long as the final result is correct, the system distributes the same reward evenly to all tokens on the entire chain of thought.

This mechanism cannot distinguish which steps are true key logical advancements and which are meaningless redundant reflections, leaving the model without clear local perception of right and wrong. Over time, the model easily gets lost in long-sequence reasoning, ultimately hitting the performance bottleneck of "reasoning length stagnation."
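The coarse-grained credit assignment being criticized can be sketched in a few lines (a simplified, illustrative rendering of group-normalized GRPO-style advantages, not the team's implementation):

```python
def grpo_advantages(group_rewards, seq_lens):
    """Group-normalized advantage, broadcast uniformly over every token
    of each response -- the coarse-grained credit assignment the post
    criticizes. Simplified sketch of the standard GRPO scheme."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    token_advs = []
    for r, length in zip(group_rewards, seq_lens):
        a = (r - mean) / std
        token_advs.append([a] * length)  # every token gets the SAME advantage
    return token_advs

# Two sampled responses to one prompt: one correct (reward 1), one wrong.
advs = grpo_advantages([1.0, 0.0], [4, 6])
```

Note that within the correct response, a redundant self-doubt token receives exactly the same positive advantage as the decisive calculation step; nothing in the signal tells the two apart.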

Reshaping Credit Assignment:

FIPO introduces "Future Influence" to unlock deep reasoning potential

Combining the above three insights, a technical path to break the current reasoning bottleneck has become clear. The leap in large model reasoning capability relies on precise "lane-changing" at a very few key decision points (insights 1 and 2), while the traditional global reward mechanism, with its coarse-grained allocation, cannot identify these points and even inevitably fuels high-frequency "Oops" failures in long sequences (insight 3). The next generation of reinforcement learning algorithms must therefore make the leap from "global uniform rewards" to "token-level precise credit assignment."

To thoroughly tackle this credit assignment problem, the team officially proposed the new FIPO (Future-KL Influenced Policy Optimization) algorithm.

Compared to traditional GRPO algorithms that highly rely on binary result feedback (ORM) at the end of the trajectory and average out rewards, FIPO reconstructs the optimization paradigm from the underlying logic. It innovatively introduces the core metric of "Future Influence," aiming to track and quantify in real-time the causal effect of every generated token on the direction of the subsequent entire reasoning trajectory.

Core Mechanism:

Introducing "Future-KL" to achieve token-level precise evaluation

Rather than relying on binary outcome feedback (ORM) that settles only at the end of the trajectory, FIPO introduces a Future-KL estimation mechanism designed to capture causal influence: as the model generates a chain of thought, FIPO tracks the probability shift triggered by every token.

[Formula: per-token probability shift tracked by FIPO]

Based on this, the overall causal influence of the current token on the future can be defined as the accumulation of probability shifts in the subsequent trajectory:

[Formula: cumulative future influence (Future-KL)]
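The formulas above appear only as images in this copy. One plausible form, consistent with the surrounding description (a per-token probability shift, accumulated over the remaining trajectory, with the discount factor γ introduced in the stability mechanisms below), could be sketched as follows; the symbols are our assumptions, not the paper's notation:

```latex
% Hypothetical sketch; symbol names are assumptions, not the paper's notation.
% Per-token probability shift between the updated and old policy:
\Delta_k \;=\; \log \pi_{\theta'}(y_k \mid y_{<k}) \;-\; \log \pi_{\theta}(y_k \mid y_{<k})

% Cumulative future influence of token t on the remaining trajectory,
% discounted by the soft-decay factor gamma:
\mathrm{FutureKL}_t \;=\; \sum_{k=t+1}^{T} \gamma^{\,k-t}\, \Delta_k
```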

According to the feedback of this indicator, the algorithm achieves precise guidance of local reasoning trajectories:

  • Positive Reinforcement: When FutureKLt > 0, it indicates that the updated policy plays a reinforcing role on the entire subsequent trajectory. The currently generated token is regarded as a "stable anchor" in the reasoning chain; the algorithm will increase its weight, prompting the model to follow this effective path in the future.

  • Negative Suppression: Conversely, when FutureKLt < 0, it indicates that the policy is collectively suppressing future token generation. This means the reasoning trajectory derived from this point is becoming less favored by the model; the algorithm will reduce the weight of this branch, thereby effectively preventing the continuation of inefficient or erroneous trains of thought.

Engineering Robustness Guarantee:

Three major mechanisms suppress training instability

In actual training, unconstrained Future-KL will amplify the variance brought by distribution shifts, easily leading to gradient explosion and catastrophic training collapse. FIPO's engineering advantage lies in the fact that the Qwen Pilot team designed three major stability mechanisms for it, ensuring smooth optimization progress:

  • Extreme Value Filtering: The algorithm explicitly masks out tokens whose advantages fluctuate extremely during updates, eliminating the main source of training instability without altering effective reasoning signals.

  • Soft Decay Window: Innovatively introduces a discount factor (γ) to simulate the diminishing effect of causal influence. This mechanism prompts the model to prioritize local logical coherence while smoothly filtering out accumulated noise from the distant future.

  • Influence Weight Clipping: Strictly limits the influence weight (f_t), which serves as the advantage multiplier, to a preset safe interval, preventing numerical collapse caused by extreme probability shifts.
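Putting the three mechanisms together, here is a hedged sketch of how token-level influence weights might be computed; all constants, the exponential mapping, and the function name are illustrative assumptions, not FIPO's published implementation:

```python
import math

def fipo_influence_weights(deltas, gamma=0.9, clip_lo=0.5, clip_hi=2.0,
                           extreme=5.0):
    """Illustrative sketch of FIPO-style token-level influence weighting.

    deltas[k]: per-token log-prob shift between updated and old policy.
    For each position t we (1) mask extreme shifts (extreme-value
    filtering), (2) accumulate a gamma-discounted sum over the FUTURE of
    the trajectory (soft decay window), (3) map to a multiplicative
    weight and clip it to a safe interval (influence weight clipping).
    """
    T = len(deltas)
    masked = [0.0 if abs(d) > extreme else d for d in deltas]  # (1) filter
    weights = []
    for t in range(T):
        future_kl = sum((gamma ** (k - t)) * masked[k]        # (2) decay
                        for k in range(t + 1, T))
        w = math.exp(future_kl)  # >1 reinforces, <1 suppresses the token
        weights.append(min(max(w, clip_lo), clip_hi))         # (3) clip
    return weights

# Trajectory where the future shift after token 0 is net negative and
# one extreme outlier (9.0) is filtered out rather than destabilizing it.
w = fipo_influence_weights([0.2, 0.1, 9.0, -0.3, 0.0])
```

The weight would then multiply each token's advantage during the policy update; tokens whose future trajectory is being suppressed get down-weighted, while the final token, having no future, keeps a neutral weight of 1.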

Breaking the Length Bottleneck:

A substantial leap in 10,000-token deep reasoning and accuracy

To verify the effectiveness of this solution, the research team evaluated it on the Qwen2.5-32B-Base model, which had never been exposed to Long-CoT data. Experimental data demonstrated FIPO's breakthrough progress in solving complex mathematical reasoning:

[Figure: average reasoning length during training]
  • Breaking Length Stagnation: Traditional baseline algorithms often get stuck in length stagnation after generating about 4,000 tokens. Under FIPO's positive incentive, the model's average reasoning length was successfully pushed to over 10,000 tokens.

  • Achieving Accuracy Scaling: Experiments clearly verified that "increased length is not redundant generation"—under FIPO's guidance, the increase in response length shows a strong positive correlation with the model's actual problem-solving accuracy, truly achieving meaningful deep thinking.

[Figure: response length vs. problem-solving accuracy]
  • In the most challenging AIME 2024 math benchmark, FIPO broke through the 50.0% performance ceiling of the baseline DAPO algorithm, pushing accuracy to a new high of 58.0%. This makes FIPO the first open-source solution at the 32B parameter scale, trained in a pure-RL setting with zero external long chain-of-thought priors, to close the performance gap with o1-mini.

[Figure: AIME 2024 accuracy comparison]

In addition, monitoring indicators also confirm the health of FIPO's training process: smoothly rising Policy KL, consistently low Gradient Norm, and Entropy that maintains exploration jointly indicate that the model is steadily expanding its reasoning space rather than falling into mechanical local fitting.

[Figure: training health metrics (Policy KL, Gradient Norm, Entropy)]

For more details, see:

Paper:
https://arxiv.org/pdf/2603.19835

GitHub:
https://github.com/qwenpilot/FIPO

Model:
https://huggingface.co/QwenPilot/FIPO_32B
https://modelscope.cn/models/chiyum609/FIPO_32B

Training Curves:
https://swanlab.cn/@QwenPilot/FIPO

Job Invitation

Alibaba Tongyi Laboratory - QwenPilot - Recruiting Large Language Model Algorithm Experts/Interns

Team Introduction:
The QwenPilot team belongs to Alibaba's Tongyi Laboratory, dedicated to solving long-term and fundamental challenges in the development of large language models. Our mission is to build the next generation of artificial intelligence systems with general intelligence, enabling models to truly possess deep reasoning, planning, and complex problem-solving capabilities. We pursue models that can generalize across tasks and domains and demonstrate reliable and profound intelligence in various real-world scenarios. Meanwhile, we are fully advancing frontier exploration of autonomous agents (AI Agents), empowering models with strong decision-making and execution capabilities in dynamic environments.

Work Locations: Beijing & Hangzhou & Seattle
Positions: Algorithm Expert, Intern

Main Research Directions:

  • Exploration and Evolution of Frontier Foundation Models: Participate in the R&D and iteration of next-generation large models, tackle frontier technical bottlenecks, and promote capability leaps and boundary breakthroughs in general intelligence.

  • Model Capability Evaluation and Defect Diagnosis: Construct systematic evaluation methods and metrics to precisely characterize model capability boundaries; simultaneously identify capability shortcomings in key tasks and deeply analyze failure modes and their root causes.

  • Training Mechanism Exploration and Problem Analysis: Deeply research the core mechanisms and potential bottlenecks of large-scale model training to provide solid theoretical guidance for improving and evolving training paradigms.

  • Design and Exploration of Better Training Paradigms: Promote the evolution of training methodologies through practice to build stronger, more reliable, and more intelligent models.

  • Exploration and Optimization of Agentic RL Algorithms and Architectures: Deeply research the application of reinforcement learning in complex multi-step reasoning and decision-making environments, improving model performance in dimensions such as long-range planning, tool calling, and self-reflection, and stimulating the model's self-exploration and evolution capabilities.

Qualifications

Basic Requirements:

  • Graduated from top global universities, majoring in Computer Science, Artificial Intelligence, Machine Learning, Deep Learning, Software Engineering, Mathematics, Physics, or related fields; PhD/Master's preferred.

  • Practical experience in LLM systems, pre-training, Post-training (SFT/RL), or evaluation.

  • Solid practical experience in AI Agent architecture development or Agentic RL fields, familiar with decision-making mechanisms of agents in complex interactive environments.

  • Proficient in deep learning frameworks such as PyTorch or JAX, and possess solid software engineering capabilities.

  • Possess excellent self-learning abilities and self-drive, with a strong desire to explore and curiosity about frontier fields. Good at independent thinking, reflection, and summarization; possess good communication skills and team collaboration spirit.

Plus Points:

  • Deep understanding and proficient use of mainstream LLM reinforcement learning and training frameworks, such as veRL or Slime.

  • Familiarity with underlying inference and deployment frameworks, such as vLLM or SGLang.

  • Rich development and training experience in Multi-Agent collaboration or Agentic RL in complex environments.

  • Published influential papers at top conferences/journals such as NeurIPS, ICML, ICLR, ACL, etc.

  • Significant contributions to well-known open-source projects, significant influence in the open-source community, or rich LLM development and training experience.

Resume Submission: guoyin.wang@alibaba-inc.com


Follow us for the latest updates on the Qwen large models
