Meta's Dr. Zero: A Self-Evolving Agent Framework for Complex Reasoning and Search Without Training Data

Research on self-evolving intelligent agents has made new progress.

Recently, Meta's Superintelligence Lab and the University of Illinois at Urbana-Champaign (UIUC) jointly proposed the Dr. Zero framework, enabling agents to achieve efficient self-evolution under zero training data conditions.

According to the introduction, this framework addresses the challenges faced by multi-turn search agents in data-free self-evolution, such as "limited problem diversity" and the need for "substantial computational resources for multi-step reasoning and tool use."

The research team proposed a novel method, "Hierarchical Relative Policy Optimization" (HRPO). By clustering structurally similar problems to build robust group-level baselines, it preserves training effectiveness while avoiding the expensive nested sampling otherwise required during self-evolution.

Experiments show that in complex question-answering tasks, the framework outperformed fully supervised baselines by up to 14.1% without requiring any human-annotated data, demonstrating the strong potential of search-enhanced models in advanced reasoning tasks.

Meanwhile, the results show that with sound architectural design and reward mechanisms, complex reasoning and search capabilities can emerge spontaneously in an agent without any human-annotated data. This points to a new direction for training models in data-scarce environments.

The Data Scarcity Challenge in AI Self-Evolution

Training a powerful model typically requires massive, high-quality human-annotated data. Especially in tasks involving complex reasoning and multi-step searches, obtaining precise annotated data is not only time-consuming but also extremely costly. Although the concept of "adaptive language agents" has been proposed for a long time, aiming to allow models to improve performance through iterative learning, existing mainstream methods still struggle to achieve true self-evolution. They still heavily rely on a large number of human-crafted problems or labels as prompts to drive exploration. This dependence on human intervention limits AI's ability to explore unknown boundaries.

To break through this limitation, the academic community has begun to explore data-free self-evolution, allowing models to autonomously generate problems and solve them, thereby building synthetic training data. However, moving from the lab to real-world applications also faces enormous challenges.

An ideal self-evolution framework would allow AI to achieve a spiral of performance improvement through proposer-solver co-evolution without any annotated datasets.


Figure | Adaptive training framework (Huang et al., 2025a): minimally supervised iterative training of the proposer and solver.

Most current self-evolution research focuses on specific fields with clear definitions and closed rules, such as mathematics and programming. In these fields, even with limited data diversity, models can make good progress.

However, once models enter the open domain, the situation changes completely. They tend to generate simple single-hop questions that lack challenge, while multi-step reasoning with search tools consumes enormous compute. Letting models optimize through extensive blind trial and error makes the computational cost unbearable.

Therefore, how to enable AI to perform high-quality self-evolution efficiently in complex open worlds, without relying on human data, is the core challenge Dr. Zero attempts to solve.

Dr. Zero: A "Zero-Data" Self-Evolution Learning System

Dr. Zero is not just a model, but a learning system capable of self-improvement. Its core design mainly includes three aspects.

1. Proposer-Solver Co-Evolution

The framework includes two core roles—the proposer and the solver. Both are played by large language models and co-evolve during the training process.


Figure | Dr. Zero self-evolution feedback loop. Guided by the solver's feedback, the proposer synthesizes verifiable and challenging queries, continuously enhancing the solver's search and reasoning capabilities.

The proposer's task is not only to generate problems but also to actively explore open-domain information using external search engines, generating diverse and structurally complex questions. More importantly, as training progresses, the proposer optimizes its own strategy based on rewards, generating new problems that are more complex and challenging, yet still verifiable.

The solver's task is to attempt to use external search engines to obtain information and answer these questions. It trains based on the synthetic problems generated by the proposer, continuously optimizing its own reasoning logic and search tool usage capabilities. As the solver's level improves, it will in turn push the proposer to find more tricky angles to generate new problems.


Figure | Evolution of iterative reward dynamics between proposer and solver in Dr. Zero. The baseline reward decreases across iterations, reflecting the co-evolution between the models: when one model's performance improves, it naturally lowers the other's initial reward baseline, driving continuous self-optimization through the reinforcement learning mechanism.

2. Hierarchical Relative Policy Optimization (HRPO)

When letting AI self-evolve, the biggest obstacle is often compute. Traditional reinforcement learning methods (like GRPO) require "nested sampling" to accurately estimate a problem's quality: sampling multiple candidate questions per prompt, each of which must then be evaluated. HRPO cleverly sidesteps this problem.

Traditional methods are computationally intensive, and a single global baseline becomes unstable when open-domain problems vary widely in structure. HRPO instead clusters structurally similar problems (e.g., by the number of reasoning "hops") and builds group-level baselines. The model no longer needs to generate many duplicate questions per prompt for evaluation; it generates a single question per prompt and obtains a robust estimate by comparing its performance against other questions in the same group. This avoids expensive nested sampling, significantly cutting computational cost while preserving training effectiveness.
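The group-relative baseline idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's implementation: it assumes each generated question carries a hop count (used as the cluster key) and a scalar reward, and normalizes each reward against the mean and standard deviation of its hop-count group.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hrpo_advantages(samples):
    """Group-relative advantages for proposer samples (illustrative sketch).

    Each sample is a dict with:
      - "hops":   structural complexity of the generated question (cluster key)
      - "reward": scalar reward the proposer received for that question
    Instead of nested sampling (many questions per prompt), each prompt yields
    one question; its baseline is the batch statistics of structurally
    similar questions in the same group.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[s["hops"]].append(s["reward"])

    advantages = []
    for s in samples:
        rewards = groups[s["hops"]]
        baseline = mean(rewards)
        scale = pstdev(rewards) or 1.0  # single-member group: no rescaling
        advantages.append((s["reward"] - baseline) / scale)
    return advantages
```

A question whose reward beats its group's baseline gets a positive advantage; one below it gets a negative advantage, without ever resampling the same prompt.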

3. Difficulty-Guided Reward Mechanism

How can the proposer be made to generate high-quality, difficult problems? Dr. Zero adopts a fine-grained, difficulty-guided reward mechanism.

The reward design incentivizes the proposer to generate complex, multi-hop queries whose answers can be verified via search engines, rather than simple single-hop questions. It encourages questions to be challenging while ensuring the answers remain objectively verifiable from returned search results, avoiding open-ended or subjective questions that cannot be evaluated.
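A minimal sketch of what such a difficulty-guided reward could look like. The function and its shaping are illustrative assumptions, not the paper's actual formula: it pays nothing for unverifiable questions and peaks when the solver succeeds about half the time, so both trivially easy and effectively unanswerable questions earn little.

```python
def proposer_reward(solve_rate, verifiable):
    """Hypothetical difficulty-guided reward for the proposer.

    solve_rate: fraction of solver rollouts that answered correctly (0..1)
    verifiable: whether the answer was confirmed against search results

    Unverifiable questions earn nothing; reward peaks at solve_rate = 0.5,
    penalizing questions the solver always answers (too easy) or never
    answers (possibly unanswerable).
    """
    if not verifiable:
        return 0.0
    return 1.0 - abs(2.0 * solve_rate - 1.0)
```

Under this shaping, the proposer's best strategy is to aim at the frontier of the solver's current ability, which is exactly the pressure the co-evolution loop relies on.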

Dr. Zero, as a scalable and efficient framework, improves the proposer and solver through data-free self-evolution iterations. In each iteration, the proposer generates a batch of question-answer pairs with heterogeneous hop structures. Using the solver's feedback, the proposer optimizes through HRPO to generate verifiable, diverse, and challenging queries. Meanwhile, the solver uses GRPO to leverage the generated data to improve its search and reasoning capabilities. This alternating optimization loop forms a symbiotic feedback mechanism: as the solver's capabilities improve, the returns for simple queries gradually decrease, forcing the proposer to explore more complex reasoning paths to maximize returns.
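The alternating loop described above can be sketched as a runnable skeleton. Everything here is a toy stand-in under stated assumptions: `ToyProposer` and `ToySolver` are hypothetical classes that mimic the two roles with simple arithmetic questions, and `update_hrpo`/`update_grpo` are placeholder heuristics, not the real HRPO/GRPO updates.

```python
import random

class ToyProposer:
    """Toy stand-in for the proposer: emits addition questions whose
    operand range grows as its difficulty rises (hypothetical)."""
    def __init__(self):
        self.difficulty = 1

    def generate_question(self):
        a = random.randint(1, 10 * self.difficulty)
        b = random.randint(1, 10 * self.difficulty)
        return f"{a}+{b}", a + b  # (question, verifiable answer)

    def update_hrpo(self, qa_pairs, rewards):
        # Placeholder for the HRPO update: if the solver finds the
        # batch easy, push toward harder questions.
        if sum(rewards) / len(rewards) > 0.5:
            self.difficulty += 1

class ToySolver:
    """Toy stand-in for the solver: answers correctly with rising skill."""
    def __init__(self):
        self.skill = 0.6

    def answer(self, question):
        a, b = (int(x) for x in question.split("+"))
        return a + b if random.random() < self.skill else -1

    def update_grpo(self, qa_pairs):
        # Placeholder for the GRPO update on the synthetic data.
        self.skill = min(1.0, self.skill + 0.05)

def self_evolution_loop(proposer, solver, iterations=5, batch_size=8):
    """One data-free alternating optimization loop (illustrative skeleton)."""
    for _ in range(iterations):
        # 1. Proposer synthesizes a batch of question-answer pairs.
        qa_pairs = [proposer.generate_question() for _ in range(batch_size)]
        # 2. Solver attempts each question; correctness becomes the reward.
        preds = [solver.answer(q) for q, _ in qa_pairs]
        rewards = [float(p == gold) for p, (_, gold) in zip(preds, qa_pairs)]
        # 3. Proposer and solver are updated in alternation.
        proposer.update_hrpo(qa_pairs, rewards)
        solver.update_grpo(qa_pairs)
    return proposer, solver
```

The symbiotic pressure is visible even in this toy: as the solver's skill rises, easy batches stop paying off and the proposer escalates difficulty.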

Data-Free Evolution, Outperforming Supervised Methods with Data

To comprehensively evaluate Dr. Zero's search and reasoning capabilities, the experiments covered a range of open-domain question-answering scenarios, forming a broad benchmark suite.

This includes single-hop tasks like NQ (Natural Questions) and TriviaQA, which mainly test the model's precise retrieval and answering capabilities based on single facts; and multi-hop complex tasks like HotpotQA, MuSiQue, and 2WikiMQA, which require the model to perform multi-turn searches, information synthesis, and coherent reasoning, posing extremely high challenges to the agent's interaction and deep understanding capabilities.


Figure | Dr. Zero performance with different generated problem distributions.

Based on the above evaluation, the research team drew the following conclusions:

1. Performance is comparable to or even surpasses supervised baselines.

After multiple rounds of self-evolution, Dr. Zero's performance on several open-domain question-answering benchmarks is comparable to or better than fully supervised search-agent baselines (such as Search-R1) trained with human-annotated data; for example, it achieved up to a 14.1% improvement on some tasks. These results demonstrate that the performance attained through data-free evolution is reliable and robust.

2. Far surpasses other data-free baselines.

Compared to existing data-free methods (such as self-asking language model SQLM and self-evolving reasoning model R-Zero), Dr. Zero performed best in all tasks, with average performance surpassing SQLM and R-Zero by 39.9% and 27.3%, respectively. This is particularly evident in complex multi-hop tasks. Through its difficulty-guided reward-generated problems, Dr. Zero achieved an average performance improvement of 83.3% compared to the optimized R-Zero*, highlighting its unique advantage in promoting complex reasoning capabilities.

3. Significant scale effect, verifying framework scalability.

The research team also observed a clear model scale effect. Models with a 7B parameter scale performed particularly well on complex multi-hop reasoning datasets like 2WikiMQA, achieving significant relative improvement (7.67%). This indicates that the Dr. Zero framework has good scalability, and larger-scale models can more effectively utilize this self-evolution mechanism to handle more complex and intertwined search and reasoning tasks.

Author: Wang Yueran


