With the rapid development of MCP, Agent Skills, and various harnesses, large models can now invoke hundreds or thousands of external tools. However, they still fall short on multi-tool tasks that involve complex state and long-range interaction. A series of environment expansion methods attempt to replicate real-world interactive environments (such as booking systems and food delivery platforms), but they remain limited in both scale and authenticity. Moreover, no matter how many training environments are created, an agent facing a new interactive environment still struggles to generalize unless its training algorithm supports continual learning.
To this end, this paper proposes Agent-World: a universal agent training ground that combines "agent environment exploration" with "self-evolving training" to form a closed loop for the co-evolution of agents and environments.
Agent-World consists of two core modules:
(1) Intelligent Environment-Task Exploration: Through deep research agents, it autonomously mines environment databases from the internet around real-world environment themes, generating executable tools and verifiable tasks.
(2) Continuous Self-Evolving Training: It trains agents through multi-environment reinforcement learning, treats the synthetic environment as a natural training ground, automatically diagnoses the agent's capability weaknesses, and drives targeted environment/task expansion to achieve the agent's self-evolution.
Figure 1: Overview of Agent-World. The left figure shows the closed loop of co-evolution between agents and environments in Agent-World, and the right figure shows the curve of downstream performance versus environment expansion.
Ultimately, Agent-World built 1,978 environments and 19,822 tools, with tasks averaging more than 15 interaction turns. Experiments show that on 23 challenging benchmarks (including τ²-Bench, BFCL V4, MCP-Mark, ClawEval, SkillsBench, etc.), Agent-World-8B/14B consistently outperforms advanced environment expansion methods and strong open-source base models. Further analysis shows that agent performance scales with both environment diversity and the number of self-evolution rounds.
Paper Title: Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Project Homepage: https://agent-tars-world.github.io/-/
Agent-World has gained significant attention on X and reached second place on the Hugging Face Papers daily chart!
Agent-World: Scaling World Environments for Agent-Environment Co-Evolution!
1. Intelligent Environment-Task Mining: Automatically Mining Real-World Environments from the Web
Traditional environment synthesis methods either rely on direct LLM generation or are limited to finite open-source tool data. Agent-World takes an interesting approach: starting from real-world environment themes, it lets a deep research agent autonomously mine environments from the vast internet.
Figure 2: Intelligent Environment-Task Mining Process: includes an overview of the overall process (top) and a fine-grained display of each step (bottom).
(1) Intelligent Database Mining: Agent-World selects real MCP server data, open-source tool documents, industry requirement documents, etc., as thematic anchors (over 2,000). For each theme, a deep research agent equipped with four types of tools (search, browsing, code compiler, and file system) autonomously mines theme-related environment databases from web pages across the internet, then increases the scale and structural authenticity of each database through iterative data complexification.
(2) Tool Interface Generation and Verification: Agent-World further introduces a code agent to generate tool interfaces and unit test scripts for each environment. Through a triple-rule filter of "compilability, test accuracy, and minimum environment validity," it finally obtains a series of interactive environments containing real databases and executable tool sets.
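To make the triple-rule filter concrete, here is a minimal sketch of how such a gate could look. It is not the authors' implementation: the CandidateEnvironment type, the helper names, and both thresholds are hypothetical, and "minimum environment validity" is approximated here by a simple tool count.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical thresholds; the article does not state the actual values.
MIN_TEST_PASS_RATE = 1.0
MIN_TOOL_COUNT = 2


@dataclass
class CandidateEnvironment:
    name: str
    tool_source: str                          # generated Python source defining the tools
    unit_tests: list[Callable[[dict], bool]]  # each test receives the loaded tools


def load_tools(source: str) -> Optional[dict]:
    """Rule 1 (compilability): return the public callables, or None on failure.

    Note: exec() runs the generated code directly; a real pipeline would
    need a sandbox for untrusted code.
    """
    namespace: dict = {}
    try:
        exec(compile(source, "<generated_tools>", "exec"), namespace)
    except Exception:
        return None
    return {
        name: obj
        for name, obj in namespace.items()
        if callable(obj) and not name.startswith("_")
    }


def unit_test_pass_rate(tools: dict, tests: list) -> float:
    """Rule 2 (test accuracy): fraction of unit tests that pass without raising."""
    if not tests:
        return 0.0
    passed = 0
    for test in tests:
        try:
            passed += 1 if test(tools) else 0
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests)


def keep_environment(env: CandidateEnvironment) -> bool:
    """Apply the three rules in order and keep the environment only if all pass."""
    tools = load_tools(env.tool_source)
    if tools is None:
        return False
    if unit_test_pass_rate(tools, env.unit_tests) < MIN_TEST_PASS_RATE:
        return False
    # Rule 3 (minimum validity), approximated as "enough working tools".
    return len(tools) >= MIN_TOOL_COUNT


GOOD_SOURCE = '''
def add_item(db, key):
    db[key] = db.get(key, 0) + 1
    return db[key]

def count_item(db, key):
    return db.get(key, 0)
'''

good = CandidateEnvironment(
    name="toy_inventory",
    tool_source=GOOD_SOURCE,
    unit_tests=[
        lambda tools: tools["add_item"]({}, "apple") == 1,
        lambda tools: tools["count_item"]({"apple": 2}, "apple") == 2,
    ],
)
broken = CandidateEnvironment("broken", "def add_item(db, key) return 1", [])

print(keep_environment(good), keep_environment(broken))
```

Ordering the rules from cheapest to most expensive means that source which fails to compile is rejected before any unit test runs.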
(3) Hierarchical Environment Classification System: To support cross-environment task synthesis and layered evaluation, this work further structures the environment ecosystem. Through thematic clustering combined with large model and manual verification, Agent-World organizes the environments into a three-layer label taxonomy with 20 primary, 50 secondary, and 1,978 tertiary categories (as shown in the figure below).
Figure 3: Agent-World's hierarchical environment classification. The left figure shows 20 primary level environments, and the right figure shows the number of tertiary environments corresponding to the top 10 secondary environments.
(4) Verifiable Task Synthesis: Based on the high-quality environment ecosystem, Agent-World employs two complementary verifiable task synthesis strategies:
• Graph-based Task Synthesis: Builds a fully connected dependency graph for the tools in the environment, generates reasonable tool call sequences through random walks, and then "reverse-engineers" natural language questions from the chain, accompanied by a large model scoring rubric. This method excels at modeling sequential dependency logic.
• Programmatic Task Synthesis: Has the LLM directly generate a Python script that solves a problem requiring complex control flow, then derives the corresponding question from the script and provides executable verification code. This method can capture non-linear, complex reasoning.
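The graph-based strategy can be illustrated with a small random-walk sketch. The tool names and dependency edges below are invented for illustration, and the step that turns a sampled chain into a natural language question and rubric (done by an LLM in the article) is omitted.

```python
import random

# Hypothetical dependency graph: an edge A -> B means tool B can consume A's output.
TOOL_GRAPH = {
    "search_flights": ["get_flight_details", "book_flight"],
    "get_flight_details": ["book_flight"],
    "book_flight": ["send_confirmation", "cancel_booking"],
    "send_confirmation": [],
    "cancel_booking": [],
}


def sample_tool_chain(graph: dict, start: str, max_len: int, rng: random.Random) -> list:
    """Random walk over the dependency graph, never revisiting a tool."""
    chain = [start]
    while len(chain) < max_len:
        successors = [tool for tool in graph[chain[-1]] if tool not in chain]
        if not successors:
            break  # dead end: no unvisited tool depends on the current one
        chain.append(rng.choice(successors))
    return chain


rng = random.Random(0)
chain = sample_tool_chain(TOOL_GRAPH, "search_flights", max_len=4, rng=rng)
print(" -> ".join(chain))
```

Because every step follows a dependency edge, each sampled chain is a tool call sequence in which later calls can consume earlier outputs, which is what makes it a reasonable seed for a question written backwards from the chain.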
(5) Statistical Analysis of Synthesized Environments: The following figure provides detailed statistics on the distribution of environments and tasks. After multiple filtering stages, Agent-World settles on 1,978 environments and 19,822 tools, averaging more than 10 tools per environment, which gives it both considerable volume and balanced granularity. The environment databases span multiple file formats, including JSON, CSV, SQL, HTML, TeX, and YAML, and are highly heterogeneous in both structure and semantics.
The synthesized tasks are predominantly "long-range multi-turn," averaging more than 15 interaction turns, imposing sustained pressure on planning, memory, and error recovery. In terms of difficulty, even Doubao-Seed 2.0 cannot correctly complete a significant proportion of tasks under the Pass@10 setting, reflecting the overall challenging nature of the tasks.
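The article does not say how Pass@10 is computed. A common choice, assumed here, is the unbiased pass@k estimator from the code-generation evaluation literature: given n sampled attempts of which c are correct, it estimates the probability that at least one of k randomly chosen attempts succeeds.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures, so any k attempts include a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# With exactly 10 attempts per task, Pass@10 reduces to "did any attempt succeed".
print(pass_at_k(n=10, c=0, k=10))  # task never solved
print(pass_at_k(n=10, c=1, k=10))  # task solved at least once
```

Under this reading, a task counts as failed at Pass@10 only if none of the ten attempts succeeds.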
In summary, the static statistics jointly verify the significant advantages of Agent-World synthetic interactive environments in terms of diversity, heterogeneity, and complexity across four dimensions: scale, format, interaction length, and difficulty.
Figure 4: Six-dimensional statistical analysis of Agent-World's synthesized environments and tasks.
2. Continuous Self-Evolving Agent Training: Enabling Co-Evolution of Agent and Environment
After building a scalable, realistic environment ecosystem, Agent-World transforms it into a dynamic agent training ground (see figure below).
Figure 5: Continuous Self-Evolving Agent Training Framework. The upper part shows multi-environment reinforcement learning training, and the lower part shows the diagnosis and co-evolution cycle.
(1) Multi-Environment Reinforcement Learning: Unlike traditional agent RL, training here unfolds within a closed "Agent-Tool-Database" interaction loop. The agent performs rollouts in different environments, and each tool call can also rewrite the state of the underlying database, so the learning signal is grounded in an executable world environment. Algorithmically, Agent-World adopts the widely used GRPO to maximize the verifiable rewards described next, steadily improving agent performance.
The reward side is also differentiated by task type: Graph-synthesized tasks are scored item by item by the large model according to a verification rubric; programmatic tasks directly execute the verification script and award points based on the correctness of the final answer or state.
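The two reward paths and the group-relative step of GRPO can be sketched together as follows. This is an illustration under stated assumptions, not the paper's code: the Task type and the judge callable (standing in for the large model) are hypothetical, rubric scores are aggregated as the fraction of satisfied items, programmatic rewards are treated as binary, and the population standard deviation is used for normalization.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, Optional


@dataclass
class Task:
    kind: str                                           # "graph" or "programmatic"
    rubric: list[str] = field(default_factory=list)     # rubric items for graph tasks
    verifier: Optional[Callable[[dict], bool]] = None   # script for programmatic tasks


def compute_reward(task: Task, final_state: dict, judge) -> float:
    """Dispatch on task type; judge(item, state) -> bool stands in for the LLM."""
    if task.kind == "graph":
        if not task.rubric:
            return 0.0
        satisfied = sum(1 for item in task.rubric if judge(item, final_state))
        return satisfied / len(task.rubric)  # assumed aggregation: fraction met
    if task.kind == "programmatic":
        if task.verifier is None:
            raise ValueError("programmatic task needs a verifier")
        return 1.0 if task.verifier(final_state) else 0.0
    raise ValueError(f"unknown task kind: {task.kind}")


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """GRPO-style advantage: normalize each reward within its rollout group."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four rollouts of the same programmatic task, ending in different final states.
task = Task(kind="programmatic", verifier=lambda state: state.get("booked") is True)
final_states = [{"booked": True}, {}, {}, {"booked": True}]
rewards = [compute_reward(task, state, judge=None) for state in final_states]
print(rewards, group_relative_advantages(rewards))
```

Normalizing within the group means a rollout is rewarded for doing better than its siblings on the same task, so no separate value model is needed to provide a baseline.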
(2) Self-Evolving Agent Arena: The core of Agent-World lies in treating the entire environment ecosystem as a natural agent training arena. Training is not a one-off process but a multi-round iterative self-evolving process:
• Dynamic Evaluation Task Synthesis: After each training round, a batch of new environments is sampled evenly across the environment classification system from the arena's environment pool, and brand-new evaluation tasks are synthesized for them, so the agent is never re-tested on problems it has already practiced.
• Agent-Based Diagnosis: The current round's agent is evaluated on these new tasks. A diagnostic agent then analyzes its failure trajectories, error distribution, and environment meta-information, locates capability weaknesses (e.g., "errors in creating secondary headings in the Notion environment"), and outputs a ranking of weak environments together with a targeted task generation guide.
• Agent–Environment Co-Evolution: Based on the diagnostic results, more challenging training tasks are synthesized in the weak environments, and the corresponding environment databases are further complexified as needed. This batch of weakness-targeted data then drives the next round of continuous reinforcement learning.
The above process forms an interesting training flywheel: "Training improves agent → Evaluation exposes weaknesses → Diagnosis guides environment/task expansion → New data drives further agent evolution." This closed loop enables a true "co-evolution" of the agent and its training environment.
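The evaluation and diagnosis steps of this flywheel can be sketched in a simplified form. In the article, diagnosis is performed by an agent that reads failure trajectories; the version below replaces it with a plain failure-rate ranking, and the environment names, categories, and function names are all hypothetical.

```python
import random
from collections import defaultdict


def sample_evenly(env_to_category: dict, per_category: int, rng: random.Random) -> list:
    """Stratified sampling: draw up to per_category environments from each category."""
    by_category = defaultdict(list)
    for env, category in sorted(env_to_category.items()):
        by_category[category].append(env)
    sampled = []
    for category in sorted(by_category):
        envs = by_category[category]
        sampled.extend(rng.sample(envs, min(per_category, len(envs))))
    return sampled


def rank_weak_environments(results: list) -> list:
    """Rank environments by failure rate, highest first (ties broken by name)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for env, passed in results:
        totals[env] += 1
        if not passed:
            failures[env] += 1
    rates = {env: failures[env] / totals[env] for env in totals}
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical environment pool and evaluation outcomes as (environment, passed).
pool = {"notion": "productivity", "calendar": "productivity",
        "github": "dev", "maps": "travel"}
eval_envs = sample_evenly(pool, per_category=1, rng=random.Random(0))
outcomes = [("notion", False), ("notion", False),
            ("github", True), ("github", False), ("maps", True)]
print(eval_envs)
print(rank_weak_environments(outcomes))
```

The environments at the top of the ranking would be the ones to receive harder synthesized tasks in the next round.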
Experimental Results: 23 Benchmarks Validate Agent-World's Cross-Domain Agent Capabilities
Experimental Setup: To evaluate generalization thoroughly, the experiments span five domain categories covering a total of 23 benchmarks:
• Agent Tool Use: MCP-Mark, BFCL V4, τ²-Bench
• Advanced AI Assistants: SkillsBench, ARC-AGI-2, ClawEval
• General Reasoning: MATH500, GSM8K, MATH, AIME24/25, KOR-Bench, OlympiadBench, etc.
• Deep Search and Software Engineering: WebWalkerQA, SWE-Bench, Terminal-Bench, GAIA, HLE, etc.
• Knowledge and MCP: MMLU, SuperGPQA, MCP-Universe, etc.
The comparison baselines include advanced closed-source models (GPT-5.2 High, Claude Sonnet-4.5, Seed2.0, etc.), strong open-source base models (DeepSeek-V3.2-685B, Qwen3-235B-A22B), and advanced environment expansion methods (EnvScaler, AWM, ScaleEnv).
1. Outstanding Performance on Core Agent Tasks
Table 1: Results on core agent tool use benchmarks.
As shown in the table above, on three of the most challenging agent tool use benchmarks today (MCP-Mark, BFCL V4, and τ²-Bench), Agent-World-8B and 14B consistently outperform all open-source environment expansion baselines. These three benchmarks respectively assess multi-turn stateful interactions, cross-domain tool calls, and long-range dialogues. Even cutting-edge closed-source models only score around 50 on MCP-Mark.
More interestingly, Agent-World-14B achieves 55.8% on BFCL V4, surpassing DeepSeek-V3.2-685B (54.1%), a model with 685B parameters. This indicates that more realistic executable environments and verifiable rewards align more effectively with complex agent interaction patterns than sheer parameter count does.
2. Remarkable Long-Range Agent Reasoning Capabilities
Figure 6: Radar chart showing the generalization performance of Agent-World-8B across three capability groups: general reasoning, agent search & coding, and knowledge & MCP, comprehensively leading the baselines.
As shown in the figure above, when evaluation is extended to 17 benchmarks covering long-range reasoning, deep search, software engineering, and knowledge applications, Agent-World-8B still maintains a lead in all dimensions. General reasoning (MATH500, AIME, OlympiadBench, etc.) did not degrade due to agent-related training but instead saw slight improvements. The advantages are particularly pronounced in ultra-long-turn tasks in deep search and software engineering domains (GAIA, SWE-Bench, Terminal-Bench, etc.).
Performance on the knowledge and MCP benchmarks is also strong, suggesting that the skills Agent-World acquires through environment training are transferable and composable rather than overfit to specific benchmarks.
3. Significant Improvement in Advanced AI Assistant Scenarios
Figure 7: The Agent-World model family demonstrates superior performance on cutting-edge AI assistant benchmarks such as SkillsBench, ARC-AGI-2, and ClawEval.
As shown in the figure above, Agent-World also excels on the three latest benchmarks—SkillsBench, ARC-AGI-2, and ClawEval—which demand long-range planning and real-world execution. Moreover, the performance improvement from the 8B to 14B model scale is stable, whereas other baseline models exhibit capability fluctuations.
Quantitative Analysis: How Do Environment Scale and Self-Evolution Drive Performance?
Beyond the main experimental results, the authors also conducted a series of quantitative analyses.
1. Training Environment Scale Scaling Analysis
Figure 8: Downstream agent performance significantly improves with the increase in the number of training environments, presenting a clear scaling law.
As the number of training environments gradually increases (from 0 to nearly 2,000), agent performance shows a clear positive correlation with the number of environments. In the early stage (10 to 100 environments), the performance improvement is rapid, indicating that covering key interaction patterns is crucial; later, the improvement slows but continues, suggesting that larger-scale environments bring more fine-grained capability enhancements.
2. Analysis of Self-Evolution Rounds
Table 2: The effect of continuous autonomous evolution.
The study verified the effectiveness of the self-evolving arena closed loop. Both the Agent-World model itself and the baseline EnvScaler-8B model achieved consistent performance gains across multiple benchmarks after two cycles of "evaluation-diagnosis-targeted training." This indicates that treating the environment as a training ground to drive targeted data synthesis is an effective mechanism for continuously enhancing an agent's environmental generalization capabilities.
3. Analysis of Multi-Environment Reinforcement Learning Curves
Figure 9: Display of multi-environment agent reinforcement learning curves.
Although Agent-World conducts reinforcement learning on complex, mixed environments and diverse synthesized tasks (graph-based and programmatic), its reward score steadily increases with training steps, while policy entropy remains relatively stable or even grows. This indicates that the agent maintains good exploration properties while adapting to new environments, without prematurely falling into locally optimal "ossified" behaviors.
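For readers unfamiliar with the entropy signal mentioned above, the following sketch computes the Shannon entropy of a single next-token distribution; in practice this quantity is typically averaged over the tokens of sampled rollouts. The logits are made-up values, not measurements from the paper.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


exploratory = softmax([0.0, 0.0, 0.0, 0.0])   # uniform over 4 actions
collapsed = softmax([10.0, 0.0, 0.0, 0.0])    # almost always the same action
print(entropy(exploratory), entropy(collapsed))
```

A policy whose entropy collapses toward zero has stopped exploring, which is why stable or rising entropy alongside rising reward is read as a healthy training signal.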
Summary and Outlook
Agent-World aims to achieve continuous co-evolution of agents and their environments by scaling real-world environments. As the authors of this paper, we would also like to share a few insights from this research with other researchers working on general agent training:
Authenticity is the Cornerstone of Environmental Expansion: Building highly authentic, logically verifiable environments is a prerequisite for training general agents. Agent-World uses an agent-based pipeline to interface real themes with vast network information, automatically mining data and tools. We believe this is only the beginning, and more automated, realistic environment synthesis paradigms will emerge in the future.
Evolution is the Engine of Environmental Training: Once a large-scale environmental ecosystem is built, a single static training session is insufficient and wastes environments that were costly to construct. Agent-World has built a closed-loop system that automatically diagnoses weaknesses and generates targeted challenges, allowing the agent and the environment to co-evolve. How to deeply couple the environmental ecosystem with training algorithms remains a long road, but one worth betting on.
Environment/Task Scalability Leads to Generalization: In Agent-World, we observe a clear scaling relationship between agent performance and three factors: environment scale, evolution rounds, and task difficulty. This suggests that future work should scale up all three together, with more diverse environments, more complex tasks, and more rounds of evolution. This may be a key to achieving general agent interaction capabilities.
Author Introduction: The first author of this paper is Guanting Dong, a second-year Ph.D. student at Gaoling School of Artificial Intelligence, Renmin University of China, under the supervision of Prof. Zhicheng Dou and Prof. Jirong Wen. His main research direction is general agent training. He has published over 10 papers as first or co-first author at top international conferences such as ICLR and ACL. Representative works include ARPO, AUTOIF, Search-o1, Webthinker, FlashRAG, etc. His Google Scholar citations exceed 10,000, and his personal GitHub projects have over 8,000 stars. He has interned at foundational model teams such as ByteDance Seed and Alibaba Tongyi Qianwen. He has received honors including the first Tencent Qingyun Scholarship, the National Scholarship, and the Outstanding Graduate of Beijing. The corresponding authors of this paper are Prof. Zhicheng Dou from Renmin University of China and Wanjun Zhong from ByteDance Seed.