Hello everyone, I'm PaperAgent, not an Agent!
Recently, a joint team from Xiamen University, Hong Kong Polytechnic University, University of Maryland, Washington University in St. Louis, UIUC, Singapore Management University, and other institutions released a systematic survey on Self-Evolving Agents.
This survey attempts to answer a question that is becoming increasingly important:
When an LLM Agent is no longer just trained on human-annotated data, but can actively explore, obtain feedback, update strategies, and accumulate experience, how should we understand its "self-evolution"?
From 2022 to 2026, agent research has moved rapidly from enhancing model capability, to obtaining feedback and accumulating experience through environmental interaction, and further toward a new paradigm of co-evolution in which the model and the environment drive each other. An increasingly clear technical throughline is emerging:
An agent's capability boundary is determined not only by model parameters but also by how it interacts with its environment and continuously obtains usable learning signals from those interactions.
1. Why Do We Need Self-Evolving Agents?
Traditional agent systems mostly rely on a "two-stage paradigm":
- Pre-Training: Learning general world knowledge through large-scale corpora;
- Post-Training: Teaching the model specific agentic capabilities through SFT, RLHF, RLAIF, or task-specific data.
This paradigm has greatly advanced the development of LLM Agents, but it also has an increasingly obvious bottleneck:
The more complex the agent, the stronger its dependence on high-quality supervisory signals; however, high-quality human annotation, manual rewards, and expert feedback are difficult to scale infinitely.
For simple Q&A tasks, humans can directly write answers; for complex Agent tasks, humans must not only judge the final answer but also understand multi-step planning, tool use, environmental feedback, error recovery, and long-term state changes. The supervision cost rises sharply.
More critically, if an agent always relies on humans to provide learning signals, its capability ceiling can easily be constrained by human experience, annotation scale, and predefined task boundaries.
Therefore, the core motivation for Self-Evolving Agents is:
To shift agents from passively accepting human supervision to proactively constructing problems, exploring environments, generating feedback, correcting strategies, and continuously improving in a closed loop.
This survey summarizes Self-Evolving Agents with two core characteristics:
- Strong autonomy with minimal human supervision: the agent depends as little as possible on external manual supervision;
- Active exploration through interaction: the agent actively explores and improves through internal reasoning or interaction with an external environment.
In other words, a self-evolving agent is no longer just a "trained model" but more like a system that can participate in its own growth process.
2. Unified Taxonomy: Three Routes of Self-Evolution
The most important contribution of this survey is the proposal of a unified taxonomy, dividing Self-Evolving Agents into three major paradigms:
- Model-Centric Self-Evolution;
- Environment-Centric Self-Evolution;
- Model-Environment Co-Evolution.
Figure 2 presents the core classification framework of the paper. The key to this framework is that it organizes the entire field not simply by task type or technical module, but by "where the evolution happens":
- If evolution primarily occurs inside the model, it is Model-Centric;
- If evolution comes from the model's utilization of external knowledge, experience, tools, and structures, it is Environment-Centric;
- If both the model and the environment are continuously changing and mutually pushing each other to become stronger, it is Model-Environment Co-Evolution.
The importance of this perspective lies in its unification of previously scattered research directions into a progressive framework:
From capability enhancement driven by internal model computation and parameter updates, to experience accumulation driven by environmental interaction and feedback, and finally to the mutual adaptation and co-evolution of the model and environment.
Figure 3 expands this into the complete technical taxonomy of Self-Evolving Agents, systematically organizing methods under each evolutionary path, from internal capability enhancement and external environmental interaction to model-environment co-evolution. It can essentially serve as a technical map of current research on Self-Evolving Agents.
3. Model-Centric Self-Evolution: The Model Gets Stronger on Its Own First
The first route is Model-Centric Self-Evolution.
The basic assumption of this type of method is that the model already contains a large amount of latent capability that has simply not been fully activated. Self-evolution can therefore start from the model itself, enhancing capabilities through more inference-time computation, better search strategies, or self-generated training data.
This route can be further divided into two categories:
3.1 Inference-Based Evolution: Self-Evolution at Inference Time
These methods do not update model parameters but instead invest more computational resources during a single inference process to let the model "think more thoroughly." Representative directions include:
- Parallel Sampling: Sampling multiple reasoning paths in parallel, then selecting answers through voting, ranking, or consistency checks;
- Sequential Self-Correction: Generating, reflecting, and correcting over multiple rounds;
- Structured Reasoning: Organizing the reasoning process into structures like trees or graphs.
Their essence is:
Exchanging more test-time compute for a more reliable single output.
But the problem is also obvious: this improvement is usually temporary. After reasoning ends, the model parameters remain unchanged, and the capability is not truly internalized.
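To make the Parallel Sampling idea concrete, here is a minimal sketch of sampling several answers and keeping the one most paths agree on. The `toy_sampler` function is a hypothetical stand-in for a stochastic LLM call and is not taken from the survey; a real system would replace it with an actual model query.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency(sample: Callable[[str], str], question: str, n: int = 5) -> str:
    """Draw n independent answers and keep the one most paths agree on."""
    return majority_vote([sample(question) for _ in range(n)])


def toy_sampler(question: str) -> str:
    # Hypothetical stand-in for an LLM: returns "42" about 70% of the time.
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])


if __name__ == "__main__":
    print(self_consistency(toy_sampler, "What is 6 * 7?", n=11))
```

Note that nothing is stored between calls: the extra compute improves this one answer, and the next question starts from the same unchanged model, which is exactly the limitation described above.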
3.2 Training-Based Evolution: Self-Evolution at Training Time
In contrast, Training-Based Evolution pursues long-term capability improvement. The model generates data, filters data, evaluates data, and writes new capabilities back into parameters through SFT or RL.
This survey divides it into two paths:
- Synthesis-Driven Offline Self-Evolving: Generating synthetic data offline for training;
- Exploration-Driven Online Self-Evolving: Online exploration, real-time feedback, and continuous policy updates.
Figure 4 illustrates the differences between the two. Offline synthesis methods are more like "the model creating its own textbook": they are efficient to bootstrap but easily limited by the initial model's capabilities. Online exploration methods are more like "the model constantly learning through trial and error": they can discover new strategies but place higher demands on feedback quality, training stability, and exploration efficiency.
This is why works like R-Zero, Absolute Zero, and Agent0 have recently gained attention: they are not satisfied with letting the model regurgitate existing knowledge but try to enable the model to obtain new training signals through self-play, environmental feedback, or task exploration.
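The offline "generate, filter, train" loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the procedure of any specific paper: `propose_task`, `noisy_solve`, and `verify` are hypothetical stand-ins for a task generator, the model, and a verifier, and the final SFT or RL step is left out.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, response)


def synthesize_dataset(
    propose_task: Callable[[], str],
    solve: Callable[[str], str],
    verify: Callable[[str, str], bool],
    n_tasks: int,
    samples_per_task: int = 4,
) -> List[Example]:
    """Generate tasks, sample candidate solutions, keep only verified ones."""
    kept: List[Example] = []
    for _ in range(n_tasks):
        task = propose_task()
        for _ in range(samples_per_task):
            answer = solve(task)
            if verify(task, answer):
                kept.append((task, answer))
                break  # one verified sample per task is enough for this sketch
    return kept


# Toy stand-ins (assumptions): single-digit addition and a noisy solver.
rng = random.Random(0)


def propose_task() -> str:
    return f"{rng.randint(1, 9)}+{rng.randint(1, 9)}"


def noisy_solve(task: str) -> str:
    a, b = map(int, task.split("+"))
    return str(a + b + rng.choice([0, 0, 1]))  # off by one about a third of the time


def verify(task: str, answer: str) -> bool:
    a, b = map(int, task.split("+"))
    return int(answer) == a + b


dataset = synthesize_dataset(propose_task, noisy_solve, verify, n_tasks=20)
```

The filtered `dataset` would then be fed to a fine-tuning step. A task the solver never gets right never enters the dataset, which is one way to see why offline synthesis tends to be bounded by the initial model's ability.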
4. Environment-Centric Self-Evolution: The Environment Becomes a Source of Capability
The second route is Environment-Centric Self-Evolution.
If Model-Centric methods mainly focus on how the model internally gets stronger, Environment-Centric methods emphasize that:
The agent's evolution comes not only from parameter updates but also from how it utilizes external knowledge, experience, tools, memory, and multi-agent structures.
This survey divides environment-centric self-evolution into four directions:
- Static Knowledge Evolution;
- Dynamic Experience Evolution;
- Modular Architecture Evolution;
- Agentic Topology Evolution.
4.1 Static Knowledge Evolution: From Answering Questions to Proactively Seeking Knowledge
Traditional RAG usually involves "the user asks a question, and the system retrieves relevant documents." However, Agentic RAG and Deep Research go further: the agent judges what knowledge it lacks, proactively generates queries, browses the web, collects evidence, integrates that evidence through reasoning, and ultimately generates a structured report.
This means retrieval is no longer just a front-end module placed before the model but becomes an active cognitive behavior within the agent's reasoning chain.
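The difference from one-shot retrieval can be sketched as a loop in which the agent repeatedly checks what it still lacks and queries for it. Everything here is a simplifying assumption: `CORPUS` is a toy in-memory knowledge source, `search` is a hypothetical lookup, and the list of needed facts is given up front, whereas a real agent would infer its gaps with the model itself.

```python
from typing import Dict, List, Optional

# Hypothetical toy knowledge source standing in for web search or a document index.
CORPUS: Dict[str, str] = {
    "boiling point of water at sea level": "100 C",
    "freezing point of water at sea level": "0 C",
}


def search(query: str) -> Optional[str]:
    """Toy retrieval: exact-match lookup in the corpus."""
    return CORPUS.get(query)


def agentic_retrieve(needed: List[str], max_steps: int = 5) -> Dict[str, str]:
    """Loop: find a knowledge gap, issue a query for it, store the evidence."""
    evidence: Dict[str, str] = {}
    for _ in range(max_steps):
        gaps = [fact for fact in needed if fact not in evidence]
        if not gaps:
            break  # the agent judges that it has enough to answer
        query = gaps[0]
        result = search(query)
        evidence[query] = result if result is not None else "NOT FOUND"
    return evidence


evidence = agentic_retrieve(
    ["boiling point of water at sea level", "freezing point of water at sea level"]
)
```

The point of the sketch is the control flow: retrieval happens inside the reasoning loop and is driven by the agent's own assessment of what is missing, with a step budget to keep the loop bounded.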
4.2 Dynamic Experience Evolution: From Knowledge to Experience
Knowledge addresses "what is," while experience addresses "how to do it."
Many agent tasks fail not from a lack of knowledge, but from a lack of experience:
- Which tool-calling sequence is more stable?
- How should specific errors be recovered from?
- Which historical failures can guide current decisions?
- Which workflows can be reused for new tasks?
Therefore, Dynamic Experience Evolution focuses on how to extract reusable experiences from historical trajectories, success cases, failure feedback, and execution logs.
Figure 5 compares Static Knowledge Evolution and Dynamic Experience Evolution side by side. The former is more suitable for knowledge-intensive tasks, such as Q&A, search, and research; the latter is more suitable for logic-intensive, long-horizon planning, multi-turn interaction, and embodied tasks, as these tasks depend more on transferable behavioral experience.
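As a rough illustration of turning execution logs into reusable experience, the sketch below keeps, for each task type, the tool-calling sequence with the best observed success rate. The log format and the task and tool names are hypothetical, chosen only for this example; real systems typically distill richer artifacts such as natural-language lessons or workflows.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (task type, tool sequence, succeeded): an assumed, simplified log format.
Trajectory = Tuple[str, Tuple[str, ...], bool]


def distill_experience(logs: List[Trajectory]) -> Dict[str, Tuple[str, ...]]:
    """For each task type, keep the tool sequence with the best success rate."""
    # task type -> tool sequence -> [successes, attempts]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for task, tools, ok in logs:
        stats[task][tools][0] += int(ok)
        stats[task][tools][1] += 1
    return {
        task: max(seqs, key=lambda s: seqs[s][0] / seqs[s][1])
        for task, seqs in stats.items()
    }


logs: List[Trajectory] = [
    ("book_flight", ("search", "pay"), False),
    ("book_flight", ("search", "check_seat", "pay"), True),
    ("book_flight", ("search", "pay"), True),
    ("book_flight", ("search", "check_seat", "pay"), True),
]
playbook = distill_experience(logs)
```

Here the agent would learn that checking seat availability before paying is the more stable sequence for this task type, which answers the first question in the list above without changing any model parameters.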
4.3 Modular Architecture Evolution: Memory, Tools, and Interfaces Must Also Evolve
An agent's interaction with the environment is not direct; it is mediated by a series of modules, such as:
- Memory Module;
- Tool Module;
- Interaction Interface;
- Protocol;
- Skill Library.
These modules themselves can also evolve.
For example, Memory is no longer just a vector database but can be a system that actively decides what to retain, forget, merge, rewrite, and route. Tools are not just predefined APIs but can be automatically created, composed, and maintained by the agent. Interaction Interfaces can also be designed to be easier for the model to understand and operate, thereby improving the agent's stability.
This indicates that the agent's capability improvement is not just about a "stronger model" but also about "a system structure that the model can better leverage."
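A memory that "decides" rather than merely stores can be sketched in a few lines. The class below is a hypothetical toy, not a design from the survey: it merges new notes into existing entries and forgets the least-used entry when full, while rewriting and routing are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EvolvingMemory:
    """Toy memory with merge-on-write and least-used eviction."""

    capacity: int = 3
    notes: Dict[str, str] = field(default_factory=dict)
    hits: Dict[str, int] = field(default_factory=dict)

    def write(self, key: str, note: str) -> None:
        if key in self.notes:
            # Merge: extend the existing entry instead of duplicating it.
            if note not in self.notes[key]:
                self.notes[key] += " | " + note
            return
        if len(self.notes) >= self.capacity:
            # Forget: evict the entry that has been read least often.
            victim = min(self.notes, key=lambda k: self.hits[k])
            del self.notes[victim]
            del self.hits[victim]
        # Retain: store the new entry with a fresh usage counter.
        self.notes[key] = note
        self.hits[key] = 0

    def read(self, key: str) -> Optional[str]:
        if key in self.notes:
            self.hits[key] += 1
            return self.notes[key]
        return None


memory = EvolvingMemory(capacity=2)
memory.write("api_x", "needs auth header")
memory.write("api_y", "rate limited")
memory.read("api_x")                       # api_x is now the more useful entry
memory.write("api_x", "retry on 503")      # merged into the existing note
memory.write("api_z", "returns XML")       # memory is full, so api_y is forgotten
```

Even this small policy changes what the agent sees on later tasks, which is the sense in which the module itself evolves.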
4.4 Agentic Topology Evolution: Multi-Agent Structures Evolving Themselves
Multi-agent systems have often relied on manually designed roles and processes, such as planner, executor, critic, reviewer, etc.
But in complex tasks, fixed processes may not be optimal. Therefore, Agentic Topology Evolution studies how the communication structure, role assignment, team size, and collaborative topology of multi-agent systems can be automatically searched for or dynamically adjusted.
The core question of such methods is:
Can the organizational form of a multi-agent system also become an object that is learnable, optimizable, and evolvable?
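One simple way to treat the organizational form as an optimizable object is to search over candidate communication graphs. The sketch below exhaustively scores every subset of candidate edges; the role names, the `USEFUL` benefit table, and `COST_PER_EDGE` are invented for illustration, and in practice the score would come from measured task performance and the search would need to be far more efficient than brute force.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]  # (sender, receiver)


def search_topology(
    edges: Iterable[Edge], score: Callable[[FrozenSet[Edge]], float]
) -> FrozenSet[Edge]:
    """Score every subset of candidate edges and return the best one."""
    pool = list(edges)
    candidates = (
        frozenset(subset)
        for size in range(len(pool) + 1)
        for subset in combinations(pool, size)
    )
    return max(candidates, key=score)


# Hypothetical scoring model: each edge has a benefit and a fixed communication cost.
USEFUL = {
    ("planner", "executor"): 1.0,
    ("executor", "critic"): 0.8,
    ("critic", "planner"): 0.5,
}
COST_PER_EDGE = 0.6


def toy_score(topology: FrozenSet[Edge]) -> float:
    return sum(USEFUL.get(edge, 0.0) for edge in topology) - COST_PER_EDGE * len(topology)


candidate_edges = list(USEFUL) + [("critic", "executor")]
best = search_topology(candidate_edges, toy_score)
```

Under these assumed numbers, the search keeps only the edges whose benefit exceeds their communication cost, so the resulting team structure is discovered rather than hand-designed.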
5. Model-Environment Co-Evolution: A Key Future Direction
The third route, and the future direction most emphasized by this survey, is Model-Environment Co-Evolution.
The first two categories have their own limitations:
- Model-Centric methods can easily lack external validation, potentially leading to error accumulation, self-reinforcing hallucinations, and overestimation of high-variance trajectories;
- While Environment-Centric methods introduce external knowledge and feedback, many environments are still static, single-task, and non-scalable.
Therefore, a more ideal direction is:
Not just the model adapting to the environment, but the environment also changing as the model's capabilities change.
Figure 6 summarizes the advantages of Model-Environment Co-Evolution: The environment can dynamically adjust difficulty based on the agent's capabilities, provide targeted feedback on demand, and expand into a multi-task, verifiable, and sustainably growing training ground.
This route contains two core directions:
5.1 Multi-Agent Policy Co-Evolution
In multi-agent scenarios, the environment itself can be composed of other agents. Collaboration, competition, evaluation, and communication between agents together form a dynamic learning setting.
For example, multiple agents can provide feedback to each other through peer evaluation, or jointly optimize policies through multi-agent reinforcement learning. At this point, the environment is no longer a static backdrop but is jointly composed of other learning agents.
5.2 Environment Training
Another route is to directly train or generate environments.
An ideal environment should possess several characteristics:
- Able to provide verifiable feedback;
- Able to automatically adjust difficulty based on the agent's capability;
- Able to generate diverse tasks;
- Able to support long-term, open-ended exploration.
Works such as Reasoning Gym, AgentGym, and Agent-World are developing in this direction.
This is also one of the paper's key conclusions:
The core challenge for future Self-Evolving Agents is not just training stronger agents, but designing environments that can grow alongside the agents.
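To illustrate two of the characteristics of an ideal environment listed above, verifiable feedback and capability-based difficulty adjustment, here is a minimal toy environment. It is an assumption-laden sketch and not the design of Reasoning Gym, AgentGym, or Agent-World: the task family (integer addition), the window size, and the 0.8 / 0.2 thresholds are all arbitrary choices.

```python
import random
from typing import List, Tuple


class AdaptiveArithmeticEnv:
    """Toy environment: verifiable addition tasks whose difficulty tracks the agent."""

    def __init__(self, seed: int = 0, window: int = 5) -> None:
        self.rng = random.Random(seed)
        self.level = 1                  # number of digits per operand
        self.window = window            # outcomes to observe before adjusting
        self.recent: List[bool] = []    # outcomes since the last adjustment

    def new_task(self) -> Tuple[int, int]:
        high = 10 ** self.level - 1
        return self.rng.randint(0, high), self.rng.randint(0, high)

    def step(self, task: Tuple[int, int], answer: int) -> bool:
        correct = answer == task[0] + task[1]   # reward is checkable, not judged
        self.recent.append(correct)
        if len(self.recent) == self.window:
            rate = sum(self.recent) / self.window
            if rate >= 0.8:
                self.level += 1                 # agent is doing well: harder tasks
            elif rate <= 0.2 and self.level > 1:
                self.level -= 1                 # agent is struggling: easier tasks
            self.recent = []
        return correct


env = AdaptiveArithmeticEnv()
for _ in range(10):                 # a perfect agent pushes the difficulty up
    task = env.new_task()
    env.step(task, task[0] + task[1])
level_after_success = env.level
for _ in range(5):                  # a failing agent pulls it back down
    env.step(env.new_task(), -1)
level_after_failure = env.level
```

The agent and the environment form a feedback loop here: the agent's success rate changes the tasks it is given next, which is the basic mechanism behind an environment that grows alongside the agent.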
GitHub: https://github.com/XMUDeepLIT/Awesome-Self-Evolving-Agents