Abstract
Recent advances in LLM-based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing agent protocols (e.g., A2A and MCP) under-specify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce AUTOGENESIS PROTOCOL (AGP), a self-evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol-registered resources with explicit state, lifecycle, and versioned interfaces. Its Self-Evolution Protocol Layer (SEPL) specifies a closed-loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present AUTOGENESIS SYSTEM (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long-horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed-loop self-evolution.
1. Introduction
Recent advances in LLM-based agent systems have demonstrated significant potential in tackling complex, long-horizon tasks. However, static agent designs often prove insufficient when facing the diversity and stochasticity of real-world environments. To overcome this limitation, endowing agents with self-evolution capabilities—enabling them to automatically adjust strategies, refine instructions, and update tools based on environmental feedback—has emerged as a critical avenue for achieving robust autonomy. This transition from predefined execution to dynamic adaptation represents a fundamental shift in agentic system design.
Despite the growing interest in self-evolving agents, implementations remain largely fragmented and ad hoc. Existing systems often lack shared standards, rendering the evolution process neither composable nor auditable. Developers are frequently forced to rely on brittle glue code, leading to monolithic architectures that are difficult to maintain. Furthermore, without explicit lifecycle management and safe update interfaces, self-modification introduces significant risks of runtime instability. To address these issues, it is necessary to elevate development from ad hoc engineering practices to the protocol level, decoupling 'what evolves' from 'how evolution occurs' via a standardized framework to ensure modular, traceable, and safe evolution.
While protocols such as Anthropic's Model Context Protocol (MCP) and Google's Agent-to-Agent (A2A) have standardized connectivity, applying them directly to self-evolution scenarios presents a conceptual mismatch. These protocols are primarily designed to resolve connectivity challenges—specifically, model-tool invocation (MCP) or inter-agent communication (A2A). However, the core of self-evolution lies not in invocation, but in state mutation and management.
Existing connectivity protocols lack native support for entity Lifecycle and Version Lineage. In a closed-loop evolutionary system, if the creation, update, and destruction of components are not precisely defined, the optimizer cannot safely apply modifications. Moreover, the absence of version tracking and rollback mechanisms means that erroneous updates can lead to irrecoverable errors. Consequently, relying solely on communication protocols is insufficient; a novel protocol capable of managing the dynamics of mutation is required.
To bridge the gap from connectivity to evolution, a specialized protocol must address three essential problems:
- Decoupling: Resources such as prompts, tools, and memory must be abstracted from the agent's core logic, transforming them into passive, independently managed entities rather than tightly coupled code blocks.
- Safety & Auditability: Strict version control and rollback mechanisms must be introduced to ensure that every evolutionary step is traceable and reversible.
- Formalism: A set of standardized operators (e.g., reflect, propose, verify) needs to be defined to strictly govern the evolution process, converting heuristic text modifications into a rigorous control loop.
To address these challenges, we introduce AUTOGENESIS. Far from being merely a utility library, AUTOGENESIS is a two-layer protocol architecture designed to strictly decouple the evolutionary substrate from the evolutionary logic. Our core motivation is to standardize underlying resource representations, enabling the same optimization algorithms to be seamlessly applied across diverse agent components.
- Layer 1: Resource Substrate Protocol Layer (RSPL). This layer defines the substrate of evolution, modeling Prompts, Agents, Tools, Environments, and Memory as Protocol-registered Resources. RSPL endows these resources with explicit state, lifecycle, and versioned interfaces, rendering them standardized objects amenable to observation and manipulation.
- Layer 2: Self-Evolution Protocol Layer (SEPL). This layer establishes a closed-loop operator interface grounded in control theory. It defines atomic operations—Reflect, Select, Improve, Evaluate, and Commit—to formally execute the evolution cycle, ensuring that every self-modification is documented and adheres to strict safety constraints.
Building on this protocol, we present AUTOGENESIS AGENT, a reasoning-and-acting tool-calling agent. Instead of relying on hard-coded components, it dynamically instantiates, retrieves, and refines resources via protocol interfaces during execution. We evaluated this system on multiple challenging benchmarks, including GPQA, AIME, GAIA, and LeetCode. The results demonstrate that by leveraging standardized resource management and closed-loop evolution, AUTOGENESIS AGENT consistently achieves significant improvements over strong baselines.
The significance of this work extends beyond performance gains; it illustrates a potential shift from manual prompt engineering to automated protocol engineering. By equipping agents with standardized self-repair and evolution capabilities, AUTOGENESIS provides a foundational paradigm for building next-generation agent systems capable of sustained autonomous adaptation in complex environments.
2. Related Work
2.1. LLM-based Agent Systems and Tool Use
Recent progress in large language model (LLM) based agent systems has demonstrated their ability to address complex, long-horizon tasks that require multi-step reasoning and external tool interaction. In these systems, LLMs typically serve as centralized decision-making modules that interpret observations, decompose tasks, and invoke tools to affect the environment. Benchmarks such as GAIA have further highlighted the importance of structured tool use and planning capabilities in agent design.
Most existing agent frameworks adopt architectures in which prompts, tools, and memory are embedded as tightly coupled internal components. Tools are commonly treated as fixed functional modules that are manually curated and integrated into the agent pipeline. While effective for bounded tasks, this design limits systematic reuse and controlled adaptation of tools as task requirements evolve. In contrast, our approach models tools (including native scripts, MCP tools, and agent skills) as protocol-registered resources with explicit interfaces and state representations, enabling dynamic instantiation and controlled refinement during execution.
2.2. Connectivity and Interoperability Protocols
As agent-based systems grow in scale and complexity, several protocol-level efforts have emerged to standardize model-tool interaction and inter-agent communication. Anthropic's Model Context Protocol (MCP) provides a unified interface for connecting language models to external tools and data sources. Similarly, Google's Agent-to-Agent (A2A) protocol aims to standardize communication primitives that support collaboration among multiple agents.
These protocols primarily address interoperability at the level of invocation and message passing. They specify how agents and tools interact, but largely leave the internal state of agents and resources opaque. In particular, they do not define mechanisms for managing resource lifecycles, tracking version lineage, or constraining state mutations over time. As a result, while connectivity protocols simplify integration, they do not directly support the persistent state evolution required by self-modifying agent systems.
2.3. Self-Correction and Optimization Mechanisms
A parallel line of work investigates mechanisms that enable agents to improve their performance through self-correction and optimization. Methods such as TextGrad interpret natural language feedback as a signal analogous to gradients, enabling iterative updates to string-valued components such as prompts. Reinforcement learning based approaches have also been applied to agent improvement. Techniques including Reinforce++ and GRPO frame agent components as policies and use evaluation signals as rewards to guide optimization.
While these methods demonstrate that agent behaviors can be iteratively improved, they are typically applied within narrowly scoped settings and lack a shared abstraction for managing heterogeneous agent components. Updates are often applied directly to prompts or policies without explicit lifecycle control, version tracking, or rollback support. AUTOGENESIS provides a protocol-level abstraction that accommodates these optimization strategies by exposing agent components as standardized, evolvable resources and defining operator-level interfaces through which different optimization methods can be applied in a controlled manner.
2.4. Summary
Existing work on agent systems, interoperability protocols, and self-optimization has laid important foundations for autonomous behavior. However, these efforts do not provide a unified protocol for managing the persistent state evolution of agent-internal resources. In particular, current connectivity protocols emphasize interaction but do not address lifecycle management or versioned state mutation. AUTOGENESIS addresses this gap by introducing a two-layer protocol architecture that separates the definition of evolvable resources from the mechanisms that govern their evolution, enabling modular, traceable, and auditable self-evolution in multi-agent systems.
3. Autogenesis
Despite growing interest in self-evolving agents, most systems remain engineered in an ad hoc manner and lack a shared protocol standard that makes evolution composable, auditable, and interoperable. We introduce AGP, a two-layer self-evolution protocol. The Resource Substrate Protocol Layer (RSPL) specifies the evolvable substrate, namely which resources may change and how they are represented, versioned, and accessed. The Self-Evolution Protocol Layer (SEPL) specifies the evolution logic, namely how updates are proposed, assessed, and committed through a safe operator interface. Inspired by interface standardization efforts in agent tooling, this separation cleanly decouples what evolves from how evolution occurs, enabling modularity, traceability, and safety-preserving evolution across components.
3.1. Layer 1: Resource Substrate Protocol Layer
The Resource Substrate Protocol Layer (RSPL) defines the evolvable substrate as a set of protocol-registered resources with explicit state, lifecycle, and version lineage. In this paper, these resources comprise (i) instructions (Prompt), (ii) decision policies (Agent), (iii) actuation interfaces (Tool), which encompass native tool scripts, MCP tools, and agent skills, (iv) task/world dynamics (Environment), and (v) persistent state (Memory). Crucially, resources in RSPL are passive: they encapsulate no optimization logic and cannot self-modify; all observations and state transitions occur only through controlled, interface-mediated operations invoked by higher layers.
3.1.1. CORE ENTITIES
We focus on these five entity types as a minimal yet expressive substrate for agentic systems. This choice is not intended to be exhaustive, but rather to identify a common denominator across modern agent stacks and provide a uniform target space on which SEPL can operate.
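Definition 3.1 (Resource Entity). A resource entity of type $\tau$ and its type-level collection can be represented as: $e_{\tau,i} = (n_{\tau,i}, d_{\tau,i}, \phi_{\tau,i}, b_{\tau,i}, m_{\tau,i})$ and $\mathcal{E}_\tau = \{e_{\tau,i}\}_{i \in \mathcal{I}_\tau}$, where $\mathcal{T}$ denotes the set of RSPL entity types, $\tau \in \mathcal{T}$ indexes the entity type, $\mathcal{I}_\tau$ is the index set of resource instances of type $\tau$, and $i \in \mathcal{I}_\tau$ indexes an individual instance. Here $n_{\tau,i}$ is a unique resource name, $d_{\tau,i}$ is a short description, $\phi_{\tau,i}$ is an input-to-output mapping, $b_{\tau,i}$ is the trainable marker that indicates whether the resource is evolvable, and $m_{\tau,i}$ is an auxiliary metadata dictionary.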
A key motivation for making prompt, tool, and memory explicit RSPL resources is decoupling. Many agent systems package prompts, tools, and memory as internal components of an agent, which entangles agent logic with task-specific instructions and capability bundles, increasing maintenance and limiting transfer. By externalizing them as first-class, versioned resources with standardized interfaces, the same tool-calling agent policy can be paired with different prompts and tool sets, and deployed unchanged across tasks and environments.
To support resource registration, unified management, and instantiation, RSPL stores a serializable registration record for each resource instance.
Definition 3.2 (Resource Registration Record). A resource registration record and its type-level collection can be represented as: $r_{\tau,i} = (e_{\tau,i}, v_{\tau,i}, c_{\tau,i}, p_{\tau,i}, x_{\tau,i})$ and $\mathcal{C}_\tau = \{r_{\tau,i}\}_{i \in \mathcal{I}_\tau}$, where $\tau$ indexes the entity type and $i$ indexes an individual instance. Here $e_{\tau,i}$ is the resource entity tuple defined in Definition 3.1, $v_{\tau,i}$ is a version string, $c_{\tau,i}$ is an implementation descriptor (e.g., import path, class definition, or source-code string), $p_{\tau,i}$ are instantiation parameters (e.g., constructor arguments), and $x_{\tau,i}$ is a set of exported representations used by LLMs to interact with the resource (e.g., function-calling schema, natural-language text, and structured argument schema).
Definition 3.3 (Protocol-registered resource). For each entity type $\tau$, let $\mathcal{R}_\tau$ denote the type-specific registry of protocol-registered resources, and let $\mathcal{R} = \{\mathcal{R}_\tau\}_{\tau \in \mathcal{T}}$ denote the global registry. RSPL binds each entity type to a dedicated context manager $M_\tau$ and a server-exposed interface $S_\tau$. We represent the type-level registered resource as $\mathcal{R}_\tau = (\mathcal{C}_\tau, M_\tau, S_\tau)$ with $\mathcal{C}_\tau = \{r_{\tau,i}\}_{i \in \mathcal{I}_\tau}$, where each $r_{\tau,i}$ is a registration record as in Definition 3.2. The context manager $M_\tau$ maintains the collection $\mathcal{C}_\tau$ and the version lineage for type $\tau$, and implements lifecycle and update operations over these records; the server-exposed interface $S_\tau$ encapsulates $M_\tau$ and exposes a unified external interface by delegating requests to the corresponding context-manager routines.
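As an illustration of Definitions 3.1 and 3.2, the following minimal sketch models a resource entity and its registration record as Python dataclasses. All class and field names (ResourceEntity, RegistrationRecord, init_params, exports, to_json) are hypothetical choices made for this example and are not taken from the AGP implementation.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ResourceEntity:
    """Hypothetical stand-in for the resource entity tuple of Definition 3.1."""
    name: str                      # unique resource name
    description: str               # short description
    mapping: Callable[..., Any]    # input-to-output mapping
    trainable: bool                # whether the resource is evolvable
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RegistrationRecord:
    """Hypothetical stand-in for the registration record of Definition 3.2."""
    entity: ResourceEntity
    version: str                   # version string
    implementation: str            # import path or source-code string
    init_params: Dict[str, Any] = field(default_factory=dict)
    exports: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # The callable itself is not serialized; in this sketch it is assumed
        # to be rebuilt later from the implementation descriptor.
        payload = {
            "name": self.entity.name,
            "description": self.entity.description,
            "trainable": self.entity.trainable,
            "metadata": self.entity.metadata,
            "version": self.version,
            "implementation": self.implementation,
            "init_params": self.init_params,
            "exports": self.exports,
        }
        return json.dumps(payload, sort_keys=True)


def add(a: int, b: int) -> int:
    return a + b


record = RegistrationRecord(
    entity=ResourceEntity("add", "Add two integers.", add, trainable=True),
    version="1.0.0",
    implementation="def add(a, b):\n    return a + b",
    exports={"function_calling": {"name": "add", "parameters": ["a", "b"]}},
)
```

Keeping the serializable record separate from the live callable is one way to let the same record be persisted, versioned, and re-instantiated.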

Context manager. The context manager implements the management plane for each resource type. Beyond lifecycle control and dependency constraints, it maintains (i) an active registry of materialized resources and (ii) a versioned history for restoration. Its exported API can be viewed as a small set of functionally grouped operators: lifecycle and registration (e.g., init, build), retrieval and inspection (e.g., list, get_state), evolution and versioning (e.g., update, restore), execution and contract (e.g., run, load_contract), and serialization and deserialization (e.g., save_to_json, load_from_json). The manager also supports contract generation: it produces a consolidated capability and constraint specification for the managed entities. Because this specification is stable and kept up to date, it improves reliability, reduces prompt bloat, and enables systematic context engineering via controlled prompt injection. For instance, for tools (which may be native scripts, MCP tools, or agent skills) the contract can take a skills.md-style form that enumerates tool actions, arguments, preconditions, and usage constraints.
Server interface. The server is introduced to encapsulate the context manager's internal complexity and present a stable, simplified interface for external callers. It packages heterogeneous management routines behind a uniform set of endpoints with consistent request/response semantics, while delegating the implementation details to the context manager. This separation isolates clients from internal design changes, reduces coupling, and provides a single control plane through which the protocol mediates safe, version-aware interactions with RSPL resources.
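The split between the context manager and the server can be sketched as follows. This is a simplified illustration under stated assumptions, not the AGP implementation: the class names, the request format ({"op": ..., "args": ...}), the integer version indices, and the contract layout are all hypothetical.

```python
import copy
from typing import Any, Dict, List


class ContextManager:
    """Management plane for one resource type (hypothetical, simplified)."""

    def __init__(self, resource_type: str) -> None:
        self.resource_type = resource_type
        self._active: Dict[str, Dict[str, Any]] = {}         # active registry
        self._history: Dict[str, List[Dict[str, Any]]] = {}  # versioned history

    def build(self, name: str, config: Dict[str, Any]) -> int:
        self._active[name] = copy.deepcopy(config)
        self._history[name] = [copy.deepcopy(config)]
        return 0  # index of the first version

    def list_names(self) -> List[str]:
        return sorted(self._active)

    def get_state(self, name: str) -> Dict[str, Any]:
        return copy.deepcopy(self._active[name])

    def update(self, name: str, changes: Dict[str, Any]) -> int:
        merged = {**self._active[name], **copy.deepcopy(changes)}
        self._active[name] = merged
        self._history[name].append(copy.deepcopy(merged))
        return len(self._history[name]) - 1  # index of the new version

    def restore(self, name: str, version_index: int) -> Dict[str, Any]:
        self._active[name] = copy.deepcopy(self._history[name][version_index])
        return self.get_state(name)

    def load_contract(self) -> str:
        # Consolidated, human-readable description of the managed resources.
        lines = [f"# {self.resource_type} contract"]
        for name in self.list_names():
            config = self._active[name]
            args = ", ".join(config.get("args", []))
            lines.append(f"- {name}({args}): {config.get('description', '')}")
        return "\n".join(lines)


class Server:
    """Uniform endpoint that delegates every request to the context manager."""

    def __init__(self, manager: ContextManager) -> None:
        self._routes = {
            "build": manager.build,
            "list": manager.list_names,
            "get_state": manager.get_state,
            "update": manager.update,
            "restore": manager.restore,
            "load_contract": manager.load_contract,
        }

    def handle(self, request: Dict[str, Any]) -> Dict[str, Any]:
        route = self._routes.get(request.get("op"))
        if route is None:
            return {"ok": False, "error": f"unknown op: {request.get('op')}"}
        return {"ok": True, "result": route(**request.get("args", {}))}
```

Callers only ever see the handle method, so the manager's internals can change without affecting clients, which is the decoupling the server is meant to provide.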
3.1.2. INFRASTRUCTURE SERVICES
RSPL further includes cross-cutting services that support reliable evolution, including reproducibility, safe deployment, and versioned recovery:
- Model manager. A unified model-API layer that standardizes calls across providers (e.g., OpenAI, Anthropic, Google, and OpenRouter), while supporting routing, fallback, and cost-aware selection to keep model access consistent as components evolve.
- Version manager. Maintains version lineage for each resource, enabling rollback, branching, and diffing. Versions are auto-incremented identifiers (e.g., semantic versions) assigned on register or update, each referencing an immutable snapshot of the configuration record and associated artifacts for auditability and reproducibility.
- Dynamic manager. Handles serialization and deserialization of resource configurations for persistence and transfer, enabling safe hot-swapping of resources at runtime without restarting the agent system.
- Tracer module. Captures fine-grained execution traces (inputs, outputs, intermediate decisions, tool interactions, etc.) for interpretability and debugging, and as training signals for dataset synthesis and retrospective improvement.
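The version manager's behavior (auto-incremented versions, immutable snapshots, rollback, and diffing) can be sketched as below. This is a minimal example under assumptions: the VersionLineage class, the patch-level bump policy, and the starting version "1.0.0" are hypothetical, and branching is omitted for brevity.

```python
import copy
from types import MappingProxyType
from typing import Any, Dict, List, Tuple


def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


class VersionLineage:
    """Linear version history for one resource (hypothetical, simplified)."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self._order: List[str] = ["1.0.0"]
        self._snapshots = {"1.0.0": self._freeze(initial)}
        self.head = "1.0.0"

    @staticmethod
    def _freeze(config: Dict[str, Any]) -> MappingProxyType:
        # Read-only view over a deep copy, so later edits by the caller
        # cannot alter a stored snapshot.
        return MappingProxyType(copy.deepcopy(config))

    def commit(self, config: Dict[str, Any]) -> str:
        # Always bump from the newest version ever issued, so identifiers
        # stay unique even after a rollback.
        version = bump_patch(self._order[-1])
        self._order.append(version)
        self._snapshots[version] = self._freeze(config)
        self.head = version
        return version

    def rollback(self, version: str) -> Dict[str, Any]:
        if version not in self._snapshots:
            raise KeyError(version)
        self.head = version
        return copy.deepcopy(dict(self._snapshots[version]))

    def diff(self, old: str, new: str) -> Dict[str, Tuple[Any, Any]]:
        a, b = self._snapshots[old], self._snapshots[new]
        keys = sorted(set(a) | set(b))
        return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

For example, committing a revised prompt yields version "1.0.1", diff reports the changed field, and rollback("1.0.0") returns the original configuration while leaving both snapshots available for audit.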
3.2. Layer 2: Self-Evolution Protocol Layer (SEPL)
The Self-Evolution Protocol Layer (SEPL) establishes a control-theoretic formalism for agentic system evolution. It conceptualizes the continuous improvement of an agentic system as a generalized optimization problem defined over a heterogeneous state space. Formally, SEPL models evolutionary dynamics as a state transition function governed by a strictly typed operator algebra.
By mediating all state mutations through standardized RSPL interfaces, the protocol guarantees that evolution is traceable, reversible, and safe-by-construction. While this paper focuses on the reflection-driven optimizer as the primary instantiation, our implementation also supports other optimization strategies, including TextGrad, GRPO, and Reinforce++, utilizing the same state manipulation primitives.
3.2.1. EVOLVABLE VARIABLES
To transition from heuristic adaptation to a systematic evolution protocol, we introduce the concept of variable lifting. This abstraction projects discrete, heterogeneous RSPL resources (e.g., tool code, system prompts) onto a unified representation of evolvable variables. This formalism offers significant theoretical advantages by homogenizing the interaction surface for evolutionary operators and rigorously delineating the trainable subspace via an explicit learnability mask.
Definition 3.4 (Evolvable Variable Set). We define the universal set of evolvable variables, $\mathcal{V}$, as the union of all managed resource entities and execution artifacts: $\mathcal{V} = \big(\bigcup_{\tau \in \mathcal{T}} \mathcal{E}_\tau\big) \cup \{o\}$, where $\mathcal{E}_\tau$ denotes the set of resource entities of type $\tau$ governed by the RSPL. The element $o$ encapsulates execution artifacts, specifically final outputs and reasoning traces, which constitute the observational basis for retrospective optimization. Furthermore, each variable $v \in \mathcal{V}$ is associated with a binary learnability constraint $b_v \in \{0, 1\}$, thereby strictly defining the trainable parameter subspace $\mathcal{V}_{\mathrm{train}} = \{v \in \mathcal{V} : b_v = 1\}$.
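Variable lifting and the learnability mask can be illustrated with a small sketch. It assumes, purely for illustration, that each resource and artifact is given as a dictionary with "value" and "trainable" fields; the Variable class and the lift and trainable_subspace functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Variable:
    key: str         # e.g. "prompt/system"
    value: str       # current content of the variable
    trainable: bool  # binary learnability constraint


def lift(resources: Dict[str, Dict[str, dict]],
         artifacts: Dict[str, dict]) -> List[Variable]:
    """Project typed resources and execution artifacts onto one flat list."""
    variables = []
    for resource_type, instances in resources.items():
        for name, resource in instances.items():
            variables.append(
                Variable(f"{resource_type}/{name}", resource["value"], resource["trainable"])
            )
    for name, artifact in artifacts.items():
        variables.append(Variable(f"artifact/{name}", artifact["value"], artifact["trainable"]))
    return variables


def trainable_subspace(variables: List[Variable]) -> List[Variable]:
    """Keep only the variables whose learnability flag is set."""
    return [v for v in variables if v.trainable]
```

An optimizer written against the flat Variable list does not need to know whether a given entry originated as a prompt, a tool, or an agent output.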
3.2.2. OPERATOR ALGEBRA
To formalize the evolutionary trajectory as a rigorous control process, we decompose the state transition function into atomic operations that correspond to the canonical phases of iterative optimization: observation, attribution, proposal, verification, and commit. Consequently, we establish five necessary auxiliary spaces to ensure the process is mathematically well-defined. The trace space $\mathcal{Z}$ guarantees system observability; the hypothesis space $\mathcal{H}$ provides the basis for semantic error attribution; the modification space $\mathcal{D}$ formalizes the modification primitives; the objective specification $\mathcal{G}$ defines the optimization landscape; and the evaluation space $\mathcal{S}$ encapsulates performance metrics and safety status. Together, these components form the minimal set required to close the self-evolution loop. Below, $\mathcal{V}$ denotes the space of system states over the evolvable variables.
- Reflect ($\rho$). Defined as $\rho: \mathcal{Z} \times \mathcal{V} \to \mathcal{H}$, this operator bridges the gap between raw observation and optimization direction. It approximates the 'semantic gradient' of the system by mapping high-dimensional execution traces to specific, causal failure hypotheses within the variable space.
- Select ($\sigma$). Formulated as $\sigma: \mathcal{H} \times \mathcal{V} \to \mathcal{D}$, this operator acts as the generative policy. It translates diagnostic hypotheses into concrete update proposals, sampling candidate modifications designed to minimize the identified error signal subject to structural constraints.
- Improve ($\iota$). The mutation operator, $\iota: \mathcal{V} \times \mathcal{D} \to \mathcal{V}$, executes the physical state transition. It applies discrete updates via standardized RSPL interfaces to yield a provisional candidate state.
- Evaluate ($\varepsilon$). Specified as $\varepsilon: \mathcal{V} \times \mathcal{G} \to \mathcal{S}$, this operator serves as the objective function. It maps the candidate state and goal specification to the evaluation space (comprising quantitative scores and strict safety invariants).
- Commit ($\kappa$). Operating as $\kappa: \mathcal{V} \times \mathcal{V} \times \mathcal{S} \to \mathcal{V}$, this function acts as a conditional gating mechanism. It uses the evaluation signals in $\mathcal{S}$ to govern state transition, enforcing safety invariants and performance monotonicity by accepting the candidate only when specific success criteria are met.
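The gating role of the Commit operator can be sketched as a small pure function. This is one possible reading, assuming the evaluation signal consists of a scalar score and a boolean safety flag; the Evaluation class and the min_gain parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Evaluation:
    score: float  # quantitative performance metric
    safe: bool    # True when all safety invariants hold


def commit(current: Dict[str, str],
           candidate: Dict[str, str],
           current_eval: Evaluation,
           candidate_eval: Evaluation,
           min_gain: float = 0.0) -> Dict[str, str]:
    """Return the candidate only if it is safe and strictly improves the score."""
    if not candidate_eval.safe:
        return current  # a safety violation always rejects the candidate
    if candidate_eval.score - current_eval.score > min_gain:
        return candidate
    return current  # no sufficient gain: keep the current state
```

Because the function either returns the candidate or the unchanged current state, the committed score never decreases, which is the monotonicity property the operator is meant to enforce.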
3.2.3. THE EVOLUTIONARY LOOP
The atomic operators defined above are orchestrated into a rigorous closed-loop process, summarized in Algorithm 1. Starting from an initial state, SEPL iteratively executes the system to generate observational traces, derives causal failure hypotheses, and synthesizes modification primitives.
Crucially, the loop is closed via the evaluation space and the Commit operator. This design ensures that self-evolution is not a random walk, but a directed trajectory that is grounded in execution data, traceable through versioned updates, and monotonically improving under strictly defined safety invariants.
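The closed loop described above can be sketched as a generic driver that takes the five operators as callables. This is an illustrative sketch, not Algorithm 1 itself: the evolve function, its argument order, and the toy keyword-based operators are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]


def evolve(state: State, goal: List[str], run: Callable, reflect: Callable,
           select: Callable, improve: Callable, evaluate: Callable,
           commit: Callable, max_rounds: int) -> Tuple[State, List[State]]:
    """Run the Reflect -> Select -> Improve -> Evaluate -> Commit loop."""
    lineage = [dict(state)]  # state recorded after each round
    for _ in range(max_rounds):
        trace = run(state, goal)              # observation
        hypotheses = reflect(trace, state)    # attribution
        if not hypotheses:
            break                             # nothing left to correct
        proposals = select(hypotheses, state)  # proposal
        candidate = improve(state, proposals)  # provisional state
        state = commit(state, candidate,
                       evaluate(state, goal), evaluate(candidate, goal))
        lineage.append(dict(state))
    return state, lineage


# Toy operators: the "task" is to make the prompt mention every goal keyword.
def run(state, goal):
    return [kw for kw in goal if kw not in state["prompt"]]

def reflect(trace, state):
    return list(trace)  # each missing keyword is one failure hypothesis

def select(hypotheses, state):
    return [("prompt", f" Mention {hypotheses[0]}.")]  # fix one issue per round

def improve(state, proposals):
    candidate = dict(state)
    for key, suffix in proposals:
        candidate[key] = candidate[key] + suffix
    return candidate

def evaluate(state, goal):
    return sum(kw in state["prompt"] for kw in goal)

def commit(current, candidate, current_score, candidate_score):
    return candidate if candidate_score > current_score else current


final_state, lineage = evolve({"prompt": "Solve the problem."}, ["units", "steps"],
                              run, reflect, select, improve, evaluate, commit,
                              max_rounds=3)
```

In this toy run the prompt gains one keyword per round and the loop stops early once no failure hypotheses remain, leaving a three-entry lineage.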
4. AGS and Optimization Strategies
This section presents the concrete instantiation of the AGP protocol, demonstrating its practical usability as a self-evolving agent system.
4.1. AGS Architecture
Building on AGP, we instantiate the two-layer protocol into AGS, a self-evolving multi-agent system organized around an Agent Bus architecture. Rather than relying on a monolithic controller or a rigid pipeline, AGS uses a shared message bus as the central coordination backbone: all agents communicate exclusively through standardized bus messages, enabling loose coupling, transparent observability, and concurrent sub-agent execution. Throughout all configurations, prompts, tools (including native scripts, MCP tools, and agent skills), and memory are treated as first-class RSPL resources with explicit lifecycle and version lineage, rather than hard-coded internal components. The system operates through three interleaved mechanisms:
- Orchestration via Plan Generation. Upon receiving a task from the Agent Bus, the Orchestrator is responsible solely for planning and coordination; it does not execute subtasks directly. Concretely, the Orchestrator produces a structured plan.md artifact that records the overall task decomposition: a human-readable flowchart of the execution graph, an ordered list of subtask steps, and the assignment of each subtask to a designated sub-agent (e.g., deep researcher, browser-use agent, tool-calling agent, or tool generator). This plan is registered as a versioned RSPL resource, making the coordination structure itself inspectable and evolvable. The Orchestrator then broadcasts each subtask together with its specification to the corresponding sub-agents via the bus.
- Concurrent Sub-Agent Execution and Iterative Replanning. Upon receiving a broadcast subtask, each sub-agent independently retrieves the relevant prompt and tool resources from the RSPL registry via semantic search, executes tool calls to interact with the environment, and writes intermediate results and reasoning traces to shared memory as persistent, queryable state. Sub-agents operate concurrently: the bus decouples task dispatch from task completion, so multiple sub-agents may execute in parallel without synchronization overhead. Once all sub-agents in the current round have completed, the Orchestrator collects their outputs via the bus, summarizes the aggregated results, and updates plan.md with the current execution state. Based on this global view, the Orchestrator decides whether the task is complete or whether a further round of subtask decomposition and broadcast is required. This collect-and-replan loop repeats until the termination condition is satisfied, enabling the system to handle tasks of arbitrary depth and branching complexity. As a complementary pattern, AGS also supports agent-as-tool composition, in which a sub-agent is wrapped behind a standard RSPL tool schema and directly invoked by a tool-calling agent alongside conventional tools, MCP services, and skills, enabling lightweight multi-agent collaboration without bus-level orchestration.
- Self-Evolution. Interleaved with the bus coordination loop, AGS triggers the SEPL evolutionary loop whenever observational traces signal correctable failures or suboptimal performance. Concretely, the agent (i) reflects on execution traces (tool outputs, errors, latencies, reward signals, and task progress) to derive causal failure hypotheses, (ii) selects targeted modification proposals over evolvable variables (e.g., prompt text, tool source code for native scripts, MCP tool configurations, skill definitions, or the plan structure itself), (iii) applies candidate updates to produce a provisional state, (iv) evaluates the candidate against the objective, and (v) commits accepted modifications as versioned transitions with auditable lineage and rollback. Failed evolution attempts are rolled back without side effects, and successful ones become immediately available to all sub-agents in subsequent bus rounds. This tight integration ensures that evolution is always safe, traceable, and composable across the full lifetime of the agent network.
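The plan, broadcast, collect, and replan cycle can be sketched with asyncio. This is a deliberately simplified illustration: asyncio.gather stands in for the Agent Bus, a plain dictionary stands in for shared memory, and the sub-agents, the make_plan policy, and all names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

SubAgent = Callable[[str], Awaitable[str]]


async def researcher(subtask: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for tool calls
    return f"notes for '{subtask}'"


async def coder(subtask: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for tool calls
    return f"code for '{subtask}'"


AGENTS: Dict[str, SubAgent] = {"researcher": researcher, "coder": coder}


def make_plan(task: str, memory: Dict[str, str]) -> List[Dict[str, str]]:
    """Toy orchestrator policy: two parallel subtasks, then one summary step."""
    if not memory:
        return [{"agent": "researcher", "task": f"research {task}"},
                {"agent": "coder", "task": f"prototype {task}"}]
    if f"summarize {task}" not in memory:
        return [{"agent": "researcher", "task": f"summarize {task}"}]
    return []  # an empty plan signals termination


async def run_round(plan: List[Dict[str, str]],
                    agents: Dict[str, SubAgent]) -> Dict[str, str]:
    # Dispatch all subtasks of the round concurrently and wait for all results.
    results = await asyncio.gather(*(agents[s["agent"]](s["task"]) for s in plan))
    return {step["task"]: result for step, result in zip(plan, results)}


async def orchestrate(task: str, agents: Dict[str, SubAgent],
                      max_rounds: int = 5) -> Dict[str, str]:
    memory: Dict[str, str] = {}  # shared memory written after each round
    for _ in range(max_rounds):
        plan = make_plan(task, memory)
        if not plan:
            break
        memory.update(await run_round(plan, agents))
    return memory


memory = asyncio.run(orchestrate("cache layer", AGENTS))
```

The orchestrator never executes a subtask itself; it only plans, dispatches, and inspects the collected results before deciding whether another round is needed.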
4.2. Instantiating the Optimizer
AGP is agnostic to the specific optimization strategy: any procedure that conforms to the five-operator SEPL interface (Reflect, Select, Improve, Evaluate, Commit) can serve as the evolutionary engine. We describe the primary instantiation used in our experiments and briefly outline alternative strategies supported by our implementation.
- Reflection Optimizer. The default optimizer in our experiments implements the SEPL loop through natural-language reflection. Given an execution trace and the current evolvable state, the Reflect operator prompts the backbone LLM to analyze failures and generate structured diagnostic hypotheses in natural language (e.g., 'the prompt lacks explicit instruction for edge-case handling' or 'the sorting algorithm has excessive complexity on the critical path'). The Select operator then translates these hypotheses into concrete modification proposals, such as appending constraint clauses to the system prompt or rewriting a function body. The Improve operator applies these proposals through the RSPL set_variables interface to produce a candidate state. The Evaluate operator re-executes the task under the candidate state and compares performance against the objective. Finally, the Commit operator accepts the update only if performance improves and safety invariants are preserved, otherwise rolling back to the previous version. This reflection-driven loop is repeated for a fixed budget of rounds.
- Alternative Strategies. Beyond reflection, our implementation supports additional optimization strategies that map naturally onto the same SEPL operator interface:
- TextGrad treats the natural-language feedback produced by Reflect as a 'textual gradient' and applies gradient-descent-like updates to string-valued variables (prompts, code). Within AGP, TextGrad instantiates Select as a gradient-informed proposal generator and Improve as a string-level edit operator, while reusing the standard Evaluate and Commit operators for evaluation and gating.
- Reinforce++ / GRPO adopt a reinforcement-learning perspective, treating the evolvable variables as a policy and the evaluation signal as a reward. Here, Select samples multiple candidate trajectories, Evaluate ranks them by reward, Improve updates the policy parameters (e.g., prompt weights or LoRA adapters) via policy-gradient estimates, and Commit accepts the update only if the updated policy exceeds a baseline return threshold.
These strategies demonstrate that the SEPL operator algebra is sufficiently general to accommodate both inference-time text optimization and gradient-based parameter updates within a unified protocol.
5. Empirical Studies
In this section, we present empirical results from deploying AGS, built on AGP, across several challenging benchmarks to demonstrate its capabilities.
Benchmarks. For GPQA-Diamond (198 questions), we adopt a closed-book, non-retrieval evaluation protocol. The agent is presented with a graduate-level STEM multiple-choice question (covering biology, chemistry, and physics) and must output exactly one option as the final answer. GPQA-Diamond is designed to be Google-proof: simple web search is insufficient, and success typically requires difficult, multi-step scientific reasoning beyond factual recall. Overall, this benchmark measures the agent's deep scientific understanding and closed-book reasoning ability.
For AIME, we use problems from the 2024 and 2025 American Invitational Mathematics Examination (AIME24 and AIME25), each consisting of 30 problems. Each instance requires the agent to solve a competition-level problem and output a single integer answer. We evaluate performance by exact-match accuracy, which primarily measures the agent's long-horizon symbolic reasoning and arithmetic precision.
For GAIA, we evaluate on the GAIA Test split (300 tasks). Each task specifies a real-world, multi-step objective that typically requires planning and tool use (e.g., web browsing and document/file operations). We measure performance by task success (completion), which primarily reflects the agent's long-horizon planning and reliable tool-use execution.
For LeetCode, we construct an in-house, multi-language LeetCode programming benchmark to evaluate executable code generation under reduced data contamination. To mitigate potential training-data contamination from widely circulated legacy problems, we intentionally select recently released problems across diverse categories (e.g., arrays, trees, and linked lists) and split them into 200 training problems and 100 test problems. The agent solves each problem in one of multiple languages (Python, C++, Java, Go, etc.), and we report multiple metrics including overall score (acceptance), test-case pass rate, and runtime, which together measure algorithmic reasoning, implementation correctness, and efficiency.
5.1. Experiments on Scientific and Mathematical Benchmarks
5.1.1. EXPERIMENT SETTING
To validate AGS, our self-evolving agent system built on AGP, we conduct experiments on GPQA-Diamond, AIME24, and AIME25, focusing on evolving prompts and agent outputs. These benchmarks represent standard reasoning tasks where evolution of agent architecture, memory systems, environments, and tools is less critical than instruction refinement and solution quality. To isolate the self-evolution capability on prompts and solutions, we deliberately do not equip AGS with any external tools in this setting, and compare three evolution strategies: evolve prompt only, evolve solution only, and the combined evolve prompt+solution. To cover a range of model capabilities, we evaluate multiple backbone models: lower-performing models (gpt-4o, gpt-4.1), a medium-performing model (claude-sonnet-4.5), and high-performing models (gemini-3-flash-preview, grok-4.1-fast). Our self-evolution algorithm primarily employs the reflection optimizer with a maximum of 3 optimization rounds, after which the agent output is taken as the final solution.
Metrics. We measure performance by exact-match accuracy: for GPQA-Diamond, the agent's selected option must match the ground-truth multiple-choice answer; for AIME24 and AIME25, the agent's numerical output must exactly match the reference integer answer.
5.1.2. RESULTS AND ANALYSIS
The results in Table 1 reveal four key observations across models and evolution strategies. (1) Weak models gain more; strong models gain less. gpt-4.1, with lower vanilla baselines (23.3% on AIME24, 20.0% on AIME25), improves by 71.4% on AIME24 and 66.7% on AIME25 in relative terms under evolve prompt+solution. In contrast, gemini-3-flash-preview starts at 83-88% and improves by 2.3% on GPQA-Diamond and 12.0% on both AIME benchmarks. The reason is straightforward: evolution corrects errors exposed during reflection; weaker models make more correctable mistakes, whereas stronger models already operate near ceiling. claude-sonnet-4.5 occupies a middle tier (76-78% vanilla) and improves by 4.0%, 13.0%, and 22.7% on GPQA, AIME24, and AIME25, respectively, confirming that headroom correlates with evolution benefit. (2) Combined evolution dominates prompt-only and solution-only. Across all models, evolve prompt+solution consistently yields the best scores. For gpt-4.1 on AIME24, evolve prompt reaches 33.3% and evolve solution 36.7%, whereas the combined approach reaches 40.0%; on AIME25, the respective scores are 23.3%, 30.0%, and 33.3%. claude-sonnet-4.5 shows similar patterns: evolve prompt+solution outperforms either single strategy on all three benchmarks. This suggests that instruction refinement and solution refinement address complementary failure modes; combining both corrects more errors than either alone. (3) Math benchmarks respond more strongly than science QA. AIME24 and AIME25 exhibit larger relative gains than GPQA-Diamond. For gpt-4.1, GPQA improves by 3.9% while AIME24 improves by 71.4%; for gemini-3-flash-preview, GPQA improves by 2.3% while both AIME benchmarks improve by 12.0%. Long-horizon symbolic reasoning (multi-step derivations, arithmetic chains) exposes more intermediate failure points that reflection can target; closed-book science QA, by contrast, relies more on factual recall where prompt/solution refinement offers fewer levers.
(4) Ceiling effects cap evolution on saturated benchmarks. grok-4.1-fast reaches 96.7% on AIME24 with vanilla, leaving minimal headroom; evolution yields no gain there. It still improves GPQA and AIME25 by 7.2% and 7.4%, respectively, where baselines are lower. This reinforces that self-evolution is most effective when both model capability and benchmark difficulty leave room for improvement.
In summary, AGS delivers consistent gains across diverse model capabilities and benchmarks throughout our experiments. Stronger models improve modestly but reliably; weaker models improve substantially when sufficient headroom exists. The combined prompt+solution evolution strategy consistently outperforms single-strategy evolution, and math benchmarks benefit more strongly than science QA from iterative refinement.
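The improvement figures quoted for gpt-4.1 can be checked as relative gains over the vanilla score. The sketch below recomputes them; the solved-problem counts (7, 12, 6, and 10 out of 30) are not stated in the text and are inferred here from the reported accuracies.

```python
def relative_gain(vanilla: float, evolved: float) -> float:
    """Relative improvement of `evolved` over `vanilla`, in percent."""
    if vanilla <= 0:
        raise ValueError("vanilla score must be positive")
    return (evolved - vanilla) / vanilla * 100.0


# Assumed counts out of 30 problems, inferred from the reported accuracies.
aime24 = relative_gain(7 / 30, 12 / 30)   # 23.3% -> 40.0%
aime25 = relative_gain(6 / 30, 10 / 30)   # 20.0% -> 33.3%
print(f"AIME24: {aime24:.1f}%  AIME25: {aime25:.1f}%")
# prints "AIME24: 71.4%  AIME25: 66.7%"
```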
5.2. Experiments on General Agent Benchmark
5.2.1. EXPERIMENT SETTING
For GAIA, we focus on evolving tools, as GAIA tasks primarily depend on tool capabilities rather than pure reasoning. Our system architecture consists of a top-level planner agent and multiple specialized sub-agents: a deep researcher, a browser-use agent, a report agent, a tool generator, and a deep analyzer agent. All agents use gemini-3-flash-preview as the backbone model, with a per-agent cap on the maximum number of reasoning steps. The self-evolution of tools is primarily driven by the tool generator agent: given a subtask, it first retrieves candidate tools from the managed tool registry via semantic search; if a suitable tool is found, the agent attempts to execute it, and upon encountering errors, iteratively refines the tool's source code through reflection; if no suitable tool exists, the agent synthesizes a new tool from scratch and registers it as a versioned RSPL resource for future reuse.
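The retrieve, execute, refine, or synthesize flow of the tool generator can be sketched as follows. This is a minimal illustration under stated assumptions, not the AGS implementation: `Tool`, `ToolRegistry`, `solve_subtask`, `synthesize`, and `refine` are hypothetical names, word overlap stands in for semantic search, the two callbacks stand in for LLM-backed synthesis and reflection, and running a freshly synthesized tool immediately is assumed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Tool:
    name: str
    description: str
    source: str        # Python source that defines run(arg)
    version: int = 1


@dataclass
class ToolRegistry:
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        # Latest version wins; a fuller registry would also keep the lineage.
        self.tools[tool.name] = tool

    def search(self, query: str) -> Optional[Tool]:
        # Stand-in for semantic search: rank tools by shared lowercase words.
        words = set(query.lower().split())
        best, best_score = None, 0
        for tool in self.tools.values():
            score = len(words & set(tool.description.lower().split()))
            if score > best_score:
                best, best_score = tool, score
        return best


def execute(tool: Tool, arg: Any) -> Any:
    namespace: Dict[str, Any] = {}
    exec(tool.source, namespace)   # toy execution; real systems need isolation
    return namespace["run"](arg)


def solve_subtask(registry, subtask, arg, synthesize, refine, max_refinements=3):
    tool = registry.search(subtask)
    if tool is None:                       # no suitable tool: synthesize and register
        tool = synthesize(subtask)
        registry.register(tool)
    for attempt in range(max_refinements + 1):
        try:
            return execute(tool, arg)
        except Exception as err:           # execution failed: refine the source
            if attempt == max_refinements:
                raise RuntimeError(f"{tool.name} still fails after refinement") from err
            tool = Tool(tool.name, tool.description, refine(tool, err), tool.version + 1)
            registry.register(tool)


# Toy stand-ins for the LLM-backed synthesis and reflection steps.
def synthesize(subtask):
    return Tool("reverse_text", subtask, "def run(x):\n    return x[::-1]")


def refine(tool, err):
    return "def run(x):\n    return len(x.split())"


registry = ToolRegistry()
registry.register(
    Tool("word_count", "count words in a text", "def run(x):\n    return lenn(x.split())")
)
count = solve_subtask(registry, "count words in report", "a b c", synthesize, refine)
reversed_text = solve_subtask(registry, "reverse string", "abc", synthesize, refine)
print(count, registry.tools["word_count"].version, reversed_text)
# prints "3 2 cba"
```

In this toy run the registered tool contains a typo, so the first execution fails and one refinement produces version 2; the second subtask matches no registered tool and triggers synthesis.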
Metrics. We adopt the Pass@1 score on the GAIA Test split and report task-completion accuracy at each difficulty tier (Level 1, Level 2, Level 3) as well as the overall average.
5.2.2. RESULTS AND ANALYSIS
The results in Table 2 reveal three key observations. (1) AGS achieves state-of-the-art performance. With an average score of 89.04%, AGS surpasses all public leaderboard entries, outperforming the next-best agent ToolOrchestra (87.38%) by 1.66 percentage points. This advantage is especially pronounced on the hardest tier: AGS scores 81.63% on Level 3, compared to 69.39% for HALO and 57.14% for AWorld, demonstrating that evolution-driven adaptation provides the largest gains where task complexity is highest. On the easier tiers, the gap narrows but remains consistent: Level 1 reaches 98.92% (vs. 95.70% for ToolOrchestra) and Level 2 reaches 85.53% (vs. 84.91% for HALO), indicating that tool evolution provides broad-spectrum improvement rather than being limited to a single difficulty regime. (2) Tool evolution yields large gains on hard tasks. Compared to the vanilla baseline (79.07% avg.), evolve tool improves performance by a relative 12.6% overall. The improvement is strongly skewed toward difficulty: Level 1 gains 8.2%, Level 2 gains 10.6%, and Level 3 gains 33.3%. This pattern mirrors the headroom effect observed in the math benchmarks: harder tasks expose more correctable failure modes, which the reflection-driven tool evolution can target. Notably, the 33.3% gain on Level 3 is the largest relative improvement among the three GAIA tiers, underscoring that tool evolution is particularly effective when tasks demand complex multi-step tool chains that static toolkits cannot adequately cover. (3) Hierarchical resource management mitigates planning complexity. GAIA's multi-domain tasks require temporal and cross-modal state coherence, and many baselines degrade during domain transitions (e.g., from browser retrieval to local file analysis).
By treating prompts, tools, and environments as first-class RSPL resources with explicit lifecycle management, AGS preserves session-critical state across agent boundaries, reducing contextual forgetting and enabling compositional generalization on Level 2 and Level 3 scenarios. Furthermore, when the planning agent encounters novel subtasks, it invokes the tool generator to synthesize context-specific functionalities on the fly, bypassing the fixed-capability bottleneck of static agent toolkits. This dynamic tool creation and refinement loop, mediated entirely through the SEPL operator interface, ensures that new capabilities are version-tracked and reusable across subsequent tasks.
In summary, GAIA confirms that AGP's self-evolution protocol extends beyond pure reasoning tasks to complex, tool-intensive agent scenarios. The largest gains emerge on the hardest task tiers, where iterative tool refinement and hierarchical resource management provide the most leverage.
5.3. Experiments on Algorithmic Coding Benchmark
5.3.1. EXPERIMENT SETTING
Benchmark design rationale. Our benchmark construction is driven by three motivations: (i) evaluating inference-time self-evolution on executable code, (ii) calibrating agent performance against the distribution of human submissions, and (iii) assessing cross-language robustness under long-tail language usage. We build on top of the LeetCode online judge, which provides an execution-based evaluation interface and rich feedback signals. Specifically, acceptance status and per-test-case pass rates enable fine-grained assessment of functional correctness beyond binary success. For accepted submissions, the platform reports runtime and memory usage along with percentile-based runtime beats and memory beats statistics computed against the distribution of human submissions, which directly supports human-referenced evaluation. Finally, LeetCode provides standardized starter code across many programming languages, enabling consistent and reproducible multi-language evaluation under a unified protocol.
Data collection. We collect the full set of 3,822 programming problems available on LeetCode at the time of crawling. For each problem, we extract the natural-language statement, official input-output examples, and language-specific starter code templates. Each problem is annotated with its platform-provided difficulty label (Easy, Medium, Hard) and topical tags describing required algorithmic concepts (e.g., arrays, trees, dynamic programming). We perform quality checks including filtering malformed records, removing duplicates, and validating successful parsing of statements, examples, and templates. From the full pool, we select 100 recently released test problems across diverse categories to mitigate training-data contamination.
Evaluation protocol. We compare a vanilla baseline against AGS with evolve solution enabled. For the vanilla baseline, the agent presents a fixed input representation to the model, deterministically extracts executable source code, and submits it to the execution-based judge in a single pass. For AGS, the agent iteratively refines solutions through the SEPL reflection optimizer within a fixed revision budget of 3 rounds, while keeping the task specification and evaluation interface unchanged. This controlled setup enables direct comparison between one-shot generation and inference-time self-evolution on solution quality. We evaluate across five languages (Python3, C++, Java, Go, Kotlin) using multiple backbone models and report multi-dimensional metrics.
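The contrast between the single-pass baseline and budgeted refinement can be sketched as below. This is only an illustration under assumptions: `run_vanilla`, `run_evolve_solution`, `toy_generate`, and `toy_judge` are hypothetical names, the judge is a local stand-in for the online judge, and the budget of 3 is assumed to count revisions after the initial attempt.

```python
from typing import Callable, Optional, Tuple

Generate = Callable[[str, Optional[str]], str]   # (problem, feedback) -> code
Judge = Callable[[str], Tuple[bool, str]]        # code -> (accepted, feedback)


def run_vanilla(generate: Generate, judge: Judge, problem: str) -> bool:
    # One-shot generation: a single submission, no feedback loop.
    accepted, _ = judge(generate(problem, None))
    return accepted


def run_evolve_solution(generate: Generate, judge: Judge, problem: str,
                        budget: int = 3) -> Tuple[bool, int]:
    # Same task and judge, but failed submissions are revised using the
    # judge feedback until acceptance or until the budget is spent.
    code = generate(problem, None)
    accepted, feedback = judge(code)
    rounds = 0
    while not accepted and rounds < budget:
        code = generate(problem, feedback)
        accepted, feedback = judge(code)
        rounds += 1
    return accepted, rounds


def toy_generate(problem: str, feedback: Optional[str]) -> str:
    # Stand-in for the model: wrong on the first try, fixed once feedback arrives.
    if feedback is None:
        return "def solve(x):\n    return x + x"
    return "def solve(x):\n    return x * x"


def toy_judge(code: str) -> Tuple[bool, str]:
    # Stand-in for an execution-based judge with a single hidden test.
    namespace = {}
    try:
        exec(code, namespace)
        if namespace["solve"](3) == 9:
            return True, ""
        return False, "wrong answer on input 3"
    except Exception as err:
        return False, f"runtime error: {err}"


vanilla_ok = run_vanilla(toy_generate, toy_judge, "square a number")
evolved_ok, rounds_used = run_evolve_solution(toy_generate, toy_judge, "square a number")
print(vanilla_ok, evolved_ok, rounds_used)
# prints "False True 1"
```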
Metrics. As shown in Table 4, we report three groups of metrics that capture complementary aspects of coding performance. First, capability metrics measure functional correctness and failure modes under the judge constraints. Second, efficiency metrics summarize runtime and memory cost for accepted submissions. Third, human metrics quantify how often accepted submissions outperform the distribution of human solutions in runtime and memory.
5.3.2. RESULTS AND ANALYSIS
The results in Table 3 and Figure 2 reveal four key findings across coding capability, efficiency, and human-referenced dimensions. (1) Self-evolution consistently improves pass rate across all languages. The evolve solution agent achieves relative pass-rate improvements ranging from 10.1% (Python3) to 26.7% (Kotlin), with compiled languages benefiting most: C++ reaches 99 and Java 98 out of 100 problems. These gains are accompanied by broad reductions in execution-blocking errors; compile error, runtime error, timeout, and response error frequently drop to zero, indicating that iterative refinement effectively repairs format and tooling issues that cause outright failures. Figure 2 (first row) corroborates this finding, showing consistently higher pass-rate trajectories for the evolving agent as problems accumulate. (2) Evolution improves runtime efficiency but shows mixed memory effects. Average runtime decreases in every language, with reductions of 7.8% in Python3 and 19.8-46.4% in compiled languages. Figure 2 (second row) confirms this trend: the evolving agent accumulates substantially lower cumulative runtime, and the gap widens as tasks accumulate. This pattern aligns with reductions in TLE errors, suggesting that reflection helps replace suboptimal algorithms with more efficient ones. Memory usage, however, shows a mixed trend: it decreases in C++, Java, and Go but increases modestly in Python3 and Kotlin, plausibly because the evolving agent introduces auxiliary data structures to ensure correctness or improve speed. (3) Evolved solutions become more competitive against human submissions. Runtime beats (ARB) increase strongly in compiled languages, with gains of 30.8% in C++ and 24.4% in Java, and smaller gains in Go (7.0%) and Kotlin (0.1%). Memory beats (AMB) increase in Python3, C++, Java, and Go, but decrease in Kotlin (15.0% ↓). 
Figure 2 (third row) shows that the evolving agent sustains higher ARB and AMB trajectories than the vanilla agent in most settings, indicating that competitiveness against human submissions improves consistently over the inference trajectory. The Kotlin divergence mirrors the absolute memory trend and suggests that in long-tail languages the evolving agent may trade memory for correctness or speed. (4) Within-inference trajectories reveal compounding improvement dynamics. Beyond endpoint metrics, Figure 2 enables trajectory-level analysis of self-evolution. Across all three metric groups, the gap between evolving and vanilla agents widens as problems accumulate rather than plateauing, suggesting that the reflection-driven optimizer continues to find correctable failure modes throughout the evaluation. This compounding behavior is most pronounced in the runtime panel, where cumulative efficiency gains accelerate in later problems.
In summary, self-evolution on the algorithmic coding benchmark delivers consistent improvements in functional correctness and runtime efficiency across all five languages, with the largest gains in compiled languages where the type system and compiler feedback provide richer signals for reflection. Human-referenced metrics confirm that these gains translate into solutions that are increasingly competitive with human submissions. The within-inference trajectory analysis further demonstrates that AGP not only improves endpoint scores but also enables fine-grained visibility into when and how self-evolution provides the most leverage during a single inference episode.
6. Conclusion
We presented AGP, a two-layer self-evolution protocol that decouples what evolves from how evolution occurs. The Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as first-class, versioned resources with explicit lifecycle and interface contracts. The Self-Evolution Protocol Layer (SEPL) specifies a closed-loop operator algebra for proposing, evaluating, and committing improvements with auditable lineage and rollback. Building on this protocol, we instantiated AGS, a thinking-and-action agent that dynamically retrieves, refines, and evolves heterogeneous resources during execution. We believe this protocol-level approach to self-evolution provides a principled foundation for building modular, traceable, and safely improvable agentic systems.
References
- Anthropic. Equipping agents for the real world with agent skills. http://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, 2025a. Accessed October 2025.
- Anthropic. Introduction to agent skills. https://anthropic.skilljar.com/introduction-to-agent-skills, October 2025b.
- Hu, J. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025a.
- Hu, J. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025b.
- LeetCode. Leetcode online judge. https://leetcode.com. Accessed 2025.
- Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., and Scialom, T. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023.
- Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Lu, P., Huang, Z., Guestrin, C., and Zou, J. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609-616, 2025.
A. Notation
We summarize the main mathematical symbols and their meanings in Table 5. For readability, the notation is grouped by functional categories (grey rows), covering the RSPL substrate (resource entities, registration records, and registries) and the SEPL layer (evolvable variables, auxiliary spaces, and operator definitions used in the optimization loop).
B. Comparison with Other Protocols
We provide a structured comparison between Autogenesis, Google A2A, and Anthropic MCP in Table 6. The goal of this comparison is to position Autogenesis relative to widely used protocol abstractions in agent tooling, and to clarify which protocol-level primitives are required to make self-evolution composable, auditable, and safe in practice. Accordingly, the comparison is organized into five high-level dimensions (grey rows): Basic Information, Agent and System Capabilities, Evolvable Resource Management, Self-Evolution Mechanism, and General and Ecosystem. Blue-highlighted entries emphasize the specific capabilities that enable closed-loop improvement (e.g., lifecycle control, version lineage, contract generation, and operatorized updates), which are not directly addressed by communication- or invocation-centric protocols.
B.1. Basic Information
Proposer: This dimension identifies the originating organization and design context of each protocol. Google's A2A is introduced as part of an agent communication framework, focusing on enabling agents to collaborate via standardized interaction primitives. Anthropic's MCP (Model Context Protocol) is designed to standardize how LLMs connect to external tools and resources. Autogenesis is proposed in this work as a protocol for systematic self-evolution, targeting composable, auditable, and updateable agentic systems.
Protocol Focus: This dimension describes the primary interaction patterns and control plane each protocol standardizes. Autogenesis focuses on enabling closed-loop improvement of agentic systems by organizing resources and updates through protocol operators and versioned state. A2A focuses on multi-agent collaboration and communication. MCP focuses on standardizing model-to-tool (and resource) invocation interfaces.
Entity Scope: This dimension defines what is treated as first-class, protocol-governed components. Autogenesis explicitly manages heterogeneous entities (e.g., prompts, agents, tools, environments, and memory) as protocol-registered resources with explicit state and lineage, which is necessary for component-level evolution (e.g., prompt refinement, tool/code updates). A2A centers around agents (and their interactions), and typically does not establish tools/environments/memory as unified managed entities. MCP treats tools/resources as callable interfaces for LLMs, but does not natively model them as evolvable components with lifecycle and version lineage.
B.2. Agent and System Capabilities
Agent First-Class: First-class support means agents are modeled as managed protocol components with explicit schemas, metadata, and lifecycle hooks (enabling registration, discovery, orchestration, and controlled updates). Autogenesis supports agents as first-class resources. A2A provides agent-centric collaboration but often treats agents as service endpoints without unified lifecycle/version lineage. MCP does not define agents as protocol components, focusing instead on model-to-tool connectivity.
Multi-Agent: This dimension captures whether the protocol natively supports multi-agent composition beyond ad-hoc application logic. Autogenesis supports multi-agent configurations as part of a broader system substrate, enabling coordinated execution with traceability and evolution-ready state. A2A provides direct support for agent-to-agent collaboration. MCP does not address multi-agent orchestration as a protocol concern.
Tracer/Observability: Observability refers to whether the protocol provides native mechanisms to record execution traces (inputs/outputs, intermediate decisions, tool calls, state transitions) for debugging, evaluation, and learning signals. Autogenesis includes protocol-level tracing to support auditable evolution. A2A and MCP typically leave tracing to application-level implementations, which can lead to inconsistent observability.
Memory as Resource: This dimension reflects whether memory is explicitly modeled and managed as a protocol-level component. Autogenesis treats memory as a first-class resource (e.g., readable/writable state with explicit interfaces), enabling persistent improvement and reproducible evolution. A2A and MCP generally do not prescribe a memory management protocol, leaving memory to external systems.
B.3. Evolvable Resource Management
Lifecycle Ops: Lifecycle operations refer to standardized procedures for initializing, registering, constructing, and decommissioning protocol-managed components. Autogenesis provides explicit lifecycle operators so that updates can be applied safely to well-defined targets. A2A and MCP do not provide comprehensive lifecycle management across heterogeneous component types.
Versioning and Rollback: Version lineage and rollback provide the foundation for safe evolution: every update yields an auditable snapshot, supports comparison, and enables restoration when regressions occur. Autogenesis integrates version management as a protocol capability. A2A and MCP do not natively support version lineage for protocol-managed components, making systematic evolution difficult.
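A minimal sketch of version lineage with rollback, assuming an append-only snapshot history, is given below. The class `VersionedResource` and its methods are hypothetical illustrations rather than the protocol's actual interface; restoring is modeled as appending a copy of an old snapshot so that the lineage itself is never rewritten.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class VersionedResource:
    name: str
    history: List[Any] = field(default_factory=list)   # snapshots, oldest first

    def update(self, state: Any) -> int:
        # Every update stores an independent snapshot and returns its 1-based version.
        self.history.append(copy.deepcopy(state))
        return len(self.history)

    def get(self, version: int) -> Any:
        if not 1 <= version <= len(self.history):
            raise KeyError(f"{self.name} has no version {version}")
        return self.history[version - 1]

    @property
    def current(self) -> Any:
        return self.history[-1]

    def restore(self, version: int) -> int:
        # Rollback appends the old snapshot as a new version, keeping the audit trail.
        return self.update(self.get(version))


prompt = VersionedResource("solver_prompt")
v1 = prompt.update("Answer with a single integer.")
v2 = prompt.update("Answer with a single integer and explain.")
v3 = prompt.restore(v1)    # a regression was observed, so go back to v1
print(v3, prompt.current)
# prints "3 Answer with a single integer."
```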
Registry and Retrieval: This dimension captures whether the protocol supports unified registration, listing, and retrieval of components (optionally via semantic search) to enable reuse and scalable coordination. Autogenesis maintains a registry of protocol-registered components and supports retrieval to reduce duplication and improve composability. A2A and MCP provide partial discovery mechanisms but do not define a unified management plane over heterogeneous components.
Contract Generation: Contract generation refers to producing consolidated, up-to-date capability and constraint specifications (e.g., tool actions, arguments, preconditions, usage constraints) for reliable orchestration and reduced prompt bloat. Autogenesis supports contract generation as a systematic form of context engineering. A2A and MCP generally rely on static descriptions or application-layer documentation without protocol-level contract aggregation.
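As an illustration of contract aggregation, the sketch below renders a set of tool descriptions into one consolidated text contract. The `ToolSpec` fields mirror the items named above (actions, arguments, preconditions, usage constraints), but the class, the function `generate_contract`, the example tools, and the output layout are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolSpec:
    name: str
    description: str
    arguments: Dict[str, str]
    preconditions: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)


def _section(title: str, items: List[str]) -> List[str]:
    # Emit a titled bullet list, or nothing when there are no items.
    if not items:
        return []
    return [f"{title}:"] + [f"- {item}" for item in items]


def generate_contract(tools: List[ToolSpec]) -> str:
    # Sort by name so the contract is stable across regenerations.
    lines = ["# Tool contract"]
    for tool in sorted(tools, key=lambda t: t.name):
        lines += ["", f"## {tool.name}", tool.description]
        lines += _section("Arguments", [f"{k}: {v}" for k, v in tool.arguments.items()])
        lines += _section("Preconditions", tool.preconditions)
        lines += _section("Usage constraints", tool.constraints)
    return "\n".join(lines)


contract = generate_contract([
    ToolSpec("read_file", "Read a local file.", {"path": "file path"},
             preconditions=["file exists"]),
    ToolSpec("fetch_page", "Download a web page.", {"url": "page address"},
             constraints=["at most one request per call"]),
])
print(contract)
```

Because the contract is regenerated from the registered specifications, an updated tool yields an updated description without hand-editing any prompt.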
B.4. Self-Evolution Mechanism
Closed-Loop Evolution: Closed-loop evolution means the protocol supports an iterative improvement loop (execute → diagnose → propose → verify → commit) rather than one-off adaptation. Autogenesis is explicitly designed around this loop to enable sustained improvement. A2A and MCP do not provide a native self-evolution loop.
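The loop structure can be sketched with each stage as a pluggable callable. This is a toy under stated assumptions: the function `evolve`, the five callables, and the prompt-as-tuple representation are hypothetical, and the verify step here is a simple "fewer missing instructions" check rather than a real evaluation.

```python
from typing import Callable, List, Optional, Tuple

Prompt = Tuple[str, ...]


def evolve(state: Prompt,
           execute: Callable, diagnose: Callable,
           propose: Callable, verify: Callable,
           rounds: int = 3) -> Tuple[Prompt, List[Prompt]]:
    lineage = [state]                       # every committed state, oldest first
    for _ in range(rounds):
        trace = execute(state)              # execute: run and collect a trace
        issue = diagnose(trace)             # diagnose: pick a failure to fix
        if issue is None:
            break
        candidate = propose(state, issue)   # propose: build a candidate update
        if verify(candidate, state):        # verify: accept only improvements
            state = candidate               # commit: adopt and record lineage
            lineage.append(state)
    return state, lineage


REQUIRED = {"show steps", "state units"}


def execute(prompt: Prompt) -> set:
    return REQUIRED - set(prompt)           # the trace is the set of missing instructions


def diagnose(trace: set) -> Optional[str]:
    return min(trace) if trace else None


def propose(prompt: Prompt, issue: str) -> Prompt:
    return prompt + (issue,)


def verify(candidate: Prompt, current: Prompt) -> bool:
    return len(execute(candidate)) < len(execute(current))


final, lineage = evolve(("answer the question",), execute, diagnose, propose, verify)
print(final, len(lineage))
# prints "('answer the question', 'show steps', 'state units') 3"
```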
Operatorized Updates: This dimension captures whether system updates are expressed as a typed, composable operator interface (rather than ad-hoc scripts), enabling controlled state transitions and repeatable evolution. Autogenesis defines self-evolution as operator-mediated transitions over protocol-managed resources. A2A and MCP do not define an operator algebra for evolution.
Auditability: Auditability means that system changes are traceable and reviewable: what changed, why it changed, under what evidence, and with what evaluation outcome. Autogenesis emphasizes auditability through versioned lineage and trace-based evaluation signals. A2A and MCP provide only partial audit trails via external tooling rather than protocol-level guarantees.
B.5. General and Ecosystem
Model-Agnostic: This dimension captures whether the protocol can work across different LLM backends and providers. Autogenesis is model-agnostic by design via a unified model interface layer. A2A and MCP are also broadly model-agnostic as they define interaction standards rather than binding to a specific model.
Scalability: Scalability reflects how coordination and discovery behave as the number of components grows. Autogenesis supports scalable management by treating heterogeneous components as registry-governed resources with retrieval mechanisms, enabling efficient lookup and controlled orchestration. A2A may face coordination overhead as interactions densify in large multi-agent settings. MCP standardizes tool interfaces but may still rely on application-level orchestration for large tool/resource sets.
Open Ecosystem: Open ecosystem support refers to whether the protocol can enable a reusable ecosystem of interoperable components. Autogenesis provides a full protocol stack for managing, evolving, and auditing agentic components, which supports component sharing and safe integration. A2A and MCP offer partial ecosystem enablement focused on interoperability or tool interfaces, typically requiring additional layers for evolution-ready management.
C. Details of Self-Evolution Protocol
C.1. Layer 1: Resource Substrate Protocol Layer
The Resource Substrate Protocol Layer (RSPL) defines the evolvable substrate as a set of protocol-registered resources with explicit state, lifecycle, and version lineage. In this paper, these resources comprise (i) instructions (Prompt), (ii) decision policies (Agent), (iii) actuation interfaces (Tool), which encompass native tool scripts, MCP tools, and agent skills, (iv) task/world dynamics (Environment), and (v) persistent state (Memory). Crucially, resources in RSPL are passive: they encapsulate no optimization logic and cannot self-modify; all observations and state transitions occur only through controlled, interface-mediated operations invoked by higher layers.
C.1.1. CORE ENTITIES
We focus on these five entity types as a minimal yet expressive substrate for agentic systems. This choice is not intended to be exhaustive, but rather to identify a common denominator across modern agent stacks and provide a uniform target space on which SEPL can operate.
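Definition C.1 (Resource Entity). A resource entity of type $\tau$ and its type-level collection can be represented as: $e_{\tau,i} = (n_{\tau,i}, d_{\tau,i}, \phi_{\tau,i}, g_{\tau,i}, m_{\tau,i})$ and $\mathcal{E}_{\tau} = \{ e_{\tau,i} \}_{i \in \mathcal{I}_{\tau}}$, where $\mathcal{T}$ denotes the set of RSPL entity types, $\tau \in \mathcal{T}$ indexes the entity type, $\mathcal{I}_{\tau}$ is the index set of resource instances of type $\tau$, and $i \in \mathcal{I}_{\tau}$ indexes an individual instance. Here $n_{\tau,i}$ is a unique resource name, $d_{\tau,i}$ is a short description, $\phi_{\tau,i}$ is an input-to-output mapping, $g_{\tau,i}$ is the trainable marker that indicates whether the resource is evolvable, and $m_{\tau,i}$ is an auxiliary metadata dictionary.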
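One way to read Definition C.1 in code is as an immutable record with the five listed fields, grouped into per-type collections. The sketch below is illustrative only; `ResourceEntity`, `collect`, the lowercase type names, and the example resources are assumptions, not identifiers from the system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

ENTITY_TYPES = ("prompt", "agent", "tool", "environment", "memory")


@dataclass(frozen=True)
class ResourceEntity:
    entity_type: str                       # which of the five RSPL types
    name: str                              # unique resource name
    description: str                       # short description
    mapping: Callable[[Any], Any]          # input-to-output mapping
    trainable: bool = False                # whether the resource is evolvable
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")


def collect(entities: List[ResourceEntity]) -> Dict[str, Dict[str, ResourceEntity]]:
    # Build the type-level collections, enforcing unique names within a type.
    collections: Dict[str, Dict[str, ResourceEntity]] = {t: {} for t in ENTITY_TYPES}
    for entity in entities:
        if entity.name in collections[entity.entity_type]:
            raise ValueError(f"duplicate resource name: {entity.name}")
        collections[entity.entity_type][entity.name] = entity
    return collections


collections = collect([
    ResourceEntity("prompt", "solver_prompt", "Step-by-step solving instruction",
                   lambda question: f"Solve step by step: {question}", trainable=True),
    ResourceEntity("tool", "adder", "Add two numbers",
                   lambda pair: pair[0] + pair[1]),
])
evolvable = sorted(
    e.name for group in collections.values() for e in group.values() if e.trainable
)
print(evolvable, collections["tool"]["adder"].mapping((2, 3)))
# prints "['solver_prompt'] 5"
```

Freezing the dataclass loosely reflects the passivity requirement: an entity cannot rewrite its own fields, so any change has to come from outside as a new object.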
A key motivation for making prompt, tool, and memory explicit RSPL resources is decoupling. Many agent systems package prompts, tools, and memory as internal components of an agent, which entangles agent logic with task-specific instructions and capability bundles, increasing maintenance and limiting transfer. By externalizing them as first-class, versioned resources with standardized interfaces, the same tool-calling agent policy can be paired with different prompts and tool sets, and deployed unchanged across tasks and environments.
To support resource registration, unified management, and instantiation, RSPL stores a serializable registration record for each resource instance.
Definition C.2 (Resource Registration Record). A resource registration record and its type-level collection can be represented as: $r_{\tau,i} = (e_{\tau,i}, v_{\tau,i}, \iota_{\tau,i}, \theta_{\tau,i}, X_{\tau,i})$ and $\mathcal{R}_{\tau} = \{ r_{\tau,i} \}_{i \in \mathcal{I}_{\tau}}$, where $\tau$ indexes the entity type and $i \in \mathcal{I}_{\tau}$ indexes an individual instance. Here $e_{\tau,i}$ is the resource entity tuple defined in Definition C.1, $v_{\tau,i}$ is a version string, $\iota_{\tau,i}$ is an implementation descriptor (e.g., import path, class definition, or source-code string), $\theta_{\tau,i}$ are instantiation parameters (e.g., constructor arguments), and $X_{\tau,i}$ is a set of exported representations used by LLMs to interact with the resource (e.g., function-calling schema, natural-language text, and structured argument schema).
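A serializable record of this kind might look like the sketch below. It assumes the implementation descriptor is a source-code string that defines a `build` function, and it flattens the entity's name, description, and trainable flag into the record instead of storing the callable; `RegistrationRecord` and all field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class RegistrationRecord:
    name: str
    description: str
    trainable: bool
    version: str
    implementation: str              # assumed: source string defining build(**params)
    init_params: Dict[str, Any]      # instantiation parameters
    exports: Dict[str, Any]          # representations shown to the LLM

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "RegistrationRecord":
        return cls(**json.loads(payload))

    def instantiate(self) -> Any:
        # Rebuild the live resource from its descriptor and parameters.
        namespace: Dict[str, Any] = {}
        exec(self.implementation, namespace)   # toy only; untrusted code needs isolation
        return namespace["build"](**self.init_params)


record = RegistrationRecord(
    name="offset_adder",
    description="Add a fixed offset to a number",
    trainable=True,
    version="1.0.0",
    implementation="def build(offset):\n    return lambda x: x + offset",
    init_params={"offset": 2},
    exports={"schema": {"name": "offset_adder", "parameters": {"x": "number"}}},
)
restored = RegistrationRecord.from_json(record.to_json())
print(restored == record, restored.instantiate()(3))
# prints "True 5"
```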
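Definition C.3 (Protocol-registered resource). For each entity type $\tau$, let $\mathcal{G}_{\tau}$ denote the type-specific registry of protocol-registered resources, and let $\mathcal{G} = \{ \mathcal{G}_{\tau} \}_{\tau}$ denote the global registry. RSPL binds each entity type $\tau$ to a dedicated context manager $\mathcal{M}_{\tau}$ and a server-exposed interface $\mathcal{S}_{\tau}$. We represent the type-level registered resource as $\mathcal{G}_{\tau} = (\mathcal{R}_{\tau}, \mathcal{M}_{\tau}, \mathcal{S}_{\tau})$ with $\mathcal{R}_{\tau} = \{ r_{\tau,i} \}_{i \in \mathcal{I}_{\tau}}$, where each $r_{\tau,i}$ is a registration record in Definition C.2. The context manager $\mathcal{M}_{\tau}$ maintains the collection $\mathcal{R}_{\tau}$ and the version lineage for type $\tau$, and implements lifecycle and update operations over these records; the server-exposed interface $\mathcal{S}_{\tau}$ encapsulates $\mathcal{M}_{\tau}$ and exposes a unified external interface by delegating requests to the corresponding context-manager routines.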
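The division of labor between the per-type context manager and the server-exposed interface can be sketched as plain delegation. The classes `ContextManager` and `ServerInterface`, the operation names, and the dictionary-based records are assumptions for illustration; the actual interface may differ.

```python
from typing import Any, Dict, List


class ContextManager:
    """Owns the records and version lineage for one entity type."""

    def __init__(self, entity_type: str) -> None:
        self.entity_type = entity_type
        self._records: Dict[str, dict] = {}
        self._lineage: Dict[str, List[dict]] = {}

    def register(self, name: str, record: dict) -> None:
        if name in self._records:
            raise KeyError(f"{name} is already registered")
        self._records[name] = dict(record)
        self._lineage[name] = [dict(record)]

    def update(self, name: str, record: dict) -> int:
        if name not in self._records:
            raise KeyError(f"{name} is not registered")
        self._records[name] = dict(record)
        self._lineage[name].append(dict(record))
        return len(self._lineage[name])     # number of versions so far

    def get(self, name: str) -> dict:
        return dict(self._records[name])

    def list(self) -> List[str]:
        return sorted(self._records)


class ServerInterface:
    """Unified entry point that delegates to the bound context managers."""

    OPERATIONS = ("register", "update", "get", "list")

    def __init__(self) -> None:
        self._managers: Dict[str, ContextManager] = {}

    def bind(self, manager: ContextManager) -> None:
        self._managers[manager.entity_type] = manager

    def request(self, entity_type: str, operation: str, *args: Any) -> Any:
        if operation not in self.OPERATIONS:
            raise ValueError(f"unsupported operation: {operation}")
        return getattr(self._managers[entity_type], operation)(*args)


server = ServerInterface()
server.bind(ContextManager("tool"))
server.request("tool", "register", "search", {"version": "1.0"})
versions = server.request("tool", "update", "search", {"version": "1.1"})
names = server.request("tool", "list")
latest = server.request("tool", "get", "search")
print(versions, names, latest["version"])
# prints "2 ['search'] 1.1"
```

Callers never touch the manager's internal state directly, which matches the requirement that all state transitions go through controlled, interface-mediated operations.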
C.1.2. CONTEXT MANAGER
The context manager implements the management plane for each resource type. Beyond lifecycle control and dependency constraints, it maintains (i) an active registry of materialized resources and (ii) a versioned history for restoration. Its exported API can be viewed as a small set of functionally grouped operators for lifecycle and registration (e.g., init, build), retrieval and inspection (e.g., list, get state), evolution and versioning (e.g., update, restore), execution and contract (e.g., run, load contract), and serialization and deserialization (e.g., save to json, load from json). The manager explicitly supports contract generation, producing a consolidated capability and constraint specification for the managed entities. This specification provides stable, up-to-date descriptions that improve reliability and reduce prompt bloat, enabling systematic context engineering via controlled prompt injection. For instance, for tools (which may be native tool scripts, MCP-connected tools, or agent skills), the contract can take a skills.md-style form that enumerates tool actions, arguments, preconditions, and usage constraints. The exported management interface, implemented by the context manager and exposed by the server-exposed interface, is as follows: