500 Seed Samples, Four Self-Evolving Agents: Reasoning Capability Surges by 10.7%

LiveCodeBench scores rose by 8.9% and OlympiadBench by 10.7%. These figures come from a framework that bootstrapped training with merely 500 seed samples. Without massive-scale human annotation or external teacher models, four agents derived from a single LLM challenged each other with problems, evaluated one another, and co-evolved, ultimately elevating the base model's reasoning capabilities to a new tier.

The framework proposed in this paper, called SAGE (Self-evolving Agents for Generalized reasoning Evolution), operates on a core concept: enabling a single LLM to simultaneously assume the roles of Problem Poser, Planner, Solver, and Critic, completing a self-training loop through adversarial collaboration.

The Bottleneck in Reinforcement Learning for Reasoning

RLVR (Reinforcement Learning with Verifiable Rewards) has proven effective in boosting LLM reasoning, with works like DeepSeek-R1 serving as prime examples. However, the limitations are stark: these methods rely heavily on large-scale human-annotated datasets to provide verifiable reward signals. This creates a scalability bottleneck as model capabilities approach or even surpass human levels.

Recent self-play and multi-agent approaches have attempted to address this dependency. For instance, SPIRAL utilizes zero-sum game environments for autonomous improvement, while Absolute Zero allows models to generate their own programming problems and solve them. Yet, the paper points out that these methods generally suffer from two shortcomings: a lack of explicit planning capabilities for complex, multi-step reasoning tasks, and insufficient quality control leading to instability during long-range training.

Four Agents, One Closed Loop

SAGE's architectural design is remarkably ingenious. The four agents share the same LLM backbone but differentiate their functions through distinct role-specific instructions:

Challenger (Problem Poser): Samples reference problems from a small seed set to generate new, more difficult questions along with their validators (standard answers or executable test cases). Its reward comprises three components, each weighted at 1/3: a quality score from the Critic, a difficulty reward based on the Solver's failure rate, and a format reward.

Planner: Upon receiving a problem, it generates a structured, multi-step solution plan. The Critic scores the plan's quality; only plans exceeding a gating threshold (set at β=0.3 in the paper) are passed to the Solver. Otherwise, the Solver attempts the problem directly.

Solver: Generates the final answer based on the problem and the approved plan. Its reward is a weighted combination of the plan quality score, validator correctness score, and format reward, with weights of (0.2, 0.6, 0.2) respectively—placing the greatest emphasis on correctness.

Critic: Provides two types of signals: a soft score for output format and a quality score (1-10, normalized to [0,1]) for problems generated by the Challenger and plans by the Planner. Crucially, correctness judgments are performed by external validators, not the Critic itself, avoiding the circular bias inherent in self-evaluation.
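
Putting these numbers together, the per-role reward shaping might be sketched as follows. This is a hypothetical sketch: only the equal 1/3 weights, the (0.2, 0.6, 0.2) weights, and the β = 0.3 gate come from the paper; the function names and exact formulas are my assumptions.

```python
# Hypothetical sketch of SAGE's per-role reward shaping; only the weights
# and the beta gate are taken from the paper, the rest is illustrative.

def challenger_reward(quality: float, solver_fail_rate: float,
                      format_ok: bool) -> float:
    """Challenger: equal 1/3 weights on Critic quality score, difficulty
    (here modeled as the Solver's failure rate), and format compliance."""
    return (quality + solver_fail_rate + float(format_ok)) / 3.0

def solver_reward(plan_quality: float, correct: bool,
                  format_ok: bool) -> float:
    """Solver: (0.2, 0.6, 0.2) weights; validator correctness dominates."""
    return 0.2 * plan_quality + 0.6 * float(correct) + 0.2 * float(format_ok)

def use_plan(plan_quality: float, beta: float = 0.3) -> bool:
    """Planner gate: only plans scoring above beta reach the Solver;
    otherwise the Solver attempts the problem directly."""
    return plan_quality > beta
```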

[Figure 1: SAGE Framework Overview] Four specialized agents—Challenger, Planner, Solver, and Critic—interact through quality filtering and format validation to achieve closed-loop self-evolution.

[Figure 2: SAGE Training Process] (1) The Challenger generates problems from reference examples, filtered by the Critic for quality; (2) Verified problems expand the dataset; (3) Sampled problems are processed by the Planner and Solver to generate solutions; (4) All agents are jointly updated via Task-Relative REINFORCE++, utilizing role-normalized advantage functions.

A noteworthy design feature here is the quality filtering and difficulty suppression mechanism. When the Critic's quality score falls below the threshold α=0.7, the problem is excluded from the training set, and the difficulty reward component is removed entirely. This prevents problems that are "seemingly difficult but actually erroneous" from contaminating the training signal. This mechanism is critical for the stability of long-term self-training.
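
A minimal sketch of this filter, assuming the difficulty reward is the Solver's failure rate: only the α = 0.7 threshold and the suppression behavior come from the paper; the function and names are mine.

```python
# Illustrative sketch (not the authors' code) of quality filtering with
# difficulty suppression, using alpha = 0.7 as reported in the paper.

ALPHA = 0.7  # Critic quality threshold for admitting a generated problem

def filter_problem(critic_quality: float, solver_fail_rate: float):
    """Return (keep, difficulty_reward). A problem scored below ALPHA is
    both excluded from the training set and stripped of its difficulty
    reward, so 'seemingly difficult but erroneous' items earn no credit."""
    if critic_quality < ALPHA:
        return False, 0.0
    return True, solver_fail_rate
```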

The joint update for all agents employs the Task-Relative REINFORCE++ algorithm. Its core lies in calculating the mean and standard deviation of the advantage function separately for each role to normalize them, effectively addressing coordination issues in training under heterogeneous multi-agent objectives.
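
Role-wise normalization might look like the following. This is one reading of "role-normalized advantage functions," not the authors' implementation; the use of population statistics and the zero-variance fallback are my assumptions.

```python
# Sketch of role-wise advantage normalization in the spirit of
# Task-Relative REINFORCE++: each role's advantages are standardized
# against that role's own batch statistics, not the pooled batch.

from collections import defaultdict
from statistics import mean, pstdev

def role_normalized_advantages(samples):
    """samples: list of (role, reward) pairs. Returns one advantage per
    sample, normalized by its role's batch mean and std deviation."""
    by_role = defaultdict(list)
    for role, reward in samples:
        by_role[role].append(reward)
    # Fall back to std = 1.0 when a role's batch has zero variance.
    stats = {role: (mean(rs), pstdev(rs) or 1.0)
             for role, rs in by_role.items()}
    return [(reward - stats[role][0]) / stats[role][1]
            for role, reward in samples]
```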

What Can 500 Data Points Achieve?

The paper's training set consists of only 500 samples: 156 from MATH, 148 from GSM8K, 87 from HumanEval, and 109 from MBPP. Evaluation covers two domains: mathematical reasoning (GSM8K, MATH, AIME'24, AIME'25, OlympiadBench, AMC'23) and code generation (HumanEval+, MBPP+, LiveCodeBench v1-v5). Base models include Qwen-2.5-3B-Instruct, Qwen-2.5-7B-Instruct, and Qwen-3-4B-Base.

[Table 1: Main Results on Reasoning Benchmarks] Comparison of pass@1 accuracy across three model scales for post-training methods. SAGE achieves the best overall performance on all three model backbones.

Key figures: On Qwen-2.5-7B, SAGE improved LiveCodeBench scores from 17.5% to 26.4% (+8.9%) and OlympiadBench from 28.0% to 38.7% (+10.7%). The overall average rose from 47.6% to 50.1%.

Regarding baselines, while AZR and MAE showed improvements on certain individual benchmarks, their performance was inconsistent and sometimes regressed. For instance, AZR caused the Math Avg. on Qwen-3-4B to plummet from 56.3% to 46.7%. In contrast, SAGE exhibited no performance degradation on any benchmark group.

[Table 2: In-Distribution vs. Out-of-Distribution Generalization] On the 7B model, SAGE improved OOD (Out-of-Distribution) averages by 4.2% without sacrificing in-distribution accuracy.

However, it must be noted that on the more capable Qwen-3-4B, SAGE's overall improvement narrowed to just 0.2% (55.7% → 55.9%), with gains concentrated mainly in LiveCodeBench (+9.1%). This suggests that when the base model is already sufficiently strong, the marginal returns of self-evolution diminish.

Ablation Studies and Training Dynamics

[Table 3: SAGE Component Ablation Study] Impact of removing individual agent training on Qwen-2.5-3B. Removing Solver training resulted in the largest overall decline.

Ablation results indicate that removing Solver training caused the greatest drop in overall average (42.0% → 38.2%). Removing Challenger training primarily affected code benchmarks, with LiveCodeBench scores plummeting from 16.9% to 9.0%. Removing Critic training had little impact on math but severely hampered code performance. The adversarial interaction between Challenger and Solver forms the core evolutionary loop, while the Critic provides indispensable quality control.

[Figure 3: Training Dynamics on Qwen-2.5-3B] The Challenger continuously expands the problem bank during training. Validation accuracy peaks around 100-120 steps before gradually declining, suggesting potential over-specialization on self-generated curricula.

An interesting finding: the number of effective problems grew from 1,136 to 20,532 (an 18-fold expansion) during training, yet validation accuracy began to decline after peaking at 69.5% around step 100. Growth in problem quantity does not automatically equate to performance gains. This highlights the importance of curriculum diversity and difficulty calibration, explaining why the paper reports main experimental results at approximately 100 steps.

Final Thoughts

SAGE currently operates only in domains where correctness can be automatically verified (mathematics, programming); it still requires 500 seed samples to initiate; and its evaluation scope is limited to math and code. Furthermore, the over-specialization trend revealed in the training dynamics analysis implies that actual deployment will require monitoring training curves and implementing early stopping.

SAGE demonstrates a compelling technical path: using a minimal amount of seed data to initiate multi-agent closed-loop evolution, enabling LLMs to continuously self-improve on reasoning tasks. The division of labor among the four roles (posing, planning, solving, and critiquing) proves indispensable, together balancing training-signal quality against curriculum difficulty. Whether this paradigm can break through the boundaries of verifiable domains into more open-ended reasoning scenarios remains a direction worth watching closely.

Original Title: SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Original Link: https://arxiv.org/abs/2603.15255


