SkillOpt: Microsoft Proposes Self-Evolving Agent Skills

When building complex Agent systems, engineering teams often hit a stubborn obstacle: on real-world tasks that involve multi-step execution, tool calling, and file processing, frozen frontier closed-source LLMs often lack the "procedural knowledge" a specific domain requires. Traditional remedies include hand-writing giant System Prompts, self-correcting from a single error trajectory, or rewriting prompts from extracted trajectory logs. These approaches lack the "controllability" that is routine in deep network training: unconstrained text rewriting easily triggers catastrophic forgetting, and without rigorous held-out isolation the system's evolved "new skills" tend to overfit to a single failure case.

Microsoft, in collaboration with multiple universities, has introduced the SkillOpt framework. It sets aside piecemeal Prompt Engineering and instead treats the Agent's "skill document" as an external trainable state, with a full set of training analogues: forward propagation, backward propagation, a text learning rate, validation set gating, and epoch-level slow updates. For engineering teams that cannot fine-tune closed-source model weights yet urgently need to improve Agent procedural execution in vertical business scenarios, it offers a standardized, ready-to-adopt infrastructure design paradigm.

Cracking the Black Box of Agent Procedural Adaptation

This paper addresses the real business problem of "Domain Adaptation of Large Language Models in Multi-step Execution Environments."

In complex data processing, code generation, or long-chain multimodal reasoning tasks, adapting to the target domain requires the system to possess correct calling conventions, format constraints, evidence collection sequences, and failure handling patterns. When model weights are immutable, optimizing the external skill document becomes the only adaptive channel. Current mainstream Agent self-evolution mechanisms (such as simple error-based retries, Trace2Skill trajectory distillation, GEPA reflective evolution, or EvoSkill skill evolution) universally suffer from the following engineering defects:

Overfitting to Single Samples: Reflection on a single trajectory often yields highly specific patches lacking generality.
Unstable Semantic Jumps: Lacking "learning rate" or "step size" constraints, newly generated Prompts replace large portions of old content, causing loss of already-mastered skills.
Lack of Out-of-Sample Validation: Unintercepted modifications enter deployment directly, leading to performance degradation on unseen data.

SkillOpt's design philosophy is ruthlessly engineering-oriented: it treats textual skill editing entirely as a controllable domain adaptation training process. Outside the frozen execution agent, it introduces an independent "Frontier Optimizer Model," supplemented by training set collection, minibatch reflection, step size control, interception mechanisms, and other classic machine learning techniques.

Core Methodology

The SkillOpt framework's operation closely mirrors a deep learning optimizer. Its core innovation lies in mapping the logic of gradient-based model updates onto the maintenance of a purely textual skill document. Specifically, the system architecture divides into an Execution Model (Target Model, responsible for running tasks in the environment) and an Optimizer Model (responsible for analyzing trajectories and generating document edit instructions).

Figure 3: Parameter Mapping and Training Loop

2.1 Parameter Mapping and Basic Settings

In SkillOpt's context, the Agent's entire adaptive process is reconstructed as:

Model Parameters correspond to an independent Markdown Skill Document.
Gradient Direction corresponds to structured textual edit suggestions (Add/Delete/Replace) derived from multiple historical trajectories.
Learning Rate corresponds to the maximum allowed text edit entries per update (Edit Budget).
Validation Mechanism corresponds to an independent validation set test gate with absolute veto power (Held-out Selection Gate).
Stable Training Mechanism corresponds to batch processing, learning rate scheduling, and epoch-level slow updates.

Before training, the system strictly partitions the dataset into a training set, a validation set (called Selection split in the paper), and a final test set. All trial-and-error and trajectory reflection occur solely on the training set.
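The three-way partition can be sketched as a single shuffled cut. This is a minimal illustration, not the paper's code: the 60/20/20 ratios, the fixed seed, and the name split_dataset are assumptions made here for the example.

```python
import random

def split_dataset(tasks, train_frac=0.6, selection_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint train / selection / test splits.

    The fractions and seed are illustrative assumptions, not paper values.
    """
    items = list(tasks)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_sel = int(len(items) * selection_frac)
    return {
        "train": items[:n_train],                       # rollouts and reflection happen here
        "selection": items[n_train:n_train + n_sel],    # used only by the validation gate
        "test": items[n_train + n_sel:],                # touched once, for final reporting
    }

splits = split_dataset(range(100))
```

Because the three slices come from one shuffled list, no task can appear in more than one split, which is the property the gate and the final test numbers rely on.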

2.2 Forward and Backward Propagation: Trajectory Collection and Minibatch Reflection

Forward Propagation (Rollout Evidence): In each optimization Step, the Execution Model carries the current Skill Document and executes a Batch of tasks sampled from the training set. The system records task metadata, message flows, tool call logs, command-line outputs, final answers, and environmental feedback in detail. These trajectory data constitute the raw material for optimization. To expose systematic defect patterns, SkillOpt employs a large Rollout Batch (default 40 trajectories/step) to accumulate sufficient statistical evidence before skill modification.

Backward Propagation (Minibatch Reflection): The Optimizer Model takes over these scored trajectories. It strictly separates successful from failed trajectories and further splits each group into minibatches (default size 8). Because it reflects on a minibatch rather than a single trajectory, the optimizer must find "common procedural errors" shared across multiple failure samples, which discourages writing a specific patch for a single error. For the failure group, the optimizer proposes corrective rules; for the success group, it proposes working patterns to retain or solidify.
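The grouping step described above can be sketched as follows. The Trajectory fields, the success threshold, and the function name make_minibatches are hypothetical; only the rollout batch of 40 and the minibatch size of 8 come from the text.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task_id: str
    score: float      # environment feedback; 1.0 counts as success in this sketch
    log: str = ""     # messages, tool calls, command output, etc.

def make_minibatches(trajectories, minibatch_size=8, success_threshold=1.0):
    """Separate failures from successes, then chunk each group into minibatches."""
    failures = [t for t in trajectories if t.score < success_threshold]
    successes = [t for t in trajectories if t.score >= success_threshold]

    def chunk(items):
        return [items[i:i + minibatch_size] for i in range(0, len(items), minibatch_size)]

    return {"failure": chunk(failures), "success": chunk(successes)}

# A fake rollout batch of 40 trajectories in which every fourth task fails.
rollout = [Trajectory(f"task-{i}", 1.0 if i % 4 else 0.0) for i in range(40)]
groups = make_minibatches(rollout)
```

Each failure minibatch would then be handed to the Optimizer Model as one reflection prompt, so any proposed rule has to explain several failures at once.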

2.3 Text Learning Rate and Controlled Updates (Bounded Text Updates)

To prevent destructive wholesale rewrites in a single iteration, SkillOpt enforces "Bounded Text Updates." It introduces an Edit Budget (analogous to learning rate). After collecting local modification suggestions from each minibatch, the Optimizer Model globally aggregates, deduplicates, and ranks the edit pool by expected utility, finally forcibly trimming to the Top-K edit actions (insertions, replacements, deletions).

The default system scheduler employs a Cosine annealing strategy, allowing larger restructuring initially (e.g., 4 modification suggestions) and gradually decaying to minimal step-size local fine-tuning (lower bound of 2) as training epochs progress.
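One way to picture the schedule is a cosine decay over the edit budget, by analogy with learning rate annealing. The exact formula SkillOpt uses is not given in this article, so the rounding and the step-based progress below are assumptions; only the bounds of 4 and 2 come from the text.

```python
import math

def edit_budget(step, total_steps, max_edits=4, min_edits=2):
    """Cosine-annealed number of edits allowed at a given step (a sketch)."""
    if total_steps <= 1:
        return max_edits
    # Progress runs from 0.0 at the first step to 1.0 at the last one.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1.0 -> 0.0
    return min_edits + round((max_edits - min_edits) * cosine)

budgets = [edit_budget(s, 5) for s in range(5)]
```

With five steps this yields a budget that starts at 4, passes through 3 at the midpoint, and ends at the lower bound of 2.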

2.4 Extremely Strict Validation Set Gating Mechanism (Validation Gate)

This is SkillOpt's core module for avoiding overfitting. All selected K text edit actions are merged to generate a Candidate Skill. The Execution Model must carry this Candidate Skill and re-run benchmarks on the independent validation set.

The gating rule is extremely strict: the Candidate Skill's score on the validation set must be strictly greater than the Current Skill's score for it to be accepted and set as the new Current Skill. Ties and score decreases are discarded outright. This uncompromising gatekeeping ensures that diagnoses which sound plausible at the text level but are actually wrong cannot cause substantive harm to real execution.
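The acceptance rule reduces to a strict comparison. In this sketch, evaluate stands in for running the Execution Model on the held-out selection split; that callable, the function name validation_gate, and the fake scores are all hypothetical.

```python
def validation_gate(current_skill, candidate_skill, evaluate):
    """Accept the candidate only if it strictly beats the current skill."""
    current_score = evaluate(current_skill)
    candidate_score = evaluate(candidate_skill)
    accepted = candidate_score > current_score   # ties are rejected
    return (candidate_skill if accepted else current_skill), accepted

# Fake selection-split scores used only to exercise the gate.
fake_scores = {"skill-v1": 0.62, "skill-v2": 0.62, "skill-v3": 0.65}

kept_after_tie, tie_accepted = validation_gate("skill-v1", "skill-v2", fake_scores.get)
kept_after_gain, gain_accepted = validation_gate("skill-v1", "skill-v3", fake_scores.get)
```

Using > rather than >= is the whole point: an edit that changes the document without measurably helping never enters the skill.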

2.5 Rejected-Edit Buffer

Under the strict validation mechanism, numerous candidate modifications are rejected. SkillOpt establishes an epoch-local buffer recording text edit actions that were attempted but caused score drops, along with the failure patterns they attempted to address. In subsequent analyses within the same epoch, the Optimizer Model reads this history to avoid modification paths already proven ineffective, effectively injecting zero-cost negative feedback memory into the training loop.
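A rejected-edit buffer of this kind can be as simple as a list that is rendered into the optimizer's prompt and cleared at each epoch boundary. The class below is a hypothetical sketch; the field names and the prompt wording are assumptions, not SkillOpt's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class RejectedEditBuffer:
    """Epoch-local memory of edits that failed the validation gate."""
    entries: list = field(default_factory=list)

    def record(self, edit_text, failure_pattern, score_delta):
        self.entries.append(
            {"edit": edit_text, "failure_pattern": failure_pattern, "score_delta": score_delta}
        )

    def as_prompt_context(self):
        """Render the history as text the Optimizer Model can read."""
        if not self.entries:
            return "No rejected edits so far in this epoch."
        lines = ["Rejected edits in this epoch (avoid repeating them):"]
        for e in self.entries:
            lines.append(
                f"- {e['edit']} | targeted: {e['failure_pattern']} | delta: {e['score_delta']:+.1f}"
            )
        return "\n".join(lines)

    def reset(self):
        """Call at the start of each epoch so the memory stays epoch-local."""
        self.entries.clear()

buffer = RejectedEditBuffer()
buffer.record("Always re-open the file before writing", "stale reads", -1.5)
context = buffer.as_prompt_context()
```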

2.6 Epoch-Level Slow Updates and Meta Skill

Rapid Step updates handle current Batch issues, while cross-Epoch analysis captures long-term patterns. At each Epoch's end, SkillOpt compares "Previous Epoch Skill" vs. "Current Epoch Skill" performance on the same training samples, classifying outcomes as: performance improvement, performance regression, stubborn failures, and stable successes.
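The four-way classification can be sketched as a comparison of per-task outcomes under the two skills. The dictionaries of pass/fail booleans and the name compare_epochs are assumptions made for illustration.

```python
def compare_epochs(prev_results, curr_results):
    """Classify each shared training task by how its outcome changed.

    Both arguments map task_id -> bool (True means the task succeeded).
    """
    report = {"improved": [], "regressed": [], "stubborn_failure": [], "stable_success": []}
    for task_id in sorted(prev_results.keys() & curr_results.keys()):
        before, after = prev_results[task_id], curr_results[task_id]
        if after and not before:
            report["improved"].append(task_id)
        elif before and not after:
            report["regressed"].append(task_id)
        elif after:
            report["stable_success"].append(task_id)
        else:
            report["stubborn_failure"].append(task_id)
    return report

prev = {"t1": False, "t2": True, "t3": False, "t4": True}
curr = {"t1": True, "t2": False, "t3": False, "t4": True}
report = compare_epochs(prev, curr)
```

The regressed and stubborn_failure buckets are the ones that would most plausibly drive the slow-update guidance, since they expose what fast edits broke or never fixed.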

Based on this longitudinal comparison report, the Optimizer generates two artifacts:

Slow Update Guidance: written into a "Protected Region" in the Skill Document demarcated by special Markdown markers. Regular Step-level fast fine-tuning cannot modify this region, ensuring core domain strategies survive across cycles.
Meta Skill: a purely internal guide for the Optimizer itself, recording "which text modification styles are more likely to be accepted by the validation set in this specific environment, and which tend to backfire." It never participates in final deployment, existing only in the training phase context.
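The protected region can be enforced mechanically by checking that step-level edits leave it byte-for-byte unchanged. The marker strings below are invented for this sketch, since the article does not give the actual markers, and the helper names are likewise hypothetical.

```python
import re

# Hypothetical markers; the real SkillOpt marker text is not given in this article.
START = "<!-- PROTECTED:BEGIN -->"
END = "<!-- PROTECTED:END -->"
_PATTERN = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)

def protected_blocks(doc):
    """Return every protected block in the skill document, markers included."""
    return _PATTERN.findall(doc)

def preserves_protected_region(before, after):
    """True if a fast (step-level) edit left all protected blocks untouched."""
    return protected_blocks(before) == protected_blocks(after)

skill = (
    "# Skill\n"
    f"{START}\nAlways inspect workbook structure first.\n{END}\n"
    "- Fast rule A\n"
)
ok_edit = skill + "- Fast rule B\n"
bad_edit = skill.replace("inspect", "skip")
```

A candidate that fails this check can be discarded before it ever reaches the validation gate, which saves a full evaluation run.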

Implementation Details

SkillOpt demonstrates high engineering rigor in low-level modular decomposition, using structured JSON contracts to decompose reflection, merging, and scoring into independently orchestratable Agent chains.

1. Error Analysis Flow (analyst_error.md): The Optimizer Model receives multiple failure trajectories and must follow strict rules: identify cross-sample common error patterns, output JSON containing batch_size and structured failure_summary lists. It is forced to output only defect patches (Patches), not rewrite existing document content; the patch array includes specific operation types (append, insert_after, replace, delete), target location text, and new content.
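A patch contract like the one described can be applied with plain string operations. In the sketch below, batch_size and failure_summary are the fields named above, while the patches key and the op, target, and content field names are assumptions about the JSON shape.

```python
import json

def apply_patch(doc, patch):
    """Apply one structured patch to the skill document (first match only)."""
    op = patch["op"]
    target = patch.get("target", "")
    content = patch.get("content", "")
    if op == "append":
        return doc.rstrip("\n") + "\n" + content + "\n"
    if target not in doc:
        raise ValueError(f"target text not found: {target!r}")
    if op == "insert_after":
        return doc.replace(target, target + "\n" + content, 1)
    if op == "replace":
        return doc.replace(target, content, 1)
    if op == "delete":
        return doc.replace(target, "", 1)
    raise ValueError(f"unknown operation: {op!r}")

# Hypothetical analyst output; only the operation types come from the text.
analysis = json.loads("""
{
  "batch_size": 8,
  "failure_summary": ["Values written before formulas were verified"],
  "patches": [
    {"op": "insert_after", "target": "- Check headers first.",
     "content": "- Verify formulas before writing values."},
    {"op": "append", "content": "- Save the workbook before exiting."}
  ]
}
""")

doc = "# Skill\n- Check headers first.\n"
for patch in analysis["patches"]:
    doc = apply_patch(doc, patch)
```

Raising on a missing target keeps a hallucinated anchor from silently turning into a no-op edit.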

2. Success Attribution Flow (analyst_success.md): Correspondingly, the optimizer observes successful samples to extract generalizable behavioral patterns. It is restricted to proposing only those operational regularities "not yet covered by the current Skill Document," preventing meaningless document bloat from duplicate instruction appending.

3. Merging and Adjudication (merge_final.md): The system generates multiple independent patch pools. At the final integration node, the merging rules explicitly stipulate that failure-fix patches enjoy absolute priority: if a failure patch conflicts directly with a success pattern, the system defaults to retaining the failure-fix logic. This node is also prohibited from touching the cross-Epoch read-only protected region.
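The priority rule can be sketched as a merge in which failure-fix patches claim their targets first. Treating two patches as conflicting when they edit the same target text is an assumption of this sketch; in SkillOpt the adjudication is done by the Optimizer Model following the prompt, not by a fixed function.

```python
def merge_patches(failure_patches, success_patches):
    """Merge two patch pools, letting failure fixes win any conflict.

    A conflict is approximated here as two patches sharing the same target text.
    """
    claimed = {p["target"] for p in failure_patches if p.get("target")}
    merged = list(failure_patches)
    for patch in success_patches:
        if patch.get("target") in claimed:
            continue   # a failure fix already edits this location
        merged.append(patch)
    return merged

failure_pool = [
    {"op": "replace", "target": "- Write values directly.",
     "content": "- Verify formulas, then write values."},
]
success_pool = [
    {"op": "insert_after", "target": "- Write values directly.",
     "content": "- Writing values directly is fast."},
    {"op": "append", "content": "- Inspect sheet names before reading."},
]
merged = merge_patches(failure_pool, success_pool)
```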

4. Ranking and Truncation (ranking.md): This module implements the text learning rate. The optimizer receives all valid patches and scores them on four dimensions: systemic impact (rules solving 50% of failures prioritized over edge cases), complementarity (filling existing skill gaps), generality (abstract principles over specific entity bindings), and executability (concrete guidance over vague suggestions). The system finally truncates output to the required number of edit indices, completing learning rate control.
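The truncation step can be pictured as a sort over the four scores followed by a cut at the current edit budget. In the framework the Optimizer Model does the scoring; the numeric 1-5 scale, the equal weighting, and the id field below are assumptions used to make the sketch deterministic.

```python
DIMENSIONS = ("systemic_impact", "complementarity", "generality", "executability")

def select_top_k(scored_patches, k):
    """Deduplicate by patch id, rank by total score, keep the top k."""
    seen, unique = set(), []
    for patch in scored_patches:
        if patch["id"] not in seen:
            seen.add(patch["id"])
            unique.append(patch)
    ranked = sorted(
        unique,
        key=lambda p: sum(p["scores"][d] for d in DIMENSIONS),
        reverse=True,
    )
    return ranked[:k]

def scores(impact, complement, general, executable):
    return dict(zip(DIMENSIONS, (impact, complement, general, executable)))

candidates = [
    {"id": "A", "scores": scores(5, 4, 4, 5)},   # total 18
    {"id": "B", "scores": scores(2, 3, 2, 3)},   # total 10
    {"id": "C", "scores": scores(4, 4, 5, 4)},   # total 17
    {"id": "A", "scores": scores(5, 4, 4, 5)},   # duplicate of A
]
selected = select_top_k(candidates, k=2)
```

Passing the step's edit budget as k is what ties this module to the text learning rate: a smaller budget late in training keeps only the highest-scoring edits.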

Experimental Results

Despite its deliberately restrained design, SkillOpt's experimental results are dominant. The benchmarks cover six major suites: SearchQA (search QA), SpreadsheetBench (complex spreadsheet code operations), OfficeQA (document reasoning), DocVQA (visual QA), LiveMathematicianBench (math reasoning multiple-choice), and ALFWorld (multi-step embodied environment decision-making).

Test models span top-tier GPT-5.5 series, GPT-5.4 series variants (mini, nano), and open-source small model systems like Qwen3.5-4B and Qwen3.6-35B-A3B. Execution environments comprehensively cover Direct Chat, Codex Harness with sandbox, and Claude Code Harness. All final reported metrics come from completely isolated independent test sets (Test Split).

4.1 Absolute Gains and Extreme Edit Economics

Across 52 model-test-environment grid cells, SkillOpt achieves best or tied-best in all 52 cells. It not only crushes the No-Skill baseline but comprehensively defeats human-written expert skills, single-step LLM-generated skills, and dynamic evolution frameworks like Trace2Skill, TextGrad, GEPA, and EvoSkill.

In GPT-5.5 Direct Chat mode, compared to the zero-skill baseline, SkillOpt lifts the average accuracy across six test sets from 58.8% to 82.3% (absolute gain +23.5 points). The most striking gains appear in process-critical domains: SpreadsheetBench jumps from 41.8% to 80.7%, OfficeQA soars from 33.1% to 72.1%. These gains radiate to small models too; GPT-5.4-nano on ALFWorld doubles from 34.3% to 69.4%.

What's staggering is the Edit Economy behind these leaps. The final deployable artifact best_skill.md is remarkably short, consistently ranging from 379 to 1995 tokens (median ~920 tokens). More critically, after many epochs of intense search, the number of modifications that actually pass the validation set and persist into the final document (Edits) is merely 1-4 across all benchmarks (median 2.5). For instance, LiveMathematicianBench's +29.3 point absolute gain stems from just 1 accepted core edit. This is the framework's strongest evidence: the validation set gate acts like a sieve that rejects the vast majority of candidate edits as overfitting noise, and the 1-4 statements that survive capture the domain's essential procedural knowledge.

4.2 Training Cost Quantification

Introducing a powerful optimizer for multi-round interaction inevitably incurs Token costs. The paper reports the training-token cost of each percentage point of absolute test-set gain. On execution benchmarks with shorter trajectories (spreadsheet processing, math), each point of gain costs 0.6M-3.6M training tokens (e.g., OfficeQA's +39 points consumed 20.8M tokens). On the multimodal long-document reading benchmark (DocVQA), cost per point surges to 46.4M tokens. The core advantage: this compute cost is a one-time offline payment. After skill extraction, the lightweight Markdown document is deployed online with no additional optimizer calls and no extra weights to load at inference.

4.3 Three-Dimensional Transfer Tests: Cross-Model, Cross-Environment, Cross-Dataset

This thin document, refined through layers of truncation and interception, exhibits high generalization.

Cross-Model Transfer: Spreadsheet skills trained with GPT-5.4 as both target and optimizer, deployed unchanged to tiny GPT-5.4-mini for zero-shot inference, retain ~82% of the original gain (+9.4 vs. original +11.4). On some math tasks, feeding strong-model-distilled documents to weak models (GPT-5.4-nano) even surpasses the weak model's self-distilled ceiling (28.8% vs. 27.2%).

Cross-Harness Transfer: This is the most practically valuable test. Spreadsheet skills trained in OpenAI's Codex sandbox and ported directly into Anthropic's Claude Code execution loop yield a +59.7-point absolute gain for the latter, a result slightly above the 80.4% that Claude Code reaches with its own full SkillOpt optimization in its native environment. Because the underlying tool APIs are completely different, this suggests the optimizer extracted not mere command-line memorization but high-level methodologies like "check workbook structure, prioritize formula verification, solidify static values."

Cross-Benchmark Transfer: Skills trained on OlympiadBench, ported directly to the completely differently formatted Omni-MATH for closed-book exams, show universal positive gains across model specs (+1.3 to +3.7). This reaffirms the engineering value of minimal text updates in noise isolation.

4.4 Ablation Analysis: Stripping Optimizer Strength

The team designed a rigorous control: what happens if, during training, the top-tier GPT-5.5 optimizer is removed and replaced with a weak model identical to the trainee (e.g., GPT-5.4-mini) for "self-guidance"? With all core mechanisms (learning rate bounds, validation gating, slow updates) held fixed, the weak optimizer still recovers 56%-74% of the strong optimizer's gains. This undercuts the skepticism that "gains only come from a strong teacher" and indicates that the constrained optimization workflow itself is the main lever for Agent capability. Haphazard modifications that ignore buffers and learning rates and skip validation sets are the true culprits behind production Agent fragility.

Conclusion: Restraint and Boundaries

SkillOpt demonstrates a restrained and rigorous methodology, and it is a welcome course correction for the current hype around Agent capability expansion. If your team is building Agent systems that run long-term in specific business environments (e.g., specific-format financial report extraction, specific-source quant data cleaning, complex multi-step investment research/filing processing), this methodology provides a fully modular, ready-to-adopt reference example.