This latest research from Microsoft might change the way you write Skill documents.
I'm sure you've been writing various Skills and documents for your Agents: CLAUDE.md, best_skill.md, agent instructions...
Like me, you might spend an hour, two hours, or even half a day meticulously polishing a set of instructions, hoping the Agent will become smarter.
But the conclusion of Microsoft's paper is a bit of a reality check: The one you handwrote is probably not the optimal one.
Microsoft's research team proposed a method called SkillOpt. The core idea is:
Treat the Skill document as the "weights" of a neural network and automatically optimize it in a manner similar to training a neural network.
SkillOpt Architecture Overview (Source: Paper)
The result? The method achieved optimal or tied-for-optimal performance in all 52 test combinations, with an average gain of 23.5 points, crushing human-written Skills.
About Skills
Tools like Claude Code, Codex, and Cursor already support users in writing an "instruction document" to guide the Agent's behavior. Whether it's Claude Code's CLAUDE.md, Codex's Agents.md, or various Skill documents, they all share a common trait:
A block of plain text instructions that tells the Agent what to do in specific situations.
For example, if you write a line like "When encountering Excel formulas, first check the worksheet structure, then write static values instead of relying on Excel's auto-calculation," the Agent will follow it when handling tasks on benchmarks like SpreadsheetBench.
This seems perfectly logical, nothing special.
But the problem is, how do you know the few rules you wrote are the best ones?
Based on your experience, you wrote 5 rules. You might have missed 3 critical ones, and 2 might not be precise enough. The more troublesome part is... you have no idea what you missed because you can't exhaust all possible ways to write them.
Finding the balance between flexibility and guidance is incredibly difficult.
SkillOpt's starting point is: Since humans can't write them well, let the AI optimize its own instruction manual.
The Training Loop
The core idea of SkillOpt can be summarized in one sentence:
The Skill document is the Agent's only mutable external state, so train it as if it were "weights."
The Agent's model parameters are frozen and cannot be changed, but the Skill document is just plain text that can be freely edited. Given this, why not optimize this document iteratively with a complete optimization process, just like training a neural network?
Once you start thinking this way, the whole methodology falls into place naturally.
SkillOpt Training Process (Source: Paper)
Let's look at how various concepts from deep learning map to the text space here.
When training a neural network, you feed data in for a forward pass. The corresponding operation in SkillOpt is called a rollout: having the Agent perform a batch of tasks using the current Skill document and collecting the completion statuses.
After the forward pass, you calculate the gradient. The corresponding operation in SkillOpt is called a reflection: using an optimizer model to analyze which tasks failed, why they failed, and extracting directions for improvement.
Once you have the gradient, you update the weights. The corresponding operation in SkillOpt is called an edit: performing three types of structured edits on the Skill document – add, delete, and replace.
During training, a learning rate controls the step size. SkillOpt also has a textual learning rate: a maximum of only L_t rules can be changed per round (default 4), with a cosine decay schedule.
Finally, during training, you use a validation set to checkpoint and save the optimal model. SkillOpt similarly has validation gating: after an edit, it runs on a validation set. If the score doesn't increase, the modification is rejected.
The entire process is essentially a one-to-one translation of the deep learning training loop into a text-editing loop.
The Division of Labor Between Two Models
SkillOpt uses two models.
One is called the target model, which is the Agent you usually use, like GPT-5.5 or Claude. It's responsible for performing tasks using the Skill document. The model itself is frozen and not modified.
The other is called the optimizer model, a super-powerful, frontier-level model responsible for analyzing the target model's performance and proposing modifications.
As an analogy, the target model is like the worker on a factory floor, and the optimizer model is like a management consultant standing by and observing. The worker follows the operations manual, and the consultant observes where the worker is falling short and then revises the manual.
This division of labor brings a key benefit: The cost of the optimizer model is only incurred during the training phase. It's absolutely not needed for deployment.
The paper also tested the effect of using a same-tier model as the optimizer (e.g., using GPT-5.4 to optimize GPT-5.4's own Skill).
The result was that a same-tier optimizer can also work, recovering approximately 56%-74% of the gains from a stronger optimizer. However, using a more powerful optimizer yields noticeably better results, as it can see problems the target model itself cannot.
The Art of Restraint
One specific design choice is that SkillOpt changes at most 4 rules per round.
Intuitively, you might think: since AI is optimizing, why not let it rewrite the entire document in one shot?
The research team actually tried that... and concluded: Unbounded changes performed worse.
Unbounded rewriting scored 2-3 points lower than setting L_t=4.
The reason is easy to understand. Just as a learning rate that's too high causes oscillations when training a neural network, changing too much at once mixes good and bad revisions. The validation set can't accurately judge which changes are useful.
Another design is the rejected-edit buffer.
Edits rejected by the validation set are not simply discarded; they are stored in a buffer. In subsequent reflection phases, these "lessons learned" are visible, preventing the optimizer from repeating the same mistakes.
This acts like negative feedback during training, giving the optimization process a memory.
Another critical mechanism is the slow/meta update, analogous to momentum in deep learning.
At the end of each epoch, the optimizer reviews the Skill documents from the current and previous epochs and performs a cross-epoch longitudinal update. The content of these slow updates is protected and cannot be overwritten by step-level edits.
Ablation studies showed that removing the slow/meta update causes the score on SpreadsheetBench to plummet from 77.5 to 55.0—a drastic drop of 22.5 points.
Sometimes, restraint is more effective than aggression.
Wildly Effective Results
So, after all this design talk, how are the results?
One word: devastatingly good.
The research team tested on 6 benchmarks, covering single-turn Q&A, multi-turn code generation, document operations, multimodal document understanding, mathematical reasoning, and embodied environment interaction.
Compared to the results from a simple chat with GPT-5.5:
SearchQA: 77.7 → 87.3 (+9.6)
SpreadsheetBench: 41.8 → 80.7 (+39.0)
OfficeQA: 33.1 → 72.1 (+39.0)
DocVQA: 78.8 → 91.2 (+12.4)
LiveMath: 37.6 → 66.9 (+29.3)
ALFWorld: 83.6 → 95.5 (+11.9)
The change curve of scores during training.
SpreadsheetBench and OfficeQA each jumped 39 points... This is far from a minor incremental tweak; it’s almost a qualitative change from "barely usable" to "quite capable."
And it's not just effective in direct conversation scenarios. It shows an average improvement of +24.8 points in the Codex execution environment and +19.1 points in the Claude Code execution environment.
52 test grids, all achieved best or tied-for-best results. Not a single loss.
Crushing Human Handwriting
You might ask: How does it compare to human-written Skills?
The research team made a specific comparison.
SkillOpt vs. all baseline methods.
Using carefully crafted, human-written Skill documents (145-516 tokens) as a baseline, SkillOpt's average score on direct GPT-5.5 conversation is 82.3. In contrast, the combined average of the "best-choice-per-benchmark" from all other methods, including human handwriting, was only 76.9.
That means, even if you hand-pick the best-performing baseline method for each benchmark, the combined average score still loses to SkillOpt.
The compared methods include: One-shot LLM-generated Skills, Trace2Skill (distilling from trajectories), TextGrad (gradient-style optimization), GEPA (Pareto reflection evolution), and EvoSkill (skill folder evolution).
All of them, crushed.
What Did the AI Learn?
What do the optimized Skill documents actually look like?
The paper showcases several learned rules. After reading them, you'll feel that these are rules humans would truly struggle to come up with, but once you see them, you'll think, "Yes, that's exactly how it should be written."
SearchQA: Infer the expected answer type from the phrasing of the clue, then select the shortest canonical entity from the co-occurring unique evidence. (Tells the Agent not to give long-winded answers, but to precisely pinpoint the shortest, canonically-named entity.)
SpreadsheetBench: First check the workbook structure and formulas, then write computed static values throughout the target range of the request instead of relying on Excel's auto-calculation. (Captures a critical bug: many Agents write Excel formulas and expect auto-calculated results, but in automated environments... this is often unreliable.)
ALFWorld: Maintain a horizon-aware list of visited/frontier locations, switch search directions after consecutive failures of the same type, and avoid revisiting the destination before obtaining the target object. (Teaches the Agent to perform spatial memory management in a virtual environment, preventing it from going in circles in the same place.)
These rules share a few common characteristics:
Extremely Specific: There's no filler like "check carefully" or "think hard"—none of those empty platitudes your boss loves. Every single rule is precise down to the operational level.
Counter-Intuitive: They address scenarios a human would never even consider when writing a Skill.
Compact: The final Skill files are only 379-1995 tokens, with a median of about 920 tokens. For some benchmarks, even a single accepted modification was enough to boost the score by 39 points.
The Evolution Process
Just looking at the final rules might not leave a deep impression.
So, the paper also shows the complete evolution process of the Skill document, letting you see how a blank Skill grows step-by-step into its final version.
Take ALFWorld as an example:
Initial State: A generic instruction: search, transform, place. Something like "find stuff, process it, put it in the designated location."
Round 1 Rollout: Found the Agent frequently couldn't locate the target object, repeatedly searching the same room. Added a rule: Remember the places you've been, don't revisit them.
Continued Iteration: Found the Agent would lose the item on the way after picking it up. Added a rule: Lock in progress after picking up an item; no superfluous actions.
Deep Optimization: Introduced rules for a loop detector, exact object name matching, and more.
Result: 49.3 → 74.6 (+25.3). From barely functional to quite capable.
The evolution of SpreadsheetBench is very similar. The initial Skill was just a generic automation command. After a few rounds of optimization, the Agent learned a series of meticulous operations: first check the workbook's headers and ranges, perform key normalization, use static values instead of formula dependencies, and retain auxiliary calculation columns.
Final effect: From 40.4 points to 78.9, an increase of 38.5 points.
All these evolutionary processes show: a good Skill document isn't something one person sits and thinks up. It should be something that emerges from practice.
Cross-Model and Cross-Environment
Another feature of SkillOpt is that the optimized Skills can be transferred across models and execution environments.
Cross-model Transfer: GPT-5.4 → GPT-5.4-mini (SpreadsheetBench) (+9.4).
Cross-environment Transfer: Codex → Claude Code (SpreadsheetBench) (+59.7).
Cross-task Transfer: OlympiadBench → Omni-MATH (GPT-5.4) (+3.7).
In other words, a Skill optimized with one model, change the model... change the tool... even change the task, and it's highly likely still effective.
The training cost is a one-time investment (offloaded offline), while the extra cost during deployment is zero. The optimized Skill file is a block of plain text that you can just pick up and use directly.
Training Cost
So, how much does this optimization process cost?
The data given in the paper is: for process-oriented benchmarks (SearchQA, DocVQA), every 1 absolute test score point of improvement requires 0.6–3.6M training tokens. For complex trajectory-based benchmarks (SpreadsheetBench, ALFWorld), it requires 37.9–46.4M tokens.
This cost isn't actually expensive (one could even say it's quite cheap), and the key point is: you only need to train it once.
The trained Skill file incurs no additional cost each time it's used. If your Agent has to run thousands or tens of thousands of tasks, this paltry training cost is amortized quickly, but the magnitude of improvement it brings is clearly well worth it.
It's like spending a sum of money to hire a consultant to write an operations manual, and then all future employees just follow the manual. The consultant can be dismissed after a single use, of course...
One Size Fits All
The paper also tested the performance of models of different scales.
Besides frontier large models like GPT-5.5, GPT-5.4, and GPT-5.2, the research team also experimented on smaller models such as GPT-5.4-mini, GPT-5.4-nano, Qwen3.5-4B, and Qwen3.6-35B-A3B.
The results show that all models of all scales see consistent improvement.
This indicates you don't necessarily need the most expensive model to benefit from SkillOpt. Even a small model paired with an optimized Skill can outperform a raw, large model.
Another piece of data is the impact of training data volume on the effect.
Using SpreadsheetBench as an example, with 1% of the training data for optimization, the score was 47.5. Using 100% of the training data, the score rose to 78.0.
The more data, the better the Skill optimization. But even with a modest amount of data, SkillOpt still delivers significant improvements.
Try It Yourself
Microsoft has fully open-sourced SkillOpt (under the MIT license), so you can run it directly.
Installation is simple:
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .Then configure your API keys:
cp .env.example .env
# Fill in your API keys, then source
source .env
# Azure OpenAI (Recommended) (After all... it’s Microsoft’s own product)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-key"
# Or use OpenAI directly
export OPENAI_API_KEY="sk-..."
# Anthropic Claude is also supported
export ANTHROPIC_API_KEY="sk-ant-..."After that, a single command starts the training:
python scripts/train.py \
--config configs/searchqa/default.yaml \
--split_dir /path/to/your/searchqa_split \
--optimizer_model gpt-5.5 \
--target_model gpt-5.5 \
--num_epochs 4 \
--batch_size 40It currently supports 6 benchmarks: SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, and ALFWorld. Each has a corresponding configuration file, placed in the configs/ directory.
After training is complete, the output directory will contain a best_skill.md, which is the final, optimized Skill file, ready to be used directly:
outputs/<run_name>/
├── best_skill.md # The optimal Skill document
├── history.json # Training history
├── skills/skill_vXXXX.md # Snapshots of each step
└── steps/step_XXXX/ # Patches and evaluations for each stepYou can run the evaluation separately:
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill outputs/my_run/best_skill.md \
--split valid_unseen \
--split_dir /path/to/searchqa_splitEven more, the team even provides a WebUI for real-time monitoring of the training process:
pip install -e ".[webui]"
python -m skillopt_webui.app --port 7860The entire project also supports resuming training from a checkpoint. If interrupted, re-running the same command will automatically continue from the last completed step.
Stop Handwriting from Now On
The signal conveyed by this Microsoft paper is clear:
Handwriting Skill documents might be a good starting point, but it shouldn't be the end goal.
Of course, the paper honestly acknowledges the method's limitations: SkillOpt requires tasks with an automatically evaluable standard (exact match or an auto-scorer). Open-ended tasks are not yet suitable for it.
To summarize, the core loop of SkillOpt is:
Make the Agent do tasks → Analyze failure causes → Generate edit suggestions → Validate on a validation set → Accept or reject. You can absolutely mimic this process manually: observe which tasks the Agent makes mistakes on, analyze the error patterns, supplement rules in a targeted manner, and then verify the effect.
Skill documents shouldn't be a one-time write-and-forget thing. Like model weights, they should be continuously optimized.
Relevant Links
Paper: https://arxiv.org/abs/2605.23904
Project Homepage: https://microsoft.github.io/SkillOpt/
GitHub Code: https://github.com/microsoft/SkillOpt
Demo Video: https://youtu.be/JUBMDTCiM0M
Related Project SkillLens: https://microsoft.github.io/SkillLens/