Are Your Custom Skills Slowing Down the Model? Strategy Genes Are the Real Answer

There is a kind of "Agent mysticism": you have clearly defined the task background, broken down the workflow, packed in common pitfalls, API usage, example code, and precautions, and even written a lengthy Skill document for the purpose. Yet when the same type of task comes up again, the model may still make a mistake in exactly the same spot.

This approach shares a common premise: that experience, when stored, recalled, invoked, and fed back to the model as content, will lead to improvement.

Digging deeper into this phenomenon reveals a useful but counter-intuitive finding: a comprehensive, detailed document is not necessarily a high-quality control object.

This is where the industry has fundamentally misunderstood Skills. Everyone treats Skills as the endpoint of intelligence reuse, ignoring that the model does not "read" a document. Within a limited inference budget, it looks for three things: which strategy to apply next, which behaviors must be avoided, and which constraints have the highest priority.

For human engineers, completeness implies security and standardization; but for the model, completeness often means diluted signals, watered-down priorities, and control drowned out by background material. In other words, the strength of Skills lies precisely in their ability to serve human understanding, not to serve the model's decision-making in immediate tasks.

Recently, the EvoMap team (Infinite Evolution Lab × Tsinghua University) conducted systematic research on this issue and proposed a memorable new concept: the Gene. The name is borrowed from biology, where genes are DNA fragments that encode proteins and carry information inherited across generations. By analogy, an Agent's Gene is a verifiable, reusable knowledge asset distilled from experience through the GEP (Gene Evolution Protocol) mechanism.

Paper Title: From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
Authors: Junjie Wang, Yiming Ren, Haoyang Zhang
Institutions: Infinite Evolution Lab (EvoMap) × Tsinghua University
arXiv: https://arxiv.org/abs/2604.15097
Evolver (Evolution Engine): https://github.com/EvoMap/evolver
CritPt Task Reproduction Repository: https://github.com/EvoMap/critpt-openclaw-reproducible-70

Through 4,590 controlled experiments across 45 scientific code scenarios, plus end-to-end validation on the CritPt benchmark, the paper demonstrates that:

When the same underlying experience is injected into the model, the complete Skill package actually performs worse than the no-guidance baseline, whereas the Gene object, which is more than ten times shorter, consistently wins.

This finding is not only about the moment of "writing the prompt"; it runs through the design of how an Agent keeps evolving at test time. Often, what determines whether an Agent is smart is not "how much experience you have stored," but "what shape that experience takes the moment it returns to the model."

What does this imply? Today, when the industry discusses Agent optimization, the keywords are always: stronger base models, longer context windows, more advanced RAG, and more complex memory systems. However, Gene shows that the key to experience reuse is not giving the model more content-heavy prompts, but crafting experience into a compact, control-oriented, and sustainably evolvable object. This aspect has been almost entirely overlooked by the Agent community.

What is Gene?

The EvoMap team's research found that experience objects intended for models should be designed based on "control density" rather than "document completeness."

However, the team did not stop at this empirical observation. After solidifying the phenomenon through 4,590 controlled experiments, the EvoMap team defined a reproducible, mutable, and heritable solution strategy. Gene is part of a complete three-layer framework for objects:

Gene: Contains four types of signals: keywords, summary, strategy, and AVOID. It can be directly injected as a test-time control fragment. It serves as a reusable evolutionary strategy template for Agents, defining "under what circumstances, do what thing, abiding by what constraints"—equivalent to encoding prior knowledge.

A complete Gene includes fields such as signals, strategy, constraints, validation, and a unique asset_id.

Within a very small token budget, a Gene has very high control density. It specifies the trigger signals the model should react to (supporting substring matching, regex, and multi-language aliases), the ordered executable steps together with how to verify their execution, and the safety boundaries (limits on the scope of changes and paths that must not be touched). A SHA-256 content-addressable hash guards against tampering.
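To make the field list and the content-addressable hash more concrete, here is a minimal sketch in Python. The field names come from the description above; the types, the canonical-JSON serialization, and the "sha256:" prefix are assumptions for illustration and are not taken from the GEP specification.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Gene:
    """Hypothetical Gene layout: field names follow the article, types are assumed."""
    signals: tuple[str, ...]      # trigger keywords or regex patterns
    strategy: tuple[str, ...]     # ordered, executable steps
    constraints: tuple[str, ...]  # safety boundaries, e.g. paths that must not be touched
    validation: tuple[str, ...]   # checks to run after execution

    def asset_id(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so that equal content
        # always yields the same bytes and therefore the same hash.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


gene = Gene(
    signals=("uv-vis", "peak detection"),
    strategy=("Detect peaks with prominence-based criteria",),
    constraints=("Do not modify files outside the task directory",),  # illustrative
    validation=("FWHM is reported in wavelength units",),             # illustrative
)
same_content = replace(gene)  # identical content, new object
tampered = replace(gene, strategy=gene.strategy + ("Skip validation",))
```

Because the identifier is derived from the content, `same_content` has the same `asset_id()` as `gene`, while `tampered` does not. That is the property that lets a receiver detect any change to a Gene.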

Capsule: Verified task-level execution paths + audit records.

Event: Immutable evolution logs.

These three components are linked by a six-stage cycle, forming the GEP (Gene Evolution Protocol):

See details: https://evomap.ai/wiki/16-gep-protocol

In plain language, the entire operational flow is as follows:

First, distill past failures, successes, and repair paths into a Gene (not writing documentation, but writing traceable control signals);
When a new task arrives, Scan the task context → Match the most relevant Gene → Inject as a System Instruction;
After execution, write the result back in the form of an Event, triggering Validate / Mutate / Solidify operations on the Gene—allowing the Gene pool itself to continuously evolve without updating the base model parameters.
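The scan, match, and inject steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Evolver's actual implementation: `GeneRecord`, `score`, `match_gene`, and `build_system_instruction` are hypothetical names, the scoring rule (count of matching patterns) is invented for the example, and the write-back of Events is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class GeneRecord:
    name: str
    patterns: list[str]  # substrings or regexes, matched case-insensitively
    strategy: list[str]  # ordered steps to inject


def score(gene: GeneRecord, task_context: str) -> int:
    """Count how many of the Gene's trigger patterns occur in the task context."""
    return sum(1 for p in gene.patterns if re.search(p, task_context, re.IGNORECASE))


def match_gene(pool: list[GeneRecord], task_context: str):
    """Return the best-matching Gene, or None when no pattern triggers."""
    best = max(pool, key=lambda g: score(g, task_context), default=None)
    if best is None or score(best, task_context) == 0:
        return None
    return best


def build_system_instruction(gene: GeneRecord) -> str:
    """Format the matched Gene as a compact block for the system-instruction slot."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(gene.strategy, start=1))
    return f"Strategy ({gene.name}):\n{steps}"


pool = [
    GeneRecord(
        "uvvis_peaks",
        ["uv-vis", r"fwhm|peak\s+width"],
        [
            "Detect peaks with prominence-based criteria",
            "Convert min_distance into sample-index units before peak detection",
        ],
    ),
    # Second entry is purely illustrative, to give the matcher something to reject.
    GeneRecord("ode_stiff", ["stiff", r"solve_ivp"], ["Prefer an implicit solver"]),
]

task = "Compute the FWHM of each UV-Vis absorption peak"
chosen = match_gene(pool, task)
instruction = build_system_instruction(chosen) if chosen else ""
```

Returning `None` when nothing matches matters here: as the perturbation results later in the article suggest, injecting a Gene from the wrong domain can be worse than injecting nothing.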

How Gene Outclasses Skill

All data comes from the same experimental pipeline: two fixed models, Gemini 3.1 Pro Preview (Pro) and Gemini 3.1 Flash Lite Preview (Flash), evaluated by sandbox execution with checkpoint pass rate as the metric, at temperature T=0.05 and a maximum output of 16,384 tokens.

Skill loses to Gene not in quality, but in form.

The paper first made the most direct comparison: The same underlying experience was packaged into a ~2,500 token Skill package and a ~230 token Gene object respectively.

The complete Skill package performed 1.1 percentage points (pp) lower than the no-guidance baseline on average across both models, while the shorter Gene performed 3.0 pp higher. The most striking part is: Skills weren't uniformly bad; they showed improvement on the weaker Flash model (41.8→49.0), but on the stronger Pro model, they severely dragged performance down (60.1→50.7)—the long Skill directly suppressed Pro's inherent capabilities.
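The averaged figure can be checked directly against the per-model numbers quoted above. The short calculation below assumes the average is an unweighted mean over the two models, which the article does not state explicitly.

```python
# Pass rates (%) quoted in the text: no-guidance baseline vs. full Skill package.
baseline = {"Flash": 41.8, "Pro": 60.1}
skill = {"Flash": 49.0, "Pro": 50.7}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Per-model change in percentage points, rounded to one decimal place.
per_model_delta = {m: round(skill[m] - baseline[m], 1) for m in baseline}

# Change of the two-model average, again in percentage points.
avg_delta = round(mean(skill.values()) - mean(baseline.values()), 1)

print(per_model_delta)  # {'Flash': 7.2, 'Pro': -9.4}
print(avg_delta)        # -1.1
```

The +7.2 pp gain on Flash and the 9.4 pp loss on Pro nearly cancel, which is why the headline average of -1.1 pp hides two opposite effects.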

"Procedural skill," the most common document-style experience package today, usually contains: overview, workflow, pitfalls, error handling, API notes, examples, and scripts. Experiments revealed exactly which parts were effective:

Only the Workflow section played a serious role; the Overview was actually the largest negative contributor to the entire text. The useful signal in a Skill is sparse, concentrated in a small segment of procedural content, while the vast amount of material serving "human readability" actually dilutes or even pollutes the control signal.

Skill loses to Gene not due to knowledge volume or information density, but due to the choice of the controlled object.

Stuffing materials meant for human viewing into the model's execution budget instead becomes control noise.

Gene is not just "less is more, more is confusion" prompting.

Reading this, the easiest counter-argument to arise is: "Does Gene win simply because it's short and doesn't hog context?"

In reality, Gene addresses three types of evolutionary intent regarding failure:

The paper specifically used a budget alignment experiment to truncate the effective part of the Skill to match Gene's 230 tokens:

With exactly the same budget, Gene still came out ahead. Truncation did keep the Skill from scoring below the baseline, but however it was cut, it could not reach Gene's level.

The paper also performed progressive construction to see which layer inside Gene was actually working:

Note the second row: keywords + summary actually reverted to the no-guidance baseline. What truly lifted performance was the strategy layer. With the same word count, organizing it as a "summary" is useless; organizing it as a "strategy" is effective.

Gene is not a shorter prompt; it is an object of a different morphology. What determines model behavior is the control structure, not the token count; the strategy layer is indispensable.
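One way to picture the progressive construction is a renderer that emits the Gene layer by layer, so that each variant can be injected and scored separately. The sketch below is hypothetical: the layer order, the labels, and the `render_gene` helper are assumptions for illustration, not the paper's code.

```python
LAYER_ORDER = ("keywords", "summary", "strategy", "avoid")  # assumed order
LABELS = {
    "keywords": "Domain keywords",
    "summary": "Summary",
    "strategy": "Strategy",
    "avoid": "AVOID",
}


def render_gene(gene: dict, upto: str) -> str:
    """Render the Gene cumulatively, stopping after the layer named `upto`."""
    if upto not in LAYER_ORDER:
        raise ValueError(f"unknown layer: {upto}")
    lines = []
    for layer in LAYER_ORDER:
        value = gene[layer]
        if isinstance(value, list):
            # List-valued layers become a numbered block under their label.
            lines.append(f"{LABELS[layer]}:")
            lines.extend(f"  {i}. {item}" for i, item in enumerate(value, start=1))
        else:
            lines.append(f"{LABELS[layer]}: {value}")
        if layer == upto:
            break
    return "\n".join(lines)


gene = {
    "keywords": "uv-vis, peak detection, FWHM, unit conversion",
    "summary": "Detect peaks and compute wavelength-domain peak properties correctly",
    "strategy": [
        "Detect peaks with prominence-based criteria",
        "Convert min_distance into sample-index units before peak detection",
    ],
    "avoid": ["reporting the raw output of peak_widths directly as FWHM"],
}

# One injection variant per cumulative layer.
variants = {layer: render_gene(gene, layer) for layer in LAYER_ORDER}
```

Here `variants["summary"]` holds only the keywords and summary lines, the configuration that fell back to the no-guidance baseline, while `variants["strategy"]` adds the ordered steps that the experiment credits with the gain.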

In the paper's perturbation experiments, the most counter-intuitive finding was that a stale_paradigm Gene, written around an outdated algorithm paradigm, achieved 56.6%, higher than the clean Gene's 54.0%. Swapping in the wrong algorithm, however, dropped the score to 48.8%, and swapping in the wrong domain dropped it to 49.4%. The perturbations that hurt sit right next to the one that helped.

These two results together complete the picture: The effective condition for Gene is "retaining the task-relevant control framework," not "how new the writing is." Expired methods still work well if the framework is correct; new methods will drag performance down if the framework is wrong. This comparison also hints at Gene's robustness boundaries: very tolerant structurally, but very picky semantically.

Summarizing Failure: The Optimal Form is Not Logs, but Distilled Warnings

Everyone building Agent systems faces a question: How should failure be stored?

Long trajectories? Reflection summaries? Error logs?

The key question the EvoMap team looked at was: If engineering budget is limited, in what form should failure return to the model?

The paper ran two sets of controls simultaneously.

Control 1: Failure placed in different carriers

Stuffing failure into Skill or free text resulted in performance below the no-guidance baseline in all cases.

Gene was the only carrier with a positive contribution—but even so, Gene + Failure performed worse than Gene alone (54.0 → 52.0).

Appending failure in its raw form actually diluted the Gene.

Control 2: In what form are failure and strategy mixed?

The strongest performer was neither a "Failure + Strategy" hybrid nor "Strategy only," but rather failure warnings only—distilling failure into independent "AVOID xxx" statements proved even stronger than retaining the strategy body itself.

In other words, the failure experience truly useful to an Agent does not grow into a "log," but into something like this (real AVOID examples from the paper's UV-vis spectroscopy scenario):

AVOID passing min_distance as a wavelength value to scipy.signal.find_peaks; convert to sample-index units first.
AVOID reporting the raw output of peak_widths directly as FWHM; convert back to wavelength units first.

The principle behind this is very clear: The accumulation of failure experience should be selective compression, not additive stacking.
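The two AVOID statements above both come down to unit conversion. scipy.signal.find_peaks takes its `distance` argument in samples, and scipy.signal.peak_widths returns widths in samples, so values must be converted on the way in and on the way out. The helpers below are a minimal sketch that assumes a uniformly spaced wavelength grid; the function names and the 0.5 nm spacing are illustrative.

```python
import math


def distance_nm_to_samples(min_distance_nm: float, step_nm: float) -> int:
    """Convert a minimum peak separation in nm to whole samples (at least 1)."""
    return max(1, math.ceil(min_distance_nm / step_nm))


def width_samples_to_nm(width_samples: float, step_nm: float) -> float:
    """Convert a peak width measured in samples back to nanometres."""
    return width_samples * step_nm


step_nm = 0.5  # hypothetical uniform spacing of the wavelength axis

# Would be passed as find_peaks(absorbance, distance=distance, ...).
distance = distance_nm_to_samples(10.0, step_nm)

# 12.4 stands in for a width returned by peak_widths(..., rel_height=0.5).
fwhm_nm = width_samples_to_nm(12.4, step_nm)

print(distance)  # 20
print(fwhm_nm)   # 6.2
```

On a non-uniform grid a single `step_nm` does not exist, and the width would instead have to be mapped back by interpolating the wavelength axis at the fractional left and right positions that peak_widths also returns.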

What does a Gene look like? A Minimal Verifiable Artifact

At this point, we should take a look at what a real Gene looks like. Below is an injection example from the paper's UV-vis scenario:

Domain keywords: uv-vis, peak detection, FWHM, unit conversion
Summary: Detect peaks and compute wavelength-domain peak properties correctly
Strategy:
  1. Detect peaks with prominence-based criteria
  2. Convert min_distance into sample-index units before peak detection
  3. AVOID: reporting raw peak_widths outputs as FWHM; convert them back to wavelength units first

Approximately 230 tokens, 5 fields. Its counterpart is the Skill package for the same experience:

Approximately 2,500 tokens, containing subsections like overview, workflow, pitfalls, API notes, examples, scripts, etc., resembling a README file in overall form.

In the paper's experiments, both used the same systemInstruction injection slot and the same set of sandbox evaluation scripts, so the control conditions were identical; the only difference lay in "what shape the injected content took."

The GEP protocol further standardizes this raw Gene into a verifiable object with fields like id, schema_version, signals_match, strategy, constraints, validation, and asset_id—aiming to make it matchable, replaceable, revisable, and combinable, rather than remaining "a nicely formatted prompt."

The Rules at the Protocol Layer Have Also Changed

The most brilliant aspect of Gene is that it did not confine the "experience object" to a clever Prompt trick, but went straight to the protocol layer.

In the test-time control (Inference) phase, the logic is very smooth: for the same scientific code problem, swapping the ~2,500 token Skill package for the ~230 token Gene control fragment immediately makes the model calculate more accurately.

But regarding the protocol layer, the EvoMap team made a more fundamental judgment: When experience objects are exchanged between multiple Agents, they must be an object, not a document.

Why? Because without a protocol, Gene is still just a prompt—unstable boundaries, incomparable fields, and non-cumulative. Once protocol-ized, Gene transforms from a "prompt fragment" into an object that is matchable, replaceable, revisable, and combinable, capable of being continuously revised, audited, traced, and used consistently across multiple Agents.

GEP is not just formatting details; it is the layer of protocol that elevates Gene from a test-time control object to a persistent strategy optimization interface.

Experimental Results: The "Free-Riding" Dark Horse on the CritPt Leaderboard

To speak with data, the EvoMap team ran Evolver end-to-end on CritPt, a public frontier physics benchmark.

CritPt is a dynamic dataset strictly simulating real-world physics research processes. Benchmark website: https://critpt.com/

Evolver is a complete system comprising "Base Model + Gene Pool + Evolution Engine + Toolchain". In this stack, OpenClaw serves as the host runtime, Evolver as the evolution engine, and Gene/GEP as the object and protocol layers. The recently popular Hermes Agent also "borrowed" from Evolver's design philosophy to some extent.

The full reproduction answers for the Benchmark 70 tasks can be found at (https://github.com/EvoMap/critpt-openclaw-reproducible-70).

As can be seen:

Evolver (Gene) 2026-02-16: Base Model A 9.1% → 18.57%, +9.47pp
Evolver (Gene) 2026-03-26: Base Model B 17.7% → 27.14%, +9.44pp

Without updating a single parameter and without any SFT/RL, purely by evolving the experience object layer, each base model gained more than 9 pp. Meanwhile, token cost dropped from $100 to less than $1.

[Figure: CritPt experimental results on the Gemini 3.0 base model, February 16.]

What Does Gene Bring to the Industry?

The Gene constructed by the EvoMap team has turned a vague "intuition" into a definable, auditable, evolvable, and test-time control-oriented experience representation methodology.

For the application layer, separating "Skill documents written for colleagues" from "control signals injected into the model at runtime" could be an almost zero-cost change with immediate effect. For researchers working on Agent long-term memory and Reflection, the takeaway is that the best form in which to retain failure is not trajectory logs or reflection summaries, but AVOID warnings. When GPU resources are tight, which experience to retain depends not only on whether it was collected correctly, but also on whether it fits the model's current execution budget.

Furthermore, in the setting of multi-Agent experience exchange, transmitting structured Gene objects is more suitable as protocol-layer payload than transmitting Skill documents—because only objects that can be matched, revised, and verified can truly accumulate and evolve among multiple parties.

Conclusion

Gene acts like a mirror, reflecting the essence of Agent experience reuse:

Agents are not "reading a manual," but rather "searching for what to do next and what must be avoided within a limited inference budget."

However, this is bidirectional—the shape of the experience object you feed the Agent, in turn, defines what it can evolve into.

While the whole AI community is locked in an arms race over longer context, fancier RAG, and more complex memory systems, the EvoMap team offers a remarkably simple clue:

The shortcut to making Agents continuously stronger is not writing more complete prompts, but crafting execution experience into a more compact, more controllable, and more evolvable object. This is useful on hard benchmarks like CritPt, and even more useful for multi-Agent experience exchange at the protocol layer, pointing out a path for future A2A swarm intelligence.

In the Agent era, the competition in the next stage is not just about larger models and longer contexts, but who can first find a better general solution for the efficiency of utilizing intelligent compute power.

Haoyang Zhang: Serial entrepreneur born after 1995, Founder & CEO of EvoMap, and author of the GEP (Genome Evolution Protocol). He is a prominent developer in the OpenClaw community; his Evolver plugin topped the ClawHub charts within 10 minutes and was downloaded 36,000 times in 72 hours, becoming the most widely known "self-evolving" tool and leading to the founding of EvoMap.
Junjie Wang: Chief Scientist at EvoMap. Research focus: Agent self-evolution, protocol layers, and experience object design. He holds a PhD from Waseda University and was a postdoc at Tsinghua University. He has long conducted systematic research on "how Agents can continuously become stronger at test-time" and is one of the main developers of Evolver.

Please contact our official account for reprint authorization.

Submissions or media inquiries: liyazhou@jiqizhixin.com
