The Age of Agent Skills: How Big is the Gap Between Strong and Weak Models? Shattering the 'Cheap Alternative' Illusion | Latest from Oxford

Currently, the industry's development focus is rapidly converging on frameworks centered around Agent Skills, such as Openclaw. A consensus has been reached: converting repetitive API chains into executable Agent Skills is the only path to solving the "context explosion" problem in long-cycle tasks. However, once the concept is established, the real deep-water combat zone begins. When facing the selection and configuration of Openclaw, you are likely to get stuck on the following four architectural questions:

  1. Skill Construction: Should we use an expensive strong model (such as Claude Opus 4.6) to solve the problem in one go, or stack attempts with weak models? How big is the difference?
  2. How exactly do multi-agents collaborate? If I have a Multi-Agent Lobster 🦞 team, how should functions be allocated between the main model and the edge models?
  3. Should the Skill tree be fully upgraded? Will complex nested calls exceed the capability limits of LLMs? Where is the boundary, and what is the most stable solution for production environments?
  4. How high is the hidden cognitive cost of Agent Skills as a generalized interface? Under what circumstances should an Agent decisively abandon Skills and call the underlying API directly?

The recent release of the "SkillCraft" benchmark by the University of Oxford provides extremely precise quantitative answers to these four specific pain points.

SkillCraft Benchmark Overview

In today's article, we won't discuss empty macro trends. Instead, we will directly dissect the Token bills, error stacks, and call logs derived from this research to reveal the underlying design logic of multi-agent collaboration in the era of Skills.

Article Structure Overview

SkillCraft Benchmark and Test Environment Construction

To ensure the data reliability of all subsequent architectural recommendations, we must first examine how the researchers constructed the test sandbox and evaluation criteria. This is the key to filtering out "scoring toys" with no engineering value and obtaining real industrial-grade data.

Systemic Flaws in Existing Toolchain Evaluations

Current tool-use benchmarks (such as WebArena, AgentCompany, etc.) typically fix the toolset during deployment and adopt a single-instance evaluation logic: testing whether an agent can use given tools to solve a single, one-off task. This single-shot testing exposes two core efficiency bottlenecks in long-cycle tasks:

  • Redundant State Transfer: Complex business logic is decomposed into a series of atomic operations. In this process, intermediate results (such as massive web DOM trees or lengthy JSON API responses) are repeatedly serialized between consecutive tool calls and forcibly injected into the model's context, generating extremely high Token I/O overhead.

  • Context Window Saturation: Long sequences of tool calls and their massive return results occupy a large amount of context capacity, causing the model to severely lose early information in the later stages of execution, or even completely deviate from the original system instructions.

Task Pool Construction and Multi-dimensional Expansion Logic

The SkillCraft benchmark constructed by researchers deliberately embeds repetitive sub-structures within a single task, forcing the agent to identify and reuse tool combinations multiple times within a fixed resource budget. The construction process is completed through a rigorous three-stage pipeline:

Task Construction Pipeline
  • Stage One and Stage Two: Extract task design principles from existing benchmarks and construct 21 seed tasks based on real public Web APIs (such as GitLab, Open-Meteo, TVMaze, etc.) and local datasets.

  • Stage Three (Multi-dimensional Expansion): Researchers expanded the seed tasks along two orthogonal axes: Quantity Expansion (e.g., increasing from "analyze 1 repository" to "analyze 5 repositories") and Complexity Expansion (increasing the number of underlying API calls required for each sub-entity).

Ultimately, SkillCraft established extremely standardized test cases:

Standardized Test Cases

It comprises 126 long-cycle tasks spanning 6 major domains (Entertainment, Reference, Education, Developer, Science, Food). Test difficulty is quantified into three absolute tiers: Easy (3 entities, 3 API calls per entity, 9 calls total), Medium (4 entities, 4 API calls per entity, 16 calls total), and Hard (5 entities, 5 API calls per entity, 25 complex calls total).

Underlying Infrastructure: Skill Mode Protocol Stack and Security Verification

To empower the model with the ability to construct tool combinations during testing, the system maintains a local skill_cache.json file in the agent's workspace. The agent is strictly limited to interacting with it only through the following four MCP (Model Context Protocol) primitives:

MCP Primitives
  • save_skill: Saves a successful workflow as an executable code macro. Must pass in macro_name (unique identifier), script_code (Python script code), parameters (list of variable names), and description (summary).

  • execute_skill: Runs a saved skill. Requires passing in macro_name and a specific parameter dictionary args; the system returns a status flag and an execution result dictionary.

  • list_skills: A parameter-less call that lists all available skills in the current session for the model to check before making decisions.

  • get_skill: Retrieves the full source code and parameter signature of a target skill, used for debugging and verification in complex scenarios.
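The four primitives can be sketched with a minimal in-memory stand-in. Only the primitive names and their parameter lists come from the paper; the dictionary-backed store and `exec`-based dispatch below are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal in-memory stand-in for the skill_cache.json store. The four
# primitive names (save_skill, execute_skill, list_skills, get_skill) come
# from the benchmark; everything else here is an illustrative assumption.
_SKILL_CACHE = {}

def save_skill(macro_name, script_code, parameters, description):
    """Store an executable code macro under a unique identifier."""
    _SKILL_CACHE[macro_name] = {
        "script_code": script_code,
        "parameters": parameters,
        "description": description,
    }
    return {"status": "saved", "macro_name": macro_name}

def execute_skill(macro_name, args):
    """Run a saved skill; return a status flag plus a result dictionary."""
    skill = _SKILL_CACHE.get(macro_name)
    if skill is None:
        return {"status": "error", "result": {"error": "unknown skill"}}
    scope = dict(args)
    exec(skill["script_code"], scope)   # skill scripts set a `result` dict
    return {"status": "ok", "result": scope.get("result", {})}

def list_skills():
    """Parameter-less listing of all skills cached in the current session."""
    return [{"macro_name": name, "description": s["description"]}
            for name, s in _SKILL_CACHE.items()]

def get_skill(macro_name):
    """Return full source and parameter signature for debugging."""
    skill = _SKILL_CACHE[macro_name]
    return {"macro_name": macro_name,
            "parameters": skill["parameters"],
            "script_code": skill["script_code"]}
```

Under this interface, a five-entity Hard task reduces to saving one macro and then looping `execute_skill` five times with different `args` dictionaries, which is exactly the reuse pattern the benchmark rewards.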

To ensure that the code generated by the agent does not trigger system-level disasters, researchers deployed a Coding Verifier with a three-stage defense mechanism:

Coding Verifier Mechanism
  • Syntax Verification: Before save_skill is written, the underlying system performs AST parsing to intercept basic syntax errors and return specific error line numbers and code snippets to the model.

  • Runtime Error Reporting: When execute_skill crashes, the sandbox intercepts system exceptions and returns structured Tracebacks and input parameters to the model, assisting it in locating parameter binding errors.

  • Post-execution Quality Check: To prevent silent failures, the system strictly validates the output dictionary. If more than 50% of the field content is Unknown, None, or 0, the script is directly rejected from storage.
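The three verifier stages map naturally onto small helpers. This is a hedged sketch: the function names are invented, and only the AST parse, the structured traceback, and the "more than 50% empty fields" rejection rule come from the text.

```python
import ast
import traceback

def check_syntax(script_code):
    """Stage 1: AST-parse before save_skill; report the offending line."""
    try:
        ast.parse(script_code)
        return {"ok": True}
    except SyntaxError as e:
        return {"ok": False, "line": e.lineno, "snippet": e.text}

def run_with_traceback(fn, args):
    """Stage 2: intercept crashes; return a structured traceback and inputs."""
    try:
        return {"ok": True, "result": fn(**args)}
    except Exception:
        return {"ok": False, "traceback": traceback.format_exc(), "args": args}

def quality_check(output):
    """Stage 3: reject silent failures where >50% of fields are empty-ish."""
    if not output:
        return False
    bad = sum(1 for v in output.values() if v in ("Unknown", None, 0))
    return bad / len(output) <= 0.5
```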

Regarding experimental boundaries, researchers imposed strict constraints: each task is hard-limited to a maximum of 150 dialogue turns and a timeout of 60 minutes; globally, the maximum cumulative consumption is 1M input Tokens and 150K output Tokens; model sampling parameters are completely locked at temperature=0.0 and top_p=1.0 to guarantee output determinism. On the scoring end, file generation, JSON structure validity, data integrity, and field-level accuracy (total score over 90%) must all be satisfied simultaneously to be counted as a success.
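The all-must-pass scoring rule can be written as a single gate. The helper name, path argument, and integrity check below are assumptions; only the "every criterion must hold simultaneously" logic and the 90% field-accuracy threshold come from the benchmark description.

```python
import json
import os

# Illustrative all-or-nothing scoring gate; names are assumptions, the 90%
# threshold and the four criteria come from the benchmark description.
def task_succeeded(output_path, expected_fields, field_accuracy):
    if not os.path.exists(output_path):              # a file must be generated
        return False
    try:
        with open(output_path) as f:
            data = json.load(f)                      # JSON must be valid
    except (ValueError, OSError):
        return False
    if not all(field in data for field in expected_fields):  # data integrity
        return False
    return field_accuracy > 0.90                     # field-level accuracy gate
```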

Core Conclusion: Use Strong Models Instead of Weak Ones

Having established that the test environment contains no watered-down data, let's turn to the question that matters most in practice: to save costs, should we make multiple attempts with cheap weak models, or hit the mark in one shot with an expensive strong model? The data answers clearly: the latter.

Exponential Collapse of Token Consumption

The experiment compared the Baseline mode (with skill library interfaces disabled) and Skill Mode (with skill libraries enabled). The data clearly shows that in complex long-cycle calls, strong models can greatly offset the cost disadvantage brought by their higher pricing by writing code to implement internal logic flow:

Token Consumption Comparison Chart
Cost Reduction Chart
  • GPT-5.2: Average Token consumption dropped drastically from 1.23M to 0.26M (a reduction of 79%), and the average API cost per task plunged from $1.77 directly to $0.43 (saving 75%).

  • Claude 4.5 Sonnet: The baseline success rate was already as high as 96%, maintaining a high level of 94% in Skill Mode. Its Token consumption dropped from 1.36M to 0.40M (a reduction of 71%), and the number of pure underlying tool calls decreased from 14.3 to 9.2.

  • DeepSeek-V3.2-EXP: Token consumption decreased by 49% (from 1.04M to 0.53M), effectively halving the cost.

Although agents consume a small number of decision turns when querying, verifying, and caching skills (for example, Gemini 3 Pro's average interaction turns increased by 13%), since the code engine directly takes over the huge workload of data cleaning and transit, avoiding the injection of invalid loads into the Prompt, the system's overall Token consumption still presents a cliff-like drop.

Correlation Analysis: Coding Ability IS Skill Ability

Researchers revealed the underlying laws of system operation through cross-metric correlation analysis. The skill execution success rate and the final task success rate show a strong positive correlation (r=0.65), and efficiency gains are positively correlated with the model's baseline ability (r=0.53).

Correlation Heatmap

This explains the engineering Achilles' heel of open-source weak models in such architectures: when a model's success rate on baseline Hard tasks is below 60%, its code generation ability is equally weak. When encapsulating parameterized interfaces, weak models frequently produce inferior code containing syntax errors or logical deadlocks. Subsequently, the system intercepts the errors, and the model is forced into an infinite "debug-rewrite" death loop. In this process, the framework introduced to save computing power ends up burning a large amount of Tokens to fix code.

SkillCraft Conclusion 1: When building a basic Skill library, avoid using weak models to try their luck by stacking Agent quantities. The single-model code generation accuracy of strong models has an absolute overwhelming advantage in terms of total system cost.

Multi-Agent Cross-Model Collaboration: Creators Greater than Executors

In multi-agent collaboration frameworks like Openclaw, architects usually need to decompose and distribute tasks. How should the "main model" and "edge models" in the system allocate functions? The cross-model test data from the paper provides extremely clear engineering guidelines.

Cross-Difficulty Migration Test (Cross-task Generalization)

When verifying skill generalization, researchers conducted rigorous static migration tests (i.e., directly using skill code generated from simple tasks in complex tasks, prohibiting the model from modifying the code during the process). The data confirms that high-quality parameterized code possesses strong cross-level compatibility: skills extracted by Claude 4.5 and Gemini 3 Pro in Easy-level tasks were seamlessly migrated to Hard-level tasks, maintaining an extremely high execution success rate of 97%-100%. Claude's Easy->Hard migration pulled the success rate from a baseline of 95% up to 100%, while compressing Tokens from 1.92M down to 1.56M.

Analysis of Cross-Model Execution Heatmaps

Researchers designed a set of extremely cruel 16-combination cross-tests: letting Claude, Gemini, GLM, and Minimax create skill libraries in 8 Hard-level tasks respectively, and then cross-distributing these pure Python scripts to all models for pure execution (modification permissions disabled).

Cross-Model Execution Heatmap

The data contrast in these two heatmaps is extremely intense! Especially the first one, where code written by Claude achieved a 100% success rate when executed on all models.

  • Absolute Downward Compatibility of High-Quality Code: The code logic created by Claude is rigorous, and type checking is complete. When these scripts were distributed to relatively lighter models like Minimax, GLM, or Gemini for execution, the entire series achieved a perfect pass rate of 100%, bringing huge Token savings of 54% to 81% for all executor-side models.

  • Backfire Effect of Low-Quality Code Execution: In contrast, code generated by Minimax, due to chaotic internal state transitions and flaws in parameter interface design, frequently triggered underlying errors even when handed to Claude for execution. This caused the retry mechanism on the execution side to be frequently awakened; computing costs did not decrease but instead fluctuated between an increase of 48% and a slight reduction of 18%, completely destroying the framework's original intention of improving efficiency.

  • Compensation for Executor Compute Limitations: An interesting detail in the data is that when Claude forcibly executed the flawed skill library generated by Gemini, it achieved a Token saving of 69.2%; whereas when Gemini executed its own skill library, it only saved 14.8%. This indicates that the intent understanding and parameter completion capabilities of top-tier large models can, to some extent, "cover" defects in medium-quality code, but this is by no means a standard operation.

SkillCraft Conclusion 2: Therefore, when designing Multi-Agent systems with cloud-edge collaboration like Openclaw, it is best to follow the principle of "Creator > Executor." Let the top-tier hundreds-of-billions-parameter models in the cloud act as "Skill Compilers," responsible for extracting, verifying, and encapsulating high-quality business scripts on a small number of complex samples; edge-side or low-cost privatized models are only authorized to act as "Execution Engines," receiving parameters and running cached scripts. Absolutely prohibit weak models from submitting code to the public skill library.
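The principle can be made concrete with a toy registry that separates publish rights from execute rights. The role names and the PermissionError gate are assumptions for illustration, not part of the benchmark's protocol.

```python
# Toy "Creator > Executor" registry: only the creator role may publish
# skills; executor-side models can only run cached scripts. Role names and
# the permission gate are illustrative assumptions.
class SkillRegistry:
    def __init__(self):
        self._skills = {}

    def save(self, name, script, role):
        if role != "creator":            # only the strong cloud-side model
            raise PermissionError("executors may not publish skills")
        self._skills[name] = script

    def execute(self, name, args):
        scope = dict(args)
        exec(self._skills[name], scope)  # cached script runs verbatim
        return scope.get("result")

registry = SkillRegistry()
registry.save("greet", "result = f'hello {who}'", role="creator")
```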

Pitfall Guide: Deep Nesting and the Cognitive Tax of Abstraction

When designing system architectures, engineers often carry an "object-oriented" obsession with modularizing code and layering calls. The micro-test data in the paper thoroughly punctures the fragility of current LLM capabilities in the face of deep code structures.

Crash Mechanism of Hierarchical Mode

Researchers introduced an Iteration Mode that allows skills to call other skills (with a maximum nesting depth set to 10). Theoretically, high-level skills are responsible for orchestration, and low-level skills are responsible for atomic operations, which seems perfect.

Hierarchical Mode Diagram

However, real execution logs (Log Trajectories) show a despairing process of error propagation.

In a task to generate a dog breed encyclopedia, the underlying data-fetching skill get_breed_profile encountered a rare breed for which the API's raw response lacked the temperament field, and so returned a dictionary containing null. This latent hazard was then passed upward.

When the intermediate-level aggregation skill attempted the string operation profile.temperament.split(','), the sandbox threw a fatal AttributeError: 'NoneType' object has no attribute 'split'. This single type error instantly penetrated the entire call tree, causing the top-level assembly task to crash and fail completely.
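The failure is easy to reproduce in miniature. Only the skill name get_breed_profile and the temperament field come from the log; the data shapes are assumptions, and note that in Python a .split() on None surfaces as an AttributeError.

```python
from types import SimpleNamespace

# Minimal reproduction of the logged crash: the low-level skill returns a
# profile whose `temperament` field is None for a rare breed, and the
# mid-level aggregation skill calls .split() on it without a guard.
# Data shapes are assumptions; only the field/skill names come from the log.
def get_breed_profile(breed):
    raw = {"temperament": None}       # rare breed: field absent in API reply
    return SimpleNamespace(**raw)

def aggregate(breed):
    profile = get_breed_profile(breed)
    return profile.temperament.split(",")   # None has no attribute 'split'

try:
    aggregate("xoloitzcuintli")
    crashed = False
except AttributeError:                # error propagates up the call tree
    crashed = True
```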

The actual test data was brutal: after switching to nested mode, even the strong GPT-5.2 saw its task success rate drop from 90% in flat mode to 79%, while Token overhead rebounded from 0.26M to 0.60M. The reason is that nested architectures place demands on edge-case handling and dynamic type checking that exceed current LLM capability limits; the cost of tracing a fault through the call tree far exceeds executing the atomic operation from scratch.

SkillCraft Conclusion 3: When building Agent tool call trees in industrial environments, it is best to abandon complex mutual calls between skills. Keeping the skill architecture flat and ensuring that individual scripts contain complete try-except error handling is the only solution to guarantee stability in production environments.
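A hedged sketch of the flat, defensive style this conclusion recommends: one self-contained script, no skill-to-skill calls, a guard on the nullable field, and a complete try-except that degrades gracefully instead of crashing the whole task. The function and field names are illustrative.

```python
# Flat, self-contained skill in the style Conclusion 3 recommends: guards
# the nullable field and wraps everything in try-except so a bad record
# degrades to a safe default instead of killing the task. Names illustrative.
def breed_summary(profile):
    try:
        temperament = profile.get("temperament") or ""   # guard against null
        traits = [t.strip() for t in temperament.split(",") if t.strip()]
        return {"traits": traits, "status": "ok"}
    except Exception as e:
        return {"traits": [], "status": f"degraded: {e}"}
```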

The Cognitive Compute Tax of Generalization and the Wisdom of "No Abstraction"

Facing all business flows, is encapsulating everything into general interfaces always the most efficient? The answer is again negative. Generalization itself requires paying a high "cognitive compute tax."

Researchers tested the Direct Exec mode, which gives up reuse and parameter interfaces and directly hard-codes values in single-use scripts. In 48 task subsets, the data contrast was extremely extreme:

Direct Exec vs Skill Mode Comparison
  • Claude 4.5 Sonnet, in Direct Exec mode, squeezed Token consumption from a baseline of 1.72M down to an extremely exaggerated 0.16M, with dialogue interactions requiring only 5.8 turns; whereas to account for future reuse, the standard Skill mode, due to the need to design general parameter placeholders and execute the complete "save first, then execute" process, had an overhead of 0.34M Tokens and 10.5 interaction turns.
  • GPT-5.2's performance was equally extreme, with Direct Exec compressing single-task Token burning to 0.06M and interaction turns plummeting to 4.5.
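The trade-off can be sketched as follows; all names and data here are hypothetical. The point is amortization: the parameterized skill only recoups its interface-design overhead when it is reused across entities, while a one-off task is cheapest hard-coded.

```python
# Contrast sketch (hypothetical names and data): Direct Exec bakes values
# into a throwaway script; Skill mode pays an up-front abstraction cost to
# design a parameter interface and cache the macro for reuse.

# Direct Exec: one-off, values hard-coded, nothing cached.
def direct_exec():
    return {"breed": "siamese", "facts": 3}

# Skill mode: generic placeholders plus a save-then-execute round trip.
SKILL_CACHE = {}

def save_skill(name, fn):
    SKILL_CACHE[name] = fn

def fetch_breed(breed, n_facts):          # parameterized, reusable macro
    return {"breed": breed, "facts": n_facts}

save_skill("fetch_breed", fetch_breed)
reused = [SKILL_CACHE["fetch_breed"](b, 3)
          for b in ("siamese", "persian", "manx")]
```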

By reviewing the original call trajectories, we can clearly see the difference in engineering intuition between models of different levels.

Model Intuition Comparison

When handling lightweight tasks that only require obtaining 9 pieces of simple data for three types of cats, Claude keenly judged that the abstraction overhead of "designing a general interface and caching code" far exceeded the benefits of directly calling the bare API. It decisively abandoned calling save_skill and honestly completed 34 steps of atomic calls, consuming about 760k Tokens to successfully finish the job. In contrast, when handling high-complexity tasks requiring the analysis of five cocktails with a total of 25 calls, Claude used only 1 attempt to accurately write perfect macro code process_cocktail_complete containing 5 data flow paths, then continuously looped and called it 5 times, totaling only 8 interaction steps and an extremely low 210k Tokens to perfectly wrap up.

Execution Path Comparison

Conversely, DeepSeek-V3.2 was completely hijacked by the system prompts during execution, falling into "instruction overfitting." In the simple cat task, it reinvented the wheel by generating a skill named process_cat_breed, but the code quality was poor and the output dictionary omitted the breed_facts field outright. This left large gaps in the data after execution, forcing a long repair phase of rescue measures, including 8 manual chunk-write operations, and ultimately burning an absurd 1.5M+ Tokens on a simple task. In the complex cocktail task, DeepSeek attempted to write the script three times in a row (v1, v2, v3), failing on basic syntax errors such as unexpected token '}' at line 8 and 'return' is invalid outside function, and finally had to downgrade to fully manual calls, ending in task failure amid timeouts and context confusion after consuming 1.14M Tokens with zero output.

SkillCraft Conclusion 4: The highest-dimension Agent framework design not only provides a code execution sandbox but must also grant large models the freedom and judgment of "when not to use the framework." Knowing how to identify low-frequency isolated tasks and choose to hard-code one-time scripts, avoiding hallucinations that may be triggered by handling parameter binding, is key to extreme cost reduction in the system.

Conclusion

The detailed data from "SkillCraft" has thoroughly changed our underlying perspective on evaluating and designing Agent architectures. In complex enterprise-level business flows, you need to shift your vision from purely chasing Accuracy in benchmark tests to extreme computing power cost control.

The real commercial moat for Agents truly landing lies in their ability to use code as an intermediate medium to achieve "Token Compression." Please remember these architectural rules bought with real money: use the strongest large models as code compilers, and use edge-side small models as executors; restrain code depth and stick to flat structures; and when designing frameworks, always leave a bypass channel that allows models to give up abstraction and execute directly and violently. Only in this way can your agent system truly balance engineering robustness with commercial feasibility.

The future has arrived; for those destined, let us walk together!
