Skills-Driven Reasoning Paradigm: Tsinghua & Peking University Propose TRS, Saving Up to 59% of Tokens Without an Accuracy Drop

Hi everyone, I'm PaperAgent, not an Agent!

Current reasoning models (like OpenAI o1, DeepSeek-R1) achieve astonishing accuracy but often generate thousands or even tens of thousands of words of "thinking processes," causing inference costs and latency to skyrocket.

Image illustrating token consumption crisis

Proposed by Qiyuan Tech, Tsinghua University, Peking University, and others, the TRS (Thinking with Reasoning Skills) framework is a training-free, black-box-compatible method. It distills historical reasoning trajectories into reusable skill cards, then retrieves and injects them at inference time, achieving the counter-intuitive result of fewer tokens with equal or higher accuracy. On math and programming tasks, token consumption fell by 6%-59% while accuracy stayed essentially flat or improved.

1. The Token Inflation Crisis of Reasoning Models

Modern Large Reasoning Models (LRMs) have significantly improved reliability in math and code through explicit intermediate thinking (Chain-of-Thought), but they have also introduced a production-level bottleneck: test-time computation cost is directly proportional to the number of tokens.

Taking commercial API billing models as an example, output tokens are often more expensive than input tokens. When models face complex problems, they generate a large number of redundant verification, trial-and-error, and backtracking loops. Industry reports also confirm that reasoning-intensive workloads are significantly amplifying infrastructure pressure.

The image above shows the long, error-prone path that standard CoT takes through a single calculation.

Existing speed-up solutions (like Chain-of-Draft, TALE, NoWait) are essentially doing the same thing: making models "think shorter". However, forced compression of the thinking space often leads to an efficiency-accuracy trade-off—saving tokens on easy problems but completely failing on hard ones.

The Core Question: Can we prevent the model from "deriving from scratch" every time, and instead, like a human expert, enable it to directly call upon accumulated problem-solving experience?

2. Core Insight: From "Reasoning from Scratch" to "Recalling Skills"

Human experts rarely derive everything from first principles when solving problems. They rely on reusable skills abstracted from past practice (e.g., "look for invariants," "two-pointer technique," "chain rule"). TRS systematizes this cognitive model:

Offline: Distill long trajectories (including successful paths and lessons from failures) from the model solving historical problems into structured Skill Cards.
Online: When facing a new problem, retrieve the most relevant skill cards and inject them into the prompt, guiding the model down a "direct path."

Diagram comparing standard CoT and TRS approach

Standard CoT requires a token-expensive exploration of "integration by parts → trig substitution → trial and error" when solving an integral; whereas, after TRS retrieves the "chain rule + u-substitution" skill, it produces the solution in three direct steps, dramatically reducing token cost.

3. Method in Detail: The TRS Framework

3.1 Skill Card Design (Skill Card Schema)

Each skill card is a highly structured, compact text containing five fields (see the paper's Appendix A for details):

Field	Meaning
Trigger	Keywords that signal when the skill applies (e.g., "the integral form contains ...")
Do	Core operational steps (a minimal executable recipe)
Avoid	Anti-patterns / Common pitfalls
Check	Constraints or invariants that must be verified
Risk	Edge cases and failure modes
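
As a rough illustration of the five-field schema above, a skill card could be modeled as a small record that renders to compact text. This is a minimal sketch, not the paper's actual format: the class name `SkillCard`, the `render` method, and the example card content are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillCard:
    """One skill card with the five fields from the table (hypothetical layout)."""
    trigger: str  # when the skill applies
    do: str       # minimal executable recipe
    avoid: str    # anti-patterns and common pitfalls
    check: str    # constraints or invariants to verify
    risk: str     # edge cases and failure modes

    def render(self) -> str:
        """Serialize the card as compact text suitable for prompt injection."""
        fields = [
            ("Trigger", self.trigger),
            ("Do", self.do),
            ("Avoid", self.avoid),
            ("Check", self.check),
            ("Risk", self.risk),
        ]
        return "\n".join(f"{name}: {value}" for name, value in fields)


# Illustrative content only; not taken from the paper's skill library.
card = SkillCard(
    trigger="integrand is a composite function times the derivative of its inner function",
    do="substitute u = inner function, rewrite du, integrate, substitute back",
    avoid="starting with integration by parts before checking for a substitution",
    check="differentiate the result and compare with the integrand",
    risk="definite integrals need their bounds converted to u",
)
print(card.render())
```

Keeping each field to a single short line is one way to keep the injected text small, which matters because every card adds to the input token count.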

For correct solution trajectories, the card abstracts the successful pattern; for incorrect trajectories, the card abstracts an "anti-pattern → correction strategy." This "learn from failure" design is key to TRS improving accuracy on difficult problems.

3.2 Offline Skill Library Construction

For each source problem, run the reasoning model to obtain its reasoning trajectory and final result.
Use a stronger distillation model (like Gemini Flash) to compress the trajectory into a skill card and 10-20 retrieval keywords.
Store the pair in the skill library in key-value format: Key = Concat(problem, keywords), Value = skill card.
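
The three offline steps can be sketched roughly as follows. This is an assumption-laden outline rather than the authors' pipeline: `build_library`, `make_key`, and `toy_distill` are hypothetical names, and `toy_distill` merely stands in for the call to the distillation model, which is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

# A distiller maps (problem, trajectory, result) to (skill card text, keywords).
Distiller = Callable[[str, str, str], Tuple[str, List[str]]]


def make_key(problem: str, keywords: List[str]) -> str:
    """Key = Concat(problem, keywords); joining with spaces is an assumption."""
    return " ".join([problem, *keywords])


def build_library(records: List[Tuple[str, str, str]], distill: Distiller) -> Dict[str, str]:
    """Turn solved source problems into a key-value skill library."""
    library: Dict[str, str] = {}
    for problem, trajectory, result in records:
        card_text, keywords = distill(problem, trajectory, result)
        library[make_key(problem, keywords)] = card_text
    return library


def toy_distill(problem: str, trajectory: str, result: str) -> Tuple[str, List[str]]:
    """Hypothetical stand-in for the LLM distillation call.

    A real distiller would read the trajectory and emit 10-20 keywords;
    this toy version just picks the longer words of the problem statement.
    """
    keywords = sorted({w.strip(".,?").lower() for w in problem.split() if len(w) > 4})
    card = f"Do: reuse the approach that led to {result}"
    return card, keywords


records = [("Evaluate the integral of 2x cos(x^2) dx.", "Try u = x^2 ...", "sin(x^2) + C")]
library = build_library(records, toy_distill)
```

Because the key contains both the problem text and the keywords, a lexical retriever can match a new query against either one.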

The paper evaluates on DEEPMATH-103K (93K problems for building the library, 10K for testing) and NEMOTRON-COMPETITIVEPROGRAMMING-V1 (26.6K for building, 1K for testing).

3.3 Online Retrieval and Injection

When facing a new query:

Retrieval: Use BM25 (for math) or Hybrid (BM25 + Dense Embedding, for code) to retrieve the top-k skills.
Injection: Prepend the skill cards to the prompt (Figure 13 in the paper shows a standard template).
Lightweight Gating: The prompt includes an arbitration instruction—"Only use directly applicable skills; ignore irrelevant or contradictory suggestions."
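
To make the retrieve-inject-gate flow concrete, here is a small self-contained sketch using a plain-Python Okapi BM25 scorer over the library keys. The prompt layout in `build_prompt` is an assumption (the paper's own template is in its Figure 13 and is not reproduced here), as are the function names, the `k1` and `b` defaults, and the toy library entries.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

GATING = "Only use directly applicable skills; ignore irrelevant or contradictory suggestions."


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query: str, library: Dict[str, str], k: int = 1,
               k1: float = 1.5, b: float = 0.75) -> List[Tuple[str, float]]:
    """Rank library keys against the query with Okapi BM25; return (key, score) pairs."""
    if not library:
        return []
    keys = list(library)
    docs = [Counter(tokenize(key)) for key in keys]
    lengths = [sum(d.values()) for d in docs]
    avg_len = sum(lengths) / len(docs)
    doc_freq = Counter(term for d in docs for term in d)
    n = len(docs)
    query_terms = set(tokenize(query))
    scored = []
    for key, d, length in zip(keys, docs, lengths):
        score = 0.0
        for term in query_terms:
            tf = d.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scored.append((key, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(query: str, library: Dict[str, str], k: int = 1) -> str:
    """Prepend retrieved skill cards and the gating instruction (hypothetical layout)."""
    hits = bm25_top_k(query, library, k)
    cards = "\n\n".join(library[key] for key, _ in hits)
    return f"Skills:\n{cards}\n\n{GATING}\n\nProblem: {query}"


# Toy library; keys follow the Concat(problem, keywords) idea, values are card text.
library = {
    "Evaluate the integral of 2x cos(x^2) dx substitution chain rule":
        "Do: substitute u = x^2, integrate cos(u), substitute back",
    "Find two numbers in a sorted array that sum to a target two pointers":
        "Do: move two pointers inward from both ends",
}
query = "Compute the integral of x cos(x^2) dx"
prompt = build_prompt(query, library, k=1)
```

The gating sentence is part of the prompt itself, so the model rather than the retriever makes the final call on whether a retrieved card applies.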

Why can it reduce tokens? Although injecting skills increases input length, it eliminates redundant exploration branches, trial-and-error loops, and repetitive verification. Experiments show that the reduction in output tokens far outweighs the increase in input tokens, leading to an overall decrease in end-to-end cost and latency.
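
The trade described above can be checked with simple arithmetic. The sketch below assumes a linear per-token billing model with separate input and output prices; the function names and every number in the example are made up for illustration and are not results from the paper.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request under linear per-token pricing."""
    return input_tokens * input_price + output_tokens * output_price


def relative_cost_change(base_in: int, base_out: int, skill_tokens: int, new_out: int,
                         input_price: float, output_price: float) -> float:
    """Fractional cost change after injecting skills (negative means cheaper)."""
    base = request_cost(base_in, base_out, input_price, output_price)
    new = request_cost(base_in + skill_tokens, new_out, input_price, output_price)
    return (new - base) / base


# Hypothetical numbers: a 300-token skill card halves an 8,000-token output,
# with output tokens priced at 4x input tokens (arbitrary units).
change = relative_cost_change(
    base_in=200, base_out=8000, skill_tokens=300, new_out=4000,
    input_price=1.0, output_price=4.0,
)
print(f"{change:.1%}")
```

With these made-up numbers the request becomes roughly 49% cheaper. Injection only pays off when the output tokens saved, weighted by the output price, exceed the added skill tokens weighted by the input price.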

Diagram of the online retrieval and injection process

4. Main Experiments: Breaking the Efficiency-Accuracy Trade-off

4.1 Mathematical Reasoning (DeepMath-103K)

The table below shows TRS results across multiple models:

Table showing TRS performance on math tasks

Key Findings:

Doubao Seed halved its token count with almost no accuracy loss (-0.2%), reducing cost by 53.8%.
Weaker models like GPT-4o-mini, with TRS augmentation, saw accuracy improve by +1.8% and cost decrease by 6.9%.
GPT-OSS-120B maintained the same accuracy while reducing cost by 16.9%.

4.2 Competitive Programming (Code Competitions)

TRS also performed robustly on programming tasks:

GPT-4o-mini: Accuracy from 22.0% → 24.4% (+2.4%), cost ↓6.3%
Doubao Seed-2.0: Accuracy from 63.6% → 64.4% (+0.8%), cost ↓6.0%
GPT-OSS-120B: Accuracy from 54.2% → 58.3% (+4.1%). The longer prompt raised cost slightly (+4.8%), but the accuracy improvement was significant.

Scatter plot showing TRS advantage on code tasks

The scatter plot shows the overall advantage of TRS over Direct prompting in both token usage and accuracy on code tasks.

5. In-Depth Analysis: Why Does TRS Win?

5.1 Greater Advantage on Hard Problems: Comparison with TALE/CoD/NoWait

Existing speed-up methods (TALE's forced budget, CoD's minimalist draft, NoWait's suppression of reflection words) generally suffer from catastrophic collapse on difficult problems.

Line chart comparing TRS performance across difficulty levels

Slicing the test set by baseline thinking length (used as a difficulty threshold) reveals:

As the threshold increases (problems get harder), the accuracy of TALE and CoD drops sharply.
In the hardest interval, TRS on GPT-OSS boosted accuracy from ~45% to ~80% while compressing tokens from ~15k to ~7k.

Conclusion: Forcing "thinking shorter" cripples deep reasoning. TRS, by providing a navigation map (skill cards), prevents the model from getting lost in complex solution spaces, naturally eliminating the need for long trial-and-error trajectories.
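
The difficulty slicing used in this comparison can be sketched as below. It assumes that a problem counts as "hard" when its baseline thinking length is at or above the threshold; the exact slicing rule, the `Record` type, and the toy data are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Dict, List, NamedTuple, Optional


class Record(NamedTuple):
    baseline_tokens: int     # thinking length of the baseline run
    method_tokens: int       # thinking length with the method applied
    baseline_correct: bool
    method_correct: bool


def slice_stats(records: List[Record], threshold: int) -> Optional[Dict[str, float]]:
    """Accuracy and mean tokens on problems whose baseline length >= threshold."""
    hard = [r for r in records if r.baseline_tokens >= threshold]
    if not hard:
        return None
    n = len(hard)
    return {
        "n": n,
        "baseline_acc": sum(r.baseline_correct for r in hard) / n,
        "method_acc": sum(r.method_correct for r in hard) / n,
        "baseline_mean_tokens": sum(r.baseline_tokens for r in hard) / n,
        "method_mean_tokens": sum(r.method_tokens for r in hard) / n,
    }


# Toy records, not data from the paper.
records = [
    Record(1000, 900, True, True),
    Record(12000, 6000, False, True),
    Record(16000, 8000, True, True),
]
stats = slice_stats(records, threshold=10000)
```

Sweeping the threshold upward and plotting `method_acc` against `baseline_acc` for each method gives the kind of per-difficulty curve the chart describes.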

5.2 Controlled Experiment: It's Not Simple RAG

Ablation studies indicate that TRS's gains cannot be explained by simple retrieval:

Ablation study results comparing skill cards vs. raw trajectories

Only the combination of structured skill cards + sufficient coverage unleashes TRS's potential. This indicates that what the model needs is not "relevant context," but executable, procedural guidance.

5.3 Cross-Model Transfer: Strong Models for Distillation, Weak Models Benefit

Table showing cross-model skill transfer results

The cross-model transfer table shows:

Using a skill library generated by Doubao for OSS, or vice versa, brought positive gains in both cases.
The greatest benefit came from aligning model styles (e.g., Doubao using a Doubao-generated library).
Cross-source skills sometimes even led to more aggressive token reduction.

Engineering Significance: Enterprises can use strong models (like GPT-4/Gemini) to distill a skill library offline and provide it for lightweight models (like GPT-4o-mini/Doubao) to retrieve during deployment, creating a cost structure based on "master experience, apprentice execution."

5.4 Retrieval Strategy: BM25 for Math, Hybrid for Code

Comparing retrieval backends:

Bar chart comparing retrieval backend performance

For math problems, surface trigger words (formulas, theorem names) have high lexical overlap, so BM25 suffices. For code problems, surface descriptions vary greatly while the underlying algorithmic patterns are similar, so Dense Embedding is needed to capture semantics. The paper's defaults are BM25 (k=1) for math and Hybrid (k=5) for code.
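
One common way to combine a lexical ranking with a dense-embedding ranking is reciprocal rank fusion (RRF). The source does not say how the Hybrid backend merges its two signals, so the sketch below is a swapped-in, generic technique rather than the paper's method; the function name, the constant `c = 60`, and the skill identifiers are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 5, c: int = 60) -> List[str]:
    """Fuse several ranked lists of skill ids; return the top-k fused ids.

    Each list contributes 1 / (c + rank) per item, with rank starting at 1.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, skill_id in enumerate(ranking, start=1):
            scores[skill_id] += 1.0 / (c + rank)
    ordered = sorted(scores, key=lambda skill_id: (-scores[skill_id], skill_id))
    return ordered[:k]


# Hypothetical rankings from a BM25 retriever and a dense retriever.
bm25_ranking = ["two_pointers", "binary_search", "prefix_sums"]
dense_ranking = ["sliding_window", "two_pointers", "prefix_sums"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=5)
```

Skills that both retrievers rank reasonably well, such as `two_pointers` and `prefix_sums` here, rise above skills that only one retriever surfaces.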

5.5 External Competition Math Transfer: AoPS Skill Library

To validate cross-domain generalization, the authors distilled 7,616 skill cards from the AoPS (Art of Problem Solving) contest problem bank and tested them on AIME 2024/2025/2026 and HMMT 2025.

Table showing AoPS transfer results for various models

The table shows:

Out of 25 model-benchmark pairs, 13 saw accuracy improvement, and 20 saw cost reduction.
Doubao-1.8 saw an average accuracy improvement of +1.88% and a cost reduction of 2.8%.
Gemini-3-Flash improved accuracy but with a slight cost increase, indicating that skill injection for a strong model might trade increased input for output quality.

The benchmark-level averages in Table 6 show that the best transfer effect was for AIME 2024 I (+2.54%), while effects for the newer, tougher AIME 2026 tapered off. This indicates that the proximity of the skill library to the target domain remains a key factor.

Paper: Thinking with Reasoning Skills: Fewer Tokens, More Accuracy
https://arxiv.org/pdf/2604.21764
Code: https://github.com/stallone0000/Reasoning-Skill
Dataset: huggingface.co/datasets/stallone0000/Reasoning-Skill
Website: https://reasoning-skill.onrender.com