Summary: IBM Research introduces Abstract Chain-of-Thought (Abstract-CoT), which replaces lengthy natural-language reasoning chains with a short sequence of dedicated "abstract reasoning tokens." On MATH-500, reasoning tokens drop from 1,671 to 144 (an 11.6× compression) while accuracy reaches 90.8%, within 1.8 points of the 92.6% scored by the full CoT+RL baseline. The method is not an approximation of text reasoning but a way of letting models "think in their own language."
The Problem
The reasoning capabilities of large language models (LLMs) hinge on Chain-of-Thought (CoT), but CoT requires generating long runs of natural-language steps at inference time: GPT-4o can easily produce over 1,500 tokens to solve a single MATH problem. As models scale up, these token counts continue to grow.
Earlier attempts to replace text-based reasoning with more compact alternatives, such as continuous representations or Pause Tokens, have consistently lagged behind explicit CoT. The core conflict is this: discrete tokens are well suited to reinforcement learning optimization, but natural-language tokens are highly redundant, which makes reasoning in them extremely inefficient.
The Core Method
Abstract-CoT's approach is to reserve a block within the vocabulary (64 dedicated tokens) and train the model to use this abstract symbol system for intermediate reasoning, before generating the final answer in natural language.
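As a rough illustration of the idea, the sketch below reserves a 64-token block at the end of a toy vocabulary and restricts token selection to that block during the reasoning phase. The token names (`<abs_0>` ...), the helper functions, and the toy vocabulary are all hypothetical; the article does not describe the actual implementation, so treat this as a sketch under those assumptions rather than the authors' code.

```python
import math

NUM_ABSTRACT_TOKENS = 64  # block size stated in the article


def reserve_abstract_block(vocab):
    """Return a copy of `vocab` with 64 extra tokens appended, plus their IDs.

    Assumes the existing IDs are contiguous and start at 0 (toy setting).
    """
    new_vocab = dict(vocab)
    start = len(new_vocab)
    abstract_ids = []
    for i in range(NUM_ABSTRACT_TOKENS):
        new_vocab[f"<abs_{i}>"] = start + i  # hypothetical naming scheme
        abstract_ids.append(start + i)
    return new_vocab, abstract_ids


def constrain_logits(logits, allowed_ids):
    """Set every logit outside `allowed_ids` to -inf so it can never be picked."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_pick(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy example: a natural-language token has the highest raw score,
# but during the reasoning phase only abstract tokens are allowed.
base_vocab = {"<pad>": 0, "the": 1, "answer": 2, "is": 3, "42": 4}
vocab, abstract_ids = reserve_abstract_block(base_vocab)

logits = [0.0] * len(vocab)
logits[vocab["answer"]] = 9.0     # would win without the constraint
logits[abstract_ids[5]] = 1.5     # best-scoring abstract token

unconstrained_choice = greedy_pick(logits)
constrained_choice = greedy_pick(constrain_logits(logits, abstract_ids))
```

In a real model the same mask would be applied to the logits at every step of the reasoning phase and lifted once the model switches to writing the final answer in natural language.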
The training process involves three steps:
Step 1: Bottleneck SFT
Natural language CoT → Masked compression → Abstract token sequence (SFT alignment)
Step 2: Self-Distillation
Directly generate abstract tokens from the prompt only (using constrained decoding)
Step 3: RL Fine-tuning
GRPO reinforcement learning + Constrained decoding → Maximize reward
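For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative advantage computation in isolation. The binary correctness reward and the use of the population standard deviation are assumptions made for illustration; the article does not specify the reward design or these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive learning signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every sample gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Because the reward depends only on the final answer, the abstract tokens themselves need no human-readable supervision at this stage.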
The key insight: Natural language CoT is "human-readable," but the model doesn't actually need it internally. A compact set of abstract symbols is entirely sufficient.
Key Performance Comparison
| Method | MATH-500 Accuracy (%) | Reasoning Tokens | Compression Ratio |
|---|---|---|---|
| SFT + RL (Full CoT) | 92.6 | 1671 | — |
| Abstract-CoT (Warm-up + RL) | 90.8 | 144 | 11.6× |
| Pause Token | 78.6 | 142 | 11.8× |
| Stepwise Internalization | 88.6 | 169 | 9.9× |
Three points stand out:
Pause Tokens use a similarly small budget (142 tokens) but score about 12 points lower (78.6 vs. 90.8), which suggests that the quality of the abstraction, not the token count, is what matters.
The result holds when scaled up to Qwen3-32B: the model reaches 94.6 on MATH-500 and 65.6 on AlpacaEval (surpassing full CoT), with roughly 11× token compression.
Abstract-CoT is more robust to truncation: when the reasoning is cut off, a traditional CoT model loses 11.8 points, whereas Abstract-CoT loses only 6.
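The compression ratios in the table are simply the full-CoT token count divided by each method's token count. The short sketch below recomputes them from the reported numbers; the dictionary layout and function name are illustrative only, and because the values are rounded to one decimal, a table entry may differ in the last digit if it was truncated instead.

```python
BASELINE_TOKENS = 1671  # SFT + RL (Full CoT), from the table

# Reasoning-token counts reported in the table.
reasoning_tokens = {
    "Abstract-CoT (Warm-up + RL)": 144,
    "Pause Token": 142,
    "Stepwise Internalization": 169,
}


def compression_ratio(baseline, compressed):
    """How many times shorter the compressed reasoning is than the baseline."""
    return baseline / compressed


ratios = {
    name: round(compression_ratio(BASELINE_TOKENS, count), 1)
    for name, count in reasoning_tokens.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```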
Fascinating Discovery: Self-Organizing Abstract Reasoning Language
After training, researchers discovered that the frequency of abstract token usage follows a power-law distribution—a few symbols are reused at high frequency, while the vast majority appear only occasionally. This is strikingly similar to Zipf's law in natural languages.
This implies that the model did not use the 64 tokens randomly; it spontaneously learned a structured reasoning language.
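One quick way to check for Zipf-like usage is to sort token counts by rank and fit a straight line to log(frequency) against log(rank): a slope near -1 is the classic Zipf signature, while a slope near 0 indicates roughly uniform use. The sketch below does this with plain least squares on synthetic counts. The counts are made up for illustration, not data from the paper, and a log-log fit is only a rough diagnostic, not a rigorous power-law test.

```python
import math


def rank_frequency_slope(counts):
    """Least-squares slope of log(frequency) versus log(rank).

    Zero counts are dropped; at least two nonzero counts are required.
    """
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic usage counts for 64 tokens (hypothetical, for illustration only).
zipf_like_counts = [10000 / rank for rank in range(1, 65)]  # frequency ~ 1/rank
uniform_counts = [100] * 64                                  # every token equally used

zipf_slope = rank_frequency_slope(zipf_like_counts)
uniform_slope = rank_frequency_slope(uniform_counts)
```

Applied to real abstract-token counts, a clearly negative slope would be consistent with the structured, non-random usage described above.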
Conclusion
The significance of Abstract-CoT goes beyond an 11× reduction in reasoning tokens. It points to something more fundamental: an LLM's reasoning ability is not intrinsically tied to natural language. Models can reason in a more compact symbol system, with natural language serving only as a "translation layer" for the final output.
As reasoning models (like o1, R1, and Qwen-thinking) are deployed at scale in production, the cost of reasoning tokens is becoming a core bottleneck. Abstract-CoT offers a clean solution that requires no changes to the model architecture and is ready to use after training.
Source: arXiv:2604.22709 | IBM Research AI | 2026-04-24