Abstract-CoT: Reasoning Tokens Slashed 11.6x, Chain-of-Thought Without Words Shatters LLM Efficiency Ceiling

Summary: IBM Research introduces Abstract Chain-of-Thought (Abstract-CoT), which replaces lengthy natural language reasoning chains with a small set of "abstract reasoning tokens." On MATH-500, reasoning tokens drop from 1,671 to just 144 (an 11.6× compression), while accuracy reaches 90.8%, within about two points of the full CoT+RL baseline (92.6%). Rather than approximating a natural language chain, the method trains models to "think in their own language."

The Problem

The reasoning capabilities of large language models (LLMs) hinge on Chain-of-Thought (CoT), but the cost is generating massive amounts of natural language steps during inference—GPT-4o can easily produce over 1,500 tokens to solve a single MATH problem. As models scale up, these numbers continue to balloon.

Attempts have been made to replace text-based reasoning with non-verbal alternatives, such as continuous latent representations or filler tokens like Pause Tokens, but their performance has consistently lagged behind explicit CoT. The core conflict is this: discrete tokens are well suited to reinforcement learning optimization, but natural language is redundant, so reasoning in it is token-inefficient.

The Core Method

Abstract-CoT's approach is to reserve a block within the vocabulary (64 dedicated tokens) and train the model to use this abstract symbol system for intermediate reasoning, before generating the final answer in natural language.
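
To make the idea of a reserved vocabulary block concrete, here is a minimal sketch. The base vocabulary size, the `<abs_i>` token names, and the helper `split_reasoning_and_answer` are all hypothetical choices for illustration; only the block size of 64 comes from the description above.

```python
# Toy illustration: the vocabulary size and token names are assumptions,
# not values taken from the paper.
BASE_VOCAB_SIZE = 32_000      # hypothetical size of the original vocabulary
NUM_ABSTRACT_TOKENS = 64      # size of the reserved block (from the article)

# Reserved IDs are placed directly after the base vocabulary.
ABSTRACT_IDS = range(BASE_VOCAB_SIZE, BASE_VOCAB_SIZE + NUM_ABSTRACT_TOKENS)
ABSTRACT_NAMES = {i: f"<abs_{i - BASE_VOCAB_SIZE}>" for i in ABSTRACT_IDS}


def split_reasoning_and_answer(token_ids):
    """Split a generated sequence into its leading abstract-token run and the rest."""
    cut = 0
    while cut < len(token_ids) and token_ids[cut] in ABSTRACT_IDS:
        cut += 1
    return token_ids[:cut], token_ids[cut:]


generated = [32_005, 32_005, 32_041, 17, 942, 8]   # made-up token IDs
reasoning, answer = split_reasoning_and_answer(generated)
print([ABSTRACT_NAMES[i] for i in reasoning])   # abstract reasoning span
print(answer)                                   # natural language answer IDs
```

In this layout the abstract span comes first and the natural language answer follows, so the reasoning cost is simply the length of the first list.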

The training process involves three steps:

Step 1: Bottleneck SFT
  Natural language CoT → Masked compression → Abstract token sequence (SFT alignment)

Step 2: Self-Distillation
  Directly generate abstract tokens from the prompt only (using constrained decoding)

Step 3: RL Fine-tuning
  GRPO reinforcement learning + Constrained decoding → Maximize reward
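
Steps 2 and 3 both rely on constrained decoding. One common way to implement this is to mask the logits so that only the reserved tokens can be chosen during the reasoning phase. The sketch below assumes greedy decoding, a dedicated end-of-reasoning token, and a toy 10-token vocabulary; none of these details are confirmed by the source, and `toy_logits` is a stand-in for a real model.

```python
NEG_INF = float("-inf")


def constrain_logits(logits, allowed_ids):
    """Return a copy of logits where every ID outside allowed_ids is set to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else NEG_INF for i, x in enumerate(logits)]


def constrained_greedy_decode(next_logits, abstract_ids, end_id, max_steps):
    """Greedy decoding restricted to abstract tokens plus one end-of-reasoning token.

    next_logits(prefix) is any callable returning a list of logits over the vocabulary.
    """
    allowed = list(abstract_ids) + [end_id]
    prefix = []
    for _ in range(max_steps):
        masked = constrain_logits(next_logits(prefix), allowed)
        token = max(range(len(masked)), key=masked.__getitem__)
        if token == end_id:
            break
        prefix.append(token)
    return prefix


def toy_logits(prefix):
    """Stand-in model: IDs 0-5 are ordinary words, 6-8 are abstract, 9 ends reasoning."""
    logits = [5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.1, 0.3, 0.2, 0.0]
    logits[9] = len(prefix) - 2.5   # the end token becomes more likely as the trace grows
    return logits


trace = constrained_greedy_decode(toy_logits, abstract_ids=range(6, 9), end_id=9, max_steps=16)
print(trace)
```

Although the ordinary word tokens have the highest raw logits in this toy model, the mask keeps them out of the reasoning trace, so every emitted ID falls inside the reserved block.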

The key insight: Natural language CoT is "human-readable," but the model doesn't actually need it internally. A compact set of abstract symbols is entirely sufficient.

Key Performance Comparison

Method	MATH-500 Accuracy (%)	Reasoning Tokens	Compression Ratio
SFT + RL (Full CoT)	92.6	1671	—
Abstract-CoT (Warm-up + RL)	90.8	144	11.6×
Pause Token	78.6	142	11.7×
Stepwise Internalization	88.6	169	9.9×
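
The compression ratios can be checked directly from the token counts: each ratio is the full-CoT token count divided by the method's token count. The short sketch below recomputes them, along with each method's accuracy gap to full CoT. Note that the rounded counts give roughly 11.8× for Pause Token rather than the 11.7× listed, which may reflect truncation or unrounded token counts in the original results.

```python
# Figures copied from the table above: (method, MATH-500 accuracy, reasoning tokens).
BASELINE = ("SFT + RL (Full CoT)", 92.6, 1671)
METHODS = [
    ("Abstract-CoT (Warm-up + RL)", 90.8, 144),
    ("Pause Token", 78.6, 142),
    ("Stepwise Internalization", 88.6, 169),
]


def compare(baseline, method):
    """Return (name, compression ratio, accuracy gap to the baseline), rounded to 0.1."""
    _, base_acc, base_tokens = baseline
    name, acc, tokens = method
    ratio = base_tokens / tokens
    gap = base_acc - acc
    return name, round(ratio, 1), round(gap, 1)


rows = [compare(BASELINE, m) for m in METHODS]
for name, ratio, gap in rows:
    print(f"{name}: {ratio}x fewer tokens, {gap} points below full CoT")
```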

Three points stand out:

Pause Tokens used a similarly low number of tokens, but their accuracy was about 12 points lower than Abstract-CoT's (78.6 vs. 90.8), which suggests that the quality of the abstraction, not the token count, is what matters.
The effect is even more stable when scaled up to Qwen3-32B: it achieved 94.6 on MATH-500 and 65.6 on AlpacaEval (surpassing full CoT), with an 11× token compression.
Stronger robustness to truncation: when the reasoning trace is cut off early, a traditional CoT model loses 11.8 points, whereas Abstract-CoT loses only 6.

Fascinating Discovery: Self-Organizing Abstract Reasoning Language

After training, researchers discovered that the frequency of abstract token usage follows a power-law distribution—a few symbols are reused at high frequency, while the vast majority appear only occasionally. This is strikingly similar to Zipf's law in natural languages.

This suggests that the model did not use the 64 tokens uniformly at random; instead, it appears to have learned a structured reasoning vocabulary on its own.
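
One simple way to test for this kind of distribution is to sort tokens by frequency and fit a line to log(frequency) against log(rank): a power law shows up as a roughly straight line, and a slope near -1 corresponds to classic Zipf behavior. The sketch below uses synthetic counts constructed to follow a 1/rank pattern; it is not data from the paper, and `rank_frequency_slope` is a hypothetical helper name.

```python
import math
from collections import Counter


def rank_frequency_slope(token_ids):
    """Least-squares slope of log(frequency) against log(rank).

    Requires at least two distinct tokens; otherwise the fit is undefined.
    """
    counts = sorted(Counter(token_ids).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic usage built so that frequency is roughly proportional to 1/rank.
synthetic = []
for rank in range(1, 65):
    synthetic.extend([rank] * (6400 // rank))

slope = rank_frequency_slope(synthetic)
print(round(slope, 2))   # close to -1 for Zipf-like usage
```

For comparison, perfectly uniform usage of all 64 tokens would give a slope of 0, so the fitted slope is a quick way to distinguish structured usage from random usage.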

Conclusion

The significance of Abstract-CoT goes beyond using 11 times fewer reasoning tokens. It points to something more fundamental: an LLM's reasoning ability is not intrinsically tied to natural language. Models can reason with a more compact symbol system, with natural language serving only as a "translation layer" for the final output.

As reasoning models (like o1, R1, and Qwen-thinking) are deployed at scale in production, the cost of reasoning tokens is becoming a core bottleneck. Abstract-CoT offers a clean and elegant solution—no changes to the model architecture required, ready to use after training.

Source: arXiv:2604.22709 | IBM Research AI | 2026-04-24