Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Keshav Ramji ∗ , Tahira Naseem & Ramón Fernandez Astudillo IBM Research AI
Abstract
While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Nonverbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen 'abstract' tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6× fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
1 Introduction
Large language models (LLMs) increasingly rely on long, explicit chains-of-thought (CoTs) to solve complex, multi-step reasoning problems. Despite its effectiveness, verbalized CoT (Wei et al., 2022; Kojima et al., 2022) is an expensive mechanism, increasing latency and cost at inference while bloating the length of traces during reinforcement learning (RL). Prior works also suggest that verbalized CoT can be unfaithful (Lanham et al., 2023; Turpin et al., 2023), with the model relying on a different latent reasoning process that is not communicated. These drawbacks have motivated approaches to compress or internalize natural language CoT with more efficient intermediate representations (Cheng & Durme, 2024; Deng et al., 2024). Simultaneously, approaches focusing on pause or filler tokens (Goyal et al., 2024; Pfau et al., 2024) suggest that adding such tokens facilitates deliberate internalized thinking through the model's activations. Furthermore, the findings of DeepSeek-R1-Zero (Guo et al., 2025) indicate that strong performance can be separable from human-readability, demonstrating gains even with language mixing in the CoTs. Recent works such as Coconut (Hao et al., 2025) have sought to enable reasoning mechanisms through continuous concept spaces, balancing efficiency and expressivity through principled methods for internalized recurrence.
In this work, we study a simple question: can we replace long verbalized rationales with a short sequence of discrete abstract tokens that functions as a latent scratchpad, while retaining the performance gains of CoT in response generation? We find that not only is this possible, but it can be achieved purely through post-training instruction-tuned models. We propose Abstract Chain-of-Thought (Abstract-CoT): instead of generating natural language reasoning, we induce the model to emit a bounded-length sequence of tokens from a reserved abstract vocabulary of distinguishable filler tokens. Abstract-CoT is designed to be token-efficient and non-verbal, producing short intermediate traces while offering an alternative to rationales generated in natural language.
[Figure 1 panels. Both panels pose the same question: "A car travels A-B at 60 km/h, rests 30 min at B, then returns at 80 km/h. The total trip takes 4 h. Find the distance from A to B." Verbalized Chain-of-Thought (left): Step 1: Let d be the A-B distance (km) ... Step 8: 7d/240 = 7/2, so d = 120 km. Answer: 120 km. Abstract Chain-of-Thought (right): BB AE F BA AC AD. Answer: 120 km.]
Figure 1: Verbalized vs. Abstract Chain-of-Thought. Verbalized CoT (left) generates an explicit natural language rationale (Step 1 through Step 8) inside thinking tags before producing the answer. Abstract CoT (right) instead emits a short sequence of tokens from the reserved abstract vocabulary inside its own delimiters.
However, adding previously unseen tokens creates a cold-start problem, as their embeddings are randomly initialized and meaningless initially. While these tokens appear semantically uninformative, our recipe aims to learn to produce a sequence of these tokens, inducing new pathways between a prompt and a response. To this end, we adopt a two-stage training recipe. The first stage is a policy iteration warm-up, alternating between verbal CoT guidance and direct on-policy generation of abstract token sequences. In the former, the final response only attends to the abstract tokens, not to the verbal CoT, forcing abstract token representations to learn useful information from the verbal CoT and thereby serve as an information bottleneck. We then perform self-distillation by discarding the verbal CoTs and training only with on-policy-generated abstract sequences with the learned representations, and repeat this process iteratively. In the second stage, we apply reinforcement learning with a generative reward model to induce exploration over abstract token sequences and refine the abstract generation policy.
∗ Correspondence to keshav.ramji@ibm.com
Our findings demonstrate substantial gains in token efficiency while matching or outperforming verbalized chain-of-thought. We summarize our contributions below:
- Abstract Chain-of-Thought: We propose Abstract-CoT, a mechanism for reasoning through a vocabulary of reserved tokens introduced entirely in LLM post-training.
- Warm-up via Policy Iteration: We warm up the embeddings of the reserved tokens by alternating bottlenecked SFT and self-distillation, yielding an abstract generator.
- Warm-started RL for Abstract Policies: We optimize generation of abstract traces using GRPO, with constrained decoding to the abstract vocabulary.
- Token Efficiency: Abstract-CoT reduces reasoning tokens up to 11.6× while matching verbalized CoT performance on MATH-500, AlpacaEval, and HotpotQA.
- Abstract Reasoning Language: We observe power-law dynamics over the abstract vocabulary, indicating that meaningful concepts and re-use patterns are learned.
2 Related Work
2.1 Filler Tokens
Works on filler tokens augment the token sequence with special tokens that are semantically uninformative (not human-readable natural language), but expand the model's effective computation in the forward pass. Goyal et al. (2024) introduce pause tokens.
2.2 CoT Compression, Distillation, and Discrete Codebooks
Compression, distillation, and partially removing the textual rationale (often through staged curricula) are some of the key mechanisms used to target the verbosity and cost of verbalized CoT. Early works such as Hsieh et al. (2023) demonstrated that explicit step-by-step rationales can be distilled to smaller models. Recent methods include seeking to directly shorten verbalized CoT via multi-round refinement (Yan et al., 2025) and learning to skip intermediate reasoning tokens in a controllable fashion while retaining generation quality (Xia et al., 2025). Approaches that compress parts of the rationale into a learned discrete or quantized representation are somewhat related to our discrete codebook. Su et al. (2025) combine latent tokens (learned via vector quantization) with remaining text tokens, inducing an efficiency-interpretability trade-off. Complementary works perform step-wise compression of CoT into latent tokens (Zhang et al., 2025a) as well as gradually internalizing explicit steps into implicit computation (Deng et al., 2024) in a curriculum fashion. By contrast, our abstract tokens are not a quantized reconstruction of a teacher rationale, but are entirely in a newly introduced reserved vocabulary, with the model trained to use it as a compact reasoning language under constrained decoding. This allows the model to potentially explore other reasoning pathways, rather than being constrained to that of the teacher CoT.
2.3 Continuous and Hybrid Latent Reasoning
Some recent approaches seek to replace parts of the textual rationale with continuous thought states. Coconut (Hao et al., 2025) replaces some CoT tokens with continuous latent vectors derived from hidden states and trains the language model with a curriculum that gradually increases the latent segment and replaces verbalized CoT segments. CODI (Shen et al., 2025) similarly compresses CoT into a continuous space, using self-distillation to align latent trajectories with those induced by explicit rationales. System-1.5 reasoning (Wang et al., 2025) introduces dynamic shortcuts, traversing between language and latent spaces while aiming to reduce unnecessary verbal reasoning and retain controllability. Related 'soft' thinking approaches propagate intermediate representations by feeding distributions over embeddings as subsequent inputs (Xu et al., 2025; Zhang et al., 2025b). Recent works such as Butt et al. (2025) study training and optimization stability when such soft tokens are treated as decision variables through RL. Hybrid methods such as HybridCoT (Shen et al., 2026) explicitly interleave latent and text tokens to balance efficiency with partial interpretability. In our work, we suggest that it is possible to achieve the efficiency gains associated with latent reasoning while operating fully in the discrete token space.
2.4 Reinforcement Learning for Budget Control
A complementary direction focuses on controlling inference-time cost by explicitly optimizing the reasoning budget, which is often operationalized as the length of intermediate reasoning traces. Recent work applies RL to learn when to expend additional reasoning steps as opposed to answering early; for example, by learning adaptive chain-of-thought triggering policies under compute constraints (Lou et al., 2025), or by pruning or shortening intermediate reasoning via training-time objectives that directly reward efficiency (Hou et al., 2026). Other recent approaches optimize a length-accuracy trade-off with RL objectives, by allocating token budgets dynamically (Kleinman et al., 2025) or by explicitly optimizing for consistency with user-specified length constraints (Aggarwal & Welleck, 2025). While our RL stage is most closely aligned with this line of work, it differs in the action space: instead of optimizing over free-form textual CoT length, we optimize over sequences constrained to a reserved discrete codebook. This enables control over the intermediate sequence while avoiding the brittleness of length control in open-ended natural language.
3 Latent Reasoning with Abstract Chain-of-Thought
3.1 Problem Setup and Notation
Let $x$, $c$, and $y$ denote a prompt, gold verbal chain-of-thought (CoT), and the target answer, respectively. We assume training data $\mathcal{D} = \{(x_i, c_i, y_i)\}_{i=1}^{N}$, where $c_i$ is only available during the first phase of warm-up. Let $\pi_\theta$ be a causal decoder-only LLM with parameters $\theta$ and base vocabulary $\mathcal{V}$. We extend the tokenizer with a set of $M$ previously unseen (reserved) tokens in the abstract codebook $\mathcal{V}_{\text{abs}}$, along with two delimiters.
Thus, an abstract chain-of-thought is a token sequence $z = (z_1, \dots, z_m) \in \mathcal{V}_{\text{abs}}^{m}$, enclosed between the two delimiters.
The abstract sequence has length $m \le m_{\max}$; at inference time, the model receives $x$ and must generate $\tilde{z}$ and $y$ without access to $c$. Let $Z$ denote the positions of the full abstract sequence $\tilde{z}$ (including the delimiters).
We view the abstract trace $z$ as a discrete latent variable, mediating reasoning. Ideally, we would like to maximize the marginal likelihood:
$$\log p_\theta(y \mid x) = \log \sum_{z} \pi_\theta(z \mid x)\, \pi_\theta(y \mid x, z)$$
where the sum ranges over abstract sequences of length $\le m_{\max}$, but the sum over discrete traces is intractable. Therefore, Abstract Chain-of-Thought uses a bootstrapping procedure that alternates (i) proposing an abstract trace $z \in \mathcal{V}_{\text{abs}}^{*}$ with verbal CoT guidance, and (ii) updating the model given the generated trace, followed by distillation to learn to directly propose traces from $x$ alone.
3.2 Warm-Up via Policy Iteration
The abstract tokens start with randomly initialized embeddings, so the model initially cannot exploit the bottleneck in the absence of a prior that enforces specific concept mappings.
1 We use alphabetical names for the abstract tokens (A, ..., Z); for M > 26, we continue with two-letter identifiers (AA-ZZ), which may be similarly extended for larger abstract vocabularies.
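The naming scheme in the footnote can be illustrated with a short sketch. The helper name `abstract_token_names` is hypothetical, and the bare-letter spelling is an assumption: the exact surface form of the tokens (for example, any surrounding brackets) is not specified here.

```python
import string
from itertools import product


def abstract_token_names(m):
    """Return m token names: A..Z first, then AA..ZZ (assumed ordering)."""
    letters = string.ascii_uppercase
    names = list(letters)                                       # A, B, ..., Z
    names += ["".join(p) for p in product(letters, repeat=2)]   # AA, AB, ..., ZZ
    if m > len(names):
        raise ValueError("extend the scheme with longer identifiers")
    return names[:m]


names = abstract_token_names(64)
print(names[:3], names[25:29])
```

Longer identifiers could be appended in the same way for vocabularies beyond 26 + 26² tokens.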

[Figure 2 diagram. Panel "Policy Iteration Warm-up Loop": Bottlenecked SFT over [Input], [Verbal CoT], [Abstract CoT], [Response] with a block-structured attention mask, and Self-Distillation over [Input], [Abstract CoT], [Response] with on-policy Abstract CoT generation under constrained decoding. Panel "Warm-Started Reinforcement Learning": GRPO with constrained decoding, where the policy produces k rollouts of [Abstract CoT] and [Response] that are scored by a reward against the gold response.]
Figure 2: Abstract Chain-of-Thought: The training recipe consists of two stages: (i) a warm-up loop, consisting of a Bottlenecked SFT phase with guidance from a teacher Verbal CoT, and a Self-Distillation phase with on-policy abstract sequence generation, repeated iteratively, and (ii) reinforcement learning using GRPO with constrained decoding for the rollouts, which rewards abstract sequences that lead to a high-quality response.
Therefore, we perform an abstract embedding warm-up with a policy iteration loop over iterations $t = 1, \dots, T$; each iteration produces a dataset of abstract trajectories $\tilde{z}^{(t)}$ and updates $\theta$ via SFT. The training dataset $\mathcal{D}$ is staged over the iterations: $\mathcal{D} = \bigcup_{t=1}^{T} \{(\mathcal{D}_{t,1}, \mathcal{D}_{t,2})\}$.
Constrained Decoding. We use $\pi^{\text{abs}}_\theta$ to denote the policy restricted to an allowed token set $\mathcal{A} = \mathcal{V}_{\text{abs}} \cup \{\dots\}$
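A minimal sketch of how such a restriction is commonly realized: logits of tokens outside the allowed set are set to negative infinity before sampling, so they receive zero probability. The function names, the toy vocabulary, and the greedy selection are illustrative assumptions, not the authors' implementation.

```python
import math


def constrained_logits(logits, allowed_ids):
    """Keep logits of allowed token ids; set all others to -inf."""
    allowed = set(allowed_ids)
    return [v if i in allowed else -math.inf for i, v in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; -inf entries map to probability 0."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of 6 ids; ids 3, 4, 5 stand in for the allowed set (assumption).
logits = [5.0, 1.0, 0.5, 2.0, 3.0, -1.0]
allowed_ids = [3, 4, 5]

masked = constrained_logits(logits, allowed_ids)
probs = softmax(masked)
next_id = max(range(len(masked)), key=lambda i: masked[i])  # greedy choice
print(next_id, [round(p, 3) for p in probs])
```

Although id 0 has the highest raw logit, it lies outside the allowed set and can never be selected.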
(1) Bottlenecked SFT with Abstract Tokens. Given $(x, c, y)$, we construct $\tilde{z}^{(t)}$ using a policy $\phi_t$. In the first iteration, we use random initialization: for each step $\ell \in S$ of the verbal CoT, where $S$ is the set of CoT steps and $|\ell|$ is the number of tokens in step $\ell$, we sample a random number of abstract tokens, $\text{rand}(1, |\ell|/2)$, and choose the specific tokens uniformly at random from $\mathcal{V}_{\text{abs}}$. We analyzed other initialization schemes (alphabetically cycling through the tokens, enforcing a power-law distribution), and found a uniform distribution over the tokens to be most effective. In subsequent iterations ($t \ge 2$), abstract sequences are generated on-policy: $\tilde{z}^{(t)} \sim \pi^{\text{abs}}_\theta(\cdot \mid x, c)$ under constrained decoding.
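The first-iteration random initialization can be sketched as follows. This is an illustration under stated assumptions: whitespace-separated words stand in for tokenizer tokens, the upper bound on the per-step count is taken to be half the step length (floored, with a minimum of 1), and `random_abstract_trace` is a hypothetical helper name.

```python
import random
import string


def random_abstract_trace(cot_steps, abstract_vocab, rng):
    """Sample a uniform-random abstract trace, a few tokens per CoT step."""
    trace = []
    for step_tokens in cot_steps:
        upper = max(1, len(step_tokens) // 2)   # assumed bound: |step| / 2
        n = rng.randint(1, upper)               # inclusive on both ends
        trace.extend(rng.choice(abstract_vocab) for _ in range(n))
    return trace


abstract_vocab = list(string.ascii_uppercase)   # toy codebook A..Z
cot_steps = [
    "Let d be the A-B distance in km".split(),
    "d / 60 + 0.5 + d / 80 = 4".split(),
    "7 d / 240 = 7 / 2 so d = 120".split(),
]
trace = random_abstract_trace(cot_steps, abstract_vocab, random.Random(0))
print(" ".join(trace))
```

Because every step contributes at most half its length, the resulting trace is always shorter than the verbal CoT it accompanies.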
We form a single concatenated training sequence $s = [x; c; \tilde{z}; y]$ and define a block-structured attention mask $A$ that enforces an information bottleneck. Let indices be partitioned into prompt ($X$), verbal CoT ($C$), abstract sequence ($Z$) and answer ($Y$). The abstract tokens attend to the prompt and the verbal CoT; that is:
$$A_{i,j} = 1, \quad i \in Z,\ j \in X \cup C \cup Z_{\le i}$$
Crucially, the answer only attends to the prompt and the abstract tokens, not to the verbal CoT, with all other entries following standard causal masking:
$$A_{i,j} = \begin{cases} 1 & i \in Y,\ j \in X \cup Z \cup Y_{\le i} \\ 0 & i \in Y,\ j \in C \end{cases}$$
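The block structure of this mask can be made concrete with a small sketch. It assumes the segments are laid out contiguously in the order prompt, verbal CoT, abstract sequence, answer; the function name `bottleneck_mask` and the segment lengths are illustrative.

```python
def bottleneck_mask(n_x, n_c, n_z, n_y):
    """Causal mask in which answer rows cannot see verbal-CoT columns."""
    n = n_x + n_c + n_z + n_y
    c_start, c_end = n_x, n_x + n_c
    y_start = n_x + n_c + n_z
    # Start from standard causal masking: position i sees positions j <= i.
    mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    # Bottleneck: answer positions are blocked from the verbal CoT.
    for i in range(y_start, n):
        for j in range(c_start, c_end):
            mask[i][j] = 0
    return mask


mask = bottleneck_mask(n_x=2, n_c=3, n_z=2, n_y=2)
for row in mask:
    print("".join(str(v) for v in row))
```

In the printed matrix, the last two rows (the answer) contain zeros in columns 2 to 4 (the verbal CoT), while the abstract rows still see those columns.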
Algorithm 1: Policy Iteration Warm-Up for Abstract-CoT
Require: Training data $\mathcal{D} = \{(x, c, y)\}$, abstract vocabulary $\mathcal{V}_{\text{abs}}$, iterations $T$
- 1: Initialize $\theta^{(0)}$ from base instruction-tuned model; add new token embeddings for $\mathcal{V}_{\text{abs}}$
- 2: for $t = 1$ to $T$ do
- 3: Data for current iteration: $\mathcal{D}_{t,1}, \mathcal{D}_{t,2} \subset \mathcal{D}^{(t)}$
- 4: Generate abstract traces $\tilde{z}^{(t)} \sim \phi_t(\cdot \mid x, c, \theta^{(t-1)})$ (random if $t = 1$, else constrained decoding with $\phi_t = \pi_{\theta^{(t-1)}}$) for $(x, c, y) \in \mathcal{D}_{t,1}$
- 5: Update $\bar{\theta}^{(t)} \leftarrow \arg\min_\theta \mathbb{E}_{(x, c, y) \sim \mathcal{D}_{t,1}} [\mathcal{L}_{\text{SFT}}(\theta; x, c, \tilde{z}^{(t)}, y; A)]$
- 6: Distill: generate $\tilde{z}' \sim \pi^{\text{abs}}_{\bar{\theta}^{(t)}}(\cdot \mid x)$ for $(x, y) \in \mathcal{D}_{t,2}$
- 7: Starting from $\bar{\theta}^{(t)}$, update $\theta^{(t)} \leftarrow \arg\min_\theta \mathbb{E}_{(x, y) \sim \mathcal{D}_{t,2}} [\mathcal{L}_{\text{Distill}}(\theta; x, \tilde{z}', y)]$
- 8: end for
- 9: return $\theta^{(T)}$
Concretely, this training procedure can be seen as implementing a discrete latent bottleneck; let $H_{Z_{\text{abs}}}$ denote the hidden states at the abstract token positions $Z_{\text{abs}}$ produced from the prefix $[x; c; \tilde{z}]$ following masking of the verbal CoT. The only dependence of answer generation ($y$) on the verbal CoT ($c$) is through $H_{Z_{\text{abs}}}$, inducing conditional Markov structure:
$$c \rightarrow H_{Z_{\text{abs}}} \rightarrow y \quad \text{(conditioned on } x\text{)}$$
By the data processing inequality, any dependence between $y$ and $c$ must be bounded by the information that can be transmitted through the abstract segment:
$$I(c; y \mid x) \le I(c; H_{Z_{\text{abs}}} \mid x)$$
Since the size of $H_{Z_{\text{abs}}}$ scales linearly with the abstract sequence length $m$, tuning $m_{\max}$ affects the channel capacity from $c$ to $y$ during warm-up.
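We then optimize a masked SFT objective that trains on the abstract sequence and the answer while hiding the verbal CoT with bottleneck attention mask $A$:
$$\mathcal{L}_{\text{SFT}}(\theta; x, c, \tilde{z}, y; A) = -\sum_{j \in Z \cup Y} \log \pi_\theta(s_j \mid s_{<j}; A)$$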
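The loss masking in this objective can be sketched independently of any model: given per-position log-probabilities of the target tokens, only positions in the abstract and answer segments contribute. The segment labels, the example values, and the choice to average rather than sum are assumptions made for illustration.

```python
def masked_sft_loss(token_logprobs, segments, trained=("Z", "Y")):
    """Mean negative log-likelihood over positions in the trained segments."""
    picked = [lp for lp, seg in zip(token_logprobs, segments) if seg in trained]
    return -sum(picked) / len(picked)


# One label per position of s = [x; c; z; y]; prompt and verbal-CoT
# positions are present in the sequence but excluded from the loss.
segments = ["X", "X", "C", "C", "C", "Z", "Z", "Y", "Y"]
token_logprobs = [-1.0, -2.0, -0.5, -0.5, -0.5, -0.7, -0.3, -0.2, -0.8]

loss = masked_sft_loss(token_logprobs, segments)
print(loss)
```

Passing `trained=("Y",)` instead would supervise the answer alone, leaving the model with no direct signal for producing the abstract tokens themselves.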
(2) Self-Distillation Without Verbal CoT. The bottlenecked SFT stage exploits $c$ to shape the hidden states at abstract-token positions, but our target policy ultimately should produce abstract tokens from the prompt alone; this motivates computing the loss on $\tilde{z}^{(t)}$ in the previous stage. We create a distillation dataset by generating $\tilde{z} \sim \pi^{\text{abs}}_\theta(\cdot \mid x)$ via constrained decoding (with $m \le m_{\max}$) and pairing it with the gold answer $y$: $\mathcal{D}^{(t)}_{\text{distill}} = \{(x_i, \tilde{z}_i, y_i)\}_{i=1}^{N}$. In the discrete latent bottleneck interpretation, the self-distillation and RL (Section 3.3) phases tune the model's inference-time thinking budget. We train with standard causal SFT on $[x; \tilde{z}; y]$, where $s_j$ spans the abstract and response tokens:
$$\mathcal{L}_{\text{Distill}}(\theta; x, \tilde{z}, y) = -\sum_{j \in Z \cup Y} \log \pi_\theta(s_j \mid s_{<j})$$
We observe that the distribution learned via warm-starting exhibits a much clearer power-law shape compared to the more uniform distribution arising from cold-start RL. This indicates that Abstract-CoT effectively learns to allocate more computation to important tokens, leading to more efficient reasoning. The learned distributions suggest that warm-start is essential for achieving high sample efficiency with small models at inference time. We also note that the Abstract-CoT method can be further scaled to larger model sizes, such as Qwen3-32B, where it retains significant efficiency gains while matching performance on benchmarks like AlpacaEval and HotpotQA. The truncation analysis also reveals that Abstract-CoT is more robust to reduced thinking budgets than verbalized CoT, particularly on difficult tasks like MATH-500. Permutation tests further demonstrate that RL training teaches the model to use the abstract vocabulary in a structured, compositional manner, making it less robust to token shuffling compared to more naive versions.
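One simple way to quantify how power-law-like a token usage distribution is would be to fit a line to log-frequency against log-rank. The sketch below uses ordinary least squares on synthetic counts; it is an illustrative diagnostic, not the analysis used in the paper, and `rank_frequency_slope` is a hypothetical helper that needs at least two distinct tokens.

```python
import math
from collections import Counter


def rank_frequency_slope(tokens):
    """Least-squares slope of log(frequency) versus log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic usage with counts proportional to 1 / rank (Zipf-like).
zipf_like = ["A"] * 120 + ["B"] * 60 + ["C"] * 40 + ["D"] * 30
# Synthetic usage with equal counts for every token.
uniform = ["A"] * 10 + ["B"] * 10 + ["C"] * 10

print(rank_frequency_slope(zipf_like), rank_frequency_slope(uniform))
```

A slope near -1 corresponds to Zipf-like usage, while a slope near 0 corresponds to the flatter, more uniform usage described for cold-start RL.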