Editor's Note: Slow inference and excessive token consumption have long been two stubborn obstacles to the real-world deployment of large language models. On April 13, two new papers were simultaneously posted on arXiv, offering high-quality solutions from two distinct angles. CSAttention accelerates attention mechanisms for 128K long contexts by 4.6×, while STACK compresses reasoning chain tokens by 59.9% while actually improving accuracy by 4.8%. One tackles "slowness," the other addresses "length," creating a complementary pair with significant industry implications.
The Challenge of Attention Computation: Can 95% Sparsity Avoid Accuracy Loss?
The bottleneck in long-context inference has always centered on attention computation and KV Cache read/write operations. Sparse attention is a widely accepted solution, yet the industry has been stuck on a trade-off: the higher the sparsity, the greater the accuracy loss. Methods like H2O and SnapKV often suffer significant accuracy degradation once sparsity exceeds 80%.
CSAttention (Centroid Scoring Attention), from arXiv:2604.08584, directly confronts this trade-off. Its core insight is that Query distributions in long texts are non-uniform, so centroid clustering can predict in advance which Keys will be high-value.
The approach trades storage for computation: the heavy scoring work is pushed into the offline pre-fill phase, where a fixed-size Query-Centroid lookup table is built for each request. During online decoding, full-context scanning is replaced by O(1)-level table lookups, while score accumulation stays GPU-friendly.
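To make the mechanism concrete, here is a minimal sketch of centroid-based key scoring in a simplified single-head setting. All function names, the crude dot-product clustering, and the parameter choices are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_centroid_table(keys, queries_sample, n_centroids=64, n_iters=5, seed=0):
    """Offline phase: cluster sample queries into centroids, then precompute
    each centroid's dot-product relevance against every cached key."""
    rng = np.random.default_rng(seed)
    # Crude k-means-style init: seed centroids from random sample queries.
    centroids = queries_sample[
        rng.choice(len(queries_sample), n_centroids, replace=False)
    ].copy()
    for _ in range(n_iters):  # a few refinement iterations
        assign = np.argmax(queries_sample @ centroids.T, axis=1)
        for c in range(n_centroids):
            members = queries_sample[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    table = centroids @ keys.T  # shape: (n_centroids, n_keys)
    return centroids, table

def sparse_attention(query, keys, values, centroids, table, keep_ratio=0.05):
    """Online phase: one centroid lookup picks the predicted high-value keys,
    then exact attention runs over only those keys (95% sparsity at 0.05)."""
    c = int(np.argmax(centroids @ query))      # nearest centroid to this query
    k = max(1, int(len(keys) * keep_ratio))
    idx = np.argpartition(table[c], -k)[-k:]   # top-k keys from the lookup table
    scores = keys[idx] @ query / np.sqrt(keys.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values[idx]
```

The point of the sketch is the cost structure: the per-request table costs one pass offline, after which each decode step touches only `n_centroids` dot products plus the selected 5% of keys, instead of scanning the full 128K context.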
Experimental Results:
- For 128K context length, it achieves a 4.6× speedup compared to the most accurate sparse baseline.
- At 95% sparsity, accuracy remains nearly identical to Full Attention.
- Requires no training; it is plug-and-play.
If these results hold up, CSAttention undercuts the long-standing assumption that high sparsity must come at the cost of accuracy.
Reasoning Chains Are Too Long: 60% of Tokens Are Essentially Fluff
On the other front, reasoning models (such as the DeepSeek-R1 series) have boosted accuracy through long Chain-of-Thought (CoT) processes, yet this introduces a new dilemma: "Overthinking." Models repeatedly self-verify, causing token counts to explode.
The STACK framework (State-Aware Reasoning Compression with Knowledge Guidance), detailed in arXiv:2604.09150, offers a fine-grained solution.
STACK's core judgment is that redundancy in reasoning chains is not uniformly distributed but concentrated in specific "states." It dynamically identifies the current reasoning state:
- Uncertain / Biased → Invokes retrieval augmentation to inject external knowledge guidance.
- Overly long but converged → Triggers self-prompting compression and early stopping.
These two modes switch dynamically based on confidence levels, with joint PPO+DPO training enabling the model to truly learn "when to stop."
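The dual-mode control logic above can be sketched as a simple decision rule. The confidence threshold, step limit, and all names here are hypothetical stand-ins; the paper's actual policy is learned via PPO+DPO rather than hand-coded.

```python
from dataclasses import dataclass

@dataclass
class ReasoningState:
    confidence: float   # model's certainty in its current partial answer
    steps: int          # reasoning steps emitted so far
    converged: bool     # whether the answer has been stable over recent steps

def next_action(state, low_conf=0.4, max_steps=32):
    """Pick between the two STACK-style modes, or keep reasoning."""
    if state.confidence < low_conf:
        return "retrieve"       # uncertain/biased: inject external knowledge
    if state.converged or state.steps >= max_steps:
        return "compress_stop"  # converged or too long: compress and stop early
    return "continue"
```

In the trained system, the role of this hand-written rule would be played by the policy itself, with confidence estimated from the model's own signals rather than supplied as a field.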
Experimental Results (across three mathematical reasoning benchmarks):
- Average response length shortened by 59.9%.
- Accuracy actually increased by 4.8% (not a trade-off, but a win-win).
The Combined Value of These Two Papers
| | CSAttention | STACK |
|---|---|---|
| Target | Attention/KV Cache Acceleration | Reasoning Chain Token Compression |
| Method | Centroid Clustering + Offline Lookup | State-Aware Dual-Mode Compression |
| Training Requirement | No Training Needed | PPO + DPO |
| Core Benefit | 4.6× Latency Reduction | 60% Token Reduction |
| Stage | Pre-fill + Decode | Reasoning Generation |
Since both operate on different inference bottlenecks, they can theoretically be used in combination: CSAttention manages attention efficiency, while STACK controls reasoning chain length, forming an end-to-end acceleration suite.
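A serving stack combining the two might be wired roughly as follows. Every class and method here is a hypothetical stand-in for illustration; neither paper specifies this interface.

```python
class CentroidAttention:
    """Stand-in for a CSAttention-style backend: fast prefill + sparse decode."""
    def prefill(self, prompt):
        return {"ctx": prompt}      # would build the query-centroid lookup table

class StackController:
    """Stand-in for STACK-style length control over the reasoning chain."""
    def __init__(self, max_steps=4):
        self.max_steps = max_steps
    def step(self, ctx, n):
        return f"step{n}", n + 1    # would emit one reasoning step
    def should_stop(self, n):
        return n >= self.max_steps  # would check confidence/convergence instead

def generate(prompt, attn, ctrl):
    """CSAttention handles attention efficiency; STACK caps chain length."""
    ctx, n, out = attn.prefill(prompt), 0, []
    while not ctrl.should_stop(n):
        tok, n = ctrl.step(ctx, n)
        out.append(tok)
    return out
```

The two pieces compose cleanly because they never touch the same state: one governs how each token attends over the context, the other governs how many tokens get generated at all.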
For engineers deploying in long-context, high-frequency scenarios (such as agents, RAG, and legal document analysis), these two papers are worth close attention.
Source: arXiv:2604.08584 (CSAttention), arXiv:2604.09150 (STACK / Think Less, Know More)