Editor's Note: Slow inference and excessive token consumption have long been two stubborn obstacles to the real-world deployment of large language models. On April 13, two new papers were simultaneously posted on arXiv, offering high-quality solutions from two distinct angles. CSAttention accelerates attention mechanisms for 128K long contexts by 4.6×, while STACK compresses reasoning chain tokens by 59.9% while actually improving accuracy by 4.8%. One tackles "slowness," the other addresses "length," creating a complementary pair with significant industry implications.
The Challenge of Attention Computation: Can 95% Sparsity Avoid Accuracy Loss?
The bottleneck in long-context inference has always centered on attention computation and KV Cache read/write operations. Sparse attention is a widely accepted solution, yet the industry has been stuck on a trade-off: the higher the sparsity, the greater the accuracy loss. Methods like H2O and SnapKV often suffer significant accuracy degradation once sparsity exceeds 80%.
CSAttention (Centroid Scoring Attention), from arXiv:2604.08584, directly confronts this trade-off. Its core insight is that Query distributions in long texts are non-uniform, so centroid clustering can predict in advance which Keys will be high-value.
The approach trades storage for computation: the heavy scoring work is pushed into the offline pre-fill phase, where a fixed-size Query-Centroid lookup table is built for each request. During online decoding, full-context scanning is replaced by O(1)-level table lookups, while score accumulation stays GPU-friendly.
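To make the mechanism concrete, here is a minimal sketch of centroid-based key scoring in a simplified single-head setting. All function names, the crude dot-product clustering, and the parameter choices are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_centroid_table(keys, queries_sample, n_centroids=64, n_iters=5, seed=0):
    """Offline phase: cluster sample queries into centroids, then precompute
    each centroid's dot-product relevance against every cached key."""
    rng = np.random.default_rng(seed)
    # Crude k-means-style init: seed centroids from random sample queries.
    centroids = queries_sample[
        rng.choice(len(queries_sample), n_centroids, replace=False)
    ].copy()
    for _ in range(n_iters):  # a few refinement iterations
        assign = np.argmax(queries_sample @ centroids.T, axis=1)
        for c in range(n_centroids):
            members = queries_sample[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    table = centroids @ keys.T  # shape: (n_centroids, n_keys)
    return centroids, table

def sparse_attention(query, keys, values, centroids, table, keep_ratio=0.05):
    """Online phase: one centroid lookup picks the predicted high-value keys,
    then exact attention runs over only those keys (95% sparsity at 0.05)."""
    c = int(np.argmax(centroids @ query))      # nearest centroid to this query
    k = max(1, int(len(keys) * keep_ratio))
    idx = np.argpartition(table[c], -k)[-k:]   # top-k keys from the lookup table
    scores = keys[idx] @ query / np.sqrt(keys.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values[idx]
```

The point of the sketch is the cost structure: the per-request table costs one pass offline, after which each decode step touches only `n_centroids` dot products plus the selected 5% of keys, instead of scanning the full 128K context.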
Experimental Results:
- For 128K context length, it achieves a 4.6× speedup compared to the most accurate sparse baseline.
- At 95% sparsity, accuracy remains nearly identical to Full Attention.
- Requires no training; it is plug-and-play.
If these results hold up, CSAttention undercuts the long-standing assumption that high sparsity must come at the cost of accuracy.
Reasoning Chains Are Too Long: 60% of Tokens Are Essentially Fluff
On the other front, reasoning models (such as the DeepSeek-R1 series) have boosted accuracy through long Chain-of-Thought (CoT) processes, yet this introduces a new dilemma: "Overthinking." Models repeatedly self-verify, causing token counts to explode.
The STACK framework (State-Aware Reasoning Compression with Knowledge Guidance), detailed in arXiv:2604.09150, offers a fine-grained solution.
STACK's core judgment is that redundancy in reasoning chains is not uniformly distributed but concentrated in specific "states." It dynamically identifies the current reasoning state:
- Uncertain / Biased → Invokes retrieval augmentation to inject external knowledge guidance.
- Overly long but converged → Triggers self-prompting compression and early stopping.
These two modes switch dynamically based on confidence levels, with joint PPO+DPO training enabling the model to truly learn "when to stop."
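The dual-mode control logic above can be sketched as a simple decision rule. The confidence threshold, step limit, and all names here are hypothetical stand-ins; the paper's actual policy is learned via PPO+DPO rather than hand-coded.

```python
from dataclasses import dataclass

@dataclass
class ReasoningState:
    confidence: float   # model's certainty in its current partial answer
    steps: int          # reasoning steps emitted so far
    converged: bool     # whether the answer has been stable over recent steps

def next_action(state, low_conf=0.4, max_steps=32):
    """Pick between the two STACK-style modes, or keep reasoning."""
    if state.confidence < low_conf:
        return "retrieve"       # uncertain/biased: inject external knowledge
    if state.converged or state.steps >= max_steps:
        return "compress_stop"  # converged or too long: compress and stop early
    return "continue"
```

In the trained system, the role of this hand-written rule would be played by the policy itself, with confidence estimated from the model's own signals rather than supplied as a field.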
Experimental Results (across three mathematical reasoning benchmarks):
- Average response length shortened by 59.9%.
- Accuracy actually increased by 4.8% (not a trade-off, but a win-win).
The Combined Value of These Two Papers
| | CSAttention | STACK |
|---|---|---|
| Target | Attention/KV Cache Acceleration | Reasoning Chain Token Compression |
| Method | Centroid Clustering + Offline Lookup | State-Aware Dual-Mode Compression |
| Training Requirement | No Training Needed | PPO + DPO |
| Core Benefit | 4.6× Latency Reduction | 60% Token Reduction |
| Stage | Pre-fill + Decode | Reasoning Generation |
Since both operate on different inference bottlenecks, they can theoretically be used in combination: CSAttention manages attention efficiency, while STACK controls reasoning chain length, forming an end-to-end acceleration suite.
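A serving stack combining the two might be wired roughly as follows. Every class and method here is a hypothetical stand-in for illustration; neither paper specifies this interface.

```python
class CentroidAttention:
    """Stand-in for a CSAttention-style backend: fast prefill + sparse decode."""
    def prefill(self, prompt):
        return {"ctx": prompt}      # would build the query-centroid lookup table

class StackController:
    """Stand-in for STACK-style length control over the reasoning chain."""
    def __init__(self, max_steps=4):
        self.max_steps = max_steps
    def step(self, ctx, n):
        return f"step{n}", n + 1    # would emit one reasoning step
    def should_stop(self, n):
        return n >= self.max_steps  # would check confidence/convergence instead

def generate(prompt, attn, ctrl):
    """CSAttention handles attention efficiency; STACK caps chain length."""
    ctx, n, out = attn.prefill(prompt), 0, []
    while not ctrl.should_stop(n):
        tok, n = ctrl.step(ctx, n)
        out.append(tok)
    return out
```

The two pieces compose cleanly because they never touch the same state: one governs how each token attends over the context, the other governs how many tokens get generated at all.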
For engineers deploying in long-context, high-frequency scenarios (such as agents, RAG, and legal document analysis), these two papers are worth close attention.
Source: arXiv:2604.08584 (CSAttention), arXiv:2604.09150 (STACK / Think Less, Know More)