While everyone is eagerly awaiting DeepSeek-V4, a new research paper has caught the community's attention.
The paper introduces a new sparse attention mechanism called HISA (Hierarchical Indexing Sparse Attention). It breaks through the indexing bottleneck for 64K contexts, offering a 2-4x speedup compared to the DSA (DeepSeek Sparse Attention) currently used by DeepSeek.
Beyond the massive speed increase, HISA maintains near-perfect accuracy and is plug-and-play, requiring no retraining.
The researchers replaced the indexers in DeepSeek-V3.2 and GLM-5 directly without any fine-tuning. Accuracy in tasks such as retrieving key information and long-text understanding remained nearly identical to the original methods.
Eliminating Context Indexing Bottlenecks in Two Steps
The goal of this paper is clear: replace the "retriever" in the sparse attention mechanism of Large Language Models (LLMs) with a more efficient one.
Current mainstream token-level sparse attention, such as DSA, reduces computation costs by only calculating attention for key tokens. However, this design has a critical hidden flaw: to identify the relevant tokens, an "indexer" must first score, for every query position, all of the preceding tokens and select the highest-scoring ones.
As text length (L) increases, the workload for this scoring grows quadratically (O(L²)). For example, doubling the length quadruples the workload. In ultra-long contexts, the quadratic cost of the indexer becomes the primary bottleneck, sometimes taking more time than the actual attention calculation.
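To make the quadratic cost concrete, here is a minimal numpy sketch of flat token-level indexing. The dot-product scoring rule, the function name, and all parameters are illustrative assumptions, not the paper's actual kernel:

```python
import numpy as np

def flat_index_topk(q_feats, k_feats, k=4):
    """Flat token-level indexing (illustrative sketch): at position t the
    indexer scores all t+1 prior tokens, so total work is 1+2+...+L = O(L^2)."""
    L = q_feats.shape[0]
    selected = []
    for t in range(L):
        scores = k_feats[: t + 1] @ q_feats[t]        # t+1 dot products at step t
        keep = min(k, t + 1)
        selected.append(np.argsort(scores)[-keep:])   # indices of top-k prior tokens
    return selected
```

Doubling L doubles both the number of positions and the average number of tokens scored per position, which is exactly the 4x workload growth described above.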
The research team asked: Can we reduce the search cost of the indexer without changing the final sparse attention result?
Their solution, HISA (Hierarchical Indexing Sparse Attention), follows a simple core logic: instead of scoring every token individually, first filter out most irrelevant content using blocks, and then perform fine-grained selection within the remaining small blocks.
The tokens finally selected are identical to those selected by the original method. Consequently, the subsequent attention calculations require no modification. It is essentially a "more efficient sieve that catches the same materials."
The process consists of two steps, reusing the original model's scoring rules to ensure zero learning cost:
Step 1: Block-Level Coarse Filtering
- The long text is divided into fixed-size "token blocks" (e.g., 128 tokens per block). A "global feature vector" is calculated for each block, acting as a summary label.
- The original indexer's scoring method is used to score only these block labels.
- The top m blocks (e.g., 64 blocks) are selected, and all other blocks are discarded. Since the number of blocks is far smaller than the number of tokens, this drastically reduces the workload.
Step 2: Intra-Block Fine Selection
Within the m selected blocks, the original indexer rules are used to score individual tokens to pick the final k relevant tokens.
A minor optimization was added: the first and last blocks are always selected. This ensures that initial background information and the most recent context are not accidentally filtered out and handles boundary issues during text concatenation.
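Putting the two steps together, a minimal numpy sketch for a single query position might look like the following. The mean-pooled block summary, the dot-product scoring rule, and all names and parameter values are illustrative assumptions; the paper reuses whatever scoring rule the original indexer already defines:

```python
import numpy as np

def hisa_select(q, keys, block_size=128, m=8, k=64):
    """Two-step hierarchical token selection for one query position (sketch).

    q: (d,) indexer feature of the current query.
    keys: (L, d) indexer features of all prior tokens.
    Returns indices of the k tokens chosen for sparse attention.
    """
    L = keys.shape[0]
    n_blocks = (L + block_size - 1) // block_size

    # Short context: fall back to plain flat scoring (hierarchy buys nothing).
    if n_blocks <= m + 2:
        return np.argsort(keys @ q)[-min(k, L):]

    # Step 1: coarse filtering. One summary vector per block
    # (mean-pooled here -- an illustrative choice of "global label").
    block_feats = np.stack([keys[i * block_size:(i + 1) * block_size].mean(axis=0)
                            for i in range(n_blocks)])
    kept = set(np.argsort(block_feats @ q)[-m:].tolist())
    kept |= {0, n_blocks - 1}           # always keep the first and last block

    # Step 2: fine selection. Score individual tokens inside kept blocks only.
    cand = np.concatenate([np.arange(b * block_size, min((b + 1) * block_size, L))
                           for b in sorted(kept)])
    return cand[np.argsort(keys[cand] @ q)[-min(k, len(cand)):]]
```

Because the fine-selection step applies the original scoring rule to every token inside the kept blocks, the final top-k set matches the flat indexer whenever the relevant tokens land in kept blocks, which is what keeps downstream attention unchanged.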
The key advantage of HISA is that it slashes complexity while remaining a "seamless replacement."
HISA reduces the computational cost of the original indexer from O(L²) to O(L²/B + L×m×B), where B is the block size and m is the number of selected blocks. The longer the text, and the smaller the fraction of blocks that must be kept, the more pronounced the speedup.
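Plugging in illustrative numbers (L = 64K, B = 128, m = 64; assumed values, not necessarily the paper's exact settings) shows the scale of the theoretical op-count reduction:

```python
L, B, m = 64 * 1024, 128, 64      # context length, block size, kept blocks (assumed values)

original = L * L                  # O(L^2): flat indexer scores every prior token
coarse   = L * (L // B)           # O(L^2/B): one score per block summary
fine     = L * m * B              # O(L*m*B): token scores inside the m kept blocks

ratio = original / (coarse + fine)
print(round(ratio, 1))            # → 7.5
```

The measured 2-4x wall-clock speedups reported below are smaller than this raw op-count ratio, since real kernels also pay memory-bandwidth and launch overheads that do not shrink with the arithmetic.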
Furthermore, it is highly engineering-friendly:
- The output is identical to the original indexer, so downstream attention modules remain unchanged.
- No retraining or changes to the KV cache structure are required; it is a direct replacement.
- For short texts, it automatically falls back to the original method, triggering hierarchical filtering only for ultra-long contexts, making it fully adaptive.
Empirical Results: Massive Speedups, Minimal Accuracy Loss
The team conducted comprehensive tests on two mainstream models, DeepSeek-V3.2 and GLM-5, with impressive results:
In terms of speed, with a text length of 64K, HISA achieved a maximum speedup of 3.75x over the original DSA indexer, with a typical speedup of over 2x.
Indexer latency dropped from approximately 5.6ms to 1.5ms, effectively eliminating the indexing bottleneck. This makes HISA perfectly suited for ultra-long context (128K/1M) applications.
In terms of accuracy, HISA almost entirely preserved the precision of the original DSA and significantly outperformed pure block-sparse methods.
The team performed a "Needle In A Haystack" test to measure the ability to accurately retrieve specific key information from ultra-long irrelevant text. HISA's accuracy was nearly identical to DSA, achieving near-perfect retrieval precision across all lengths and insertion depths.
On the LongBench benchmark for long-text understanding, HISA scores remained basically on par with DSA. In some scenarios, such as synthetic retrieval and few-shot learning where token selection precision is critical, HISA even slightly outperformed the original.
Hyperparameter tests showed that HISA remains stable across different block sizes and selection counts, with no significant performance variance compared to DSA.
This indicates that HISA is robust and does not require meticulous parameter tuning for production deployment.
The authors also identified a few areas for future improvement:
- Adaptive Blocks: Currently, blocks are fixed. If a block contains a mix of relevant and irrelevant content, the "global label" may be inaccurate. Future work could explore adaptive or overlapping blocks.
- Joint Training: Currently used only during inference; future iterations could train the block filter and model together.
- End-to-End Testing: Future tests will integrate HISA into a full LLM service framework to measure end-to-end throughput and latency.
Research Team Background
This research was led by Muhan Zhang's team at Peking University.
Muhan Zhang is a tenured Assistant Professor and PhD supervisor at the Institute for Artificial Intelligence, Peking University. Before returning to China, he was a researcher at Facebook AI (now Meta AI), focusing on large-scale graph learning systems.
With over 13,000 citations on Google Scholar, including two first-author papers with 3,100+ and 2,400+ citations, he has been listed among the top 2% of scientists globally by Elsevier for several consecutive years. Yufei Xu and Fanxu Meng are the co-first authors of the paper.
Reference: https://arxiv.org/abs/2603.28458