A few weeks ago, @EverMind published a paper on "Memory Sparse Attention" (MSA), describing a system that lets models remember up to 100 million tokens (roughly the length of 1,000 books) and still locate correct answers, with less than 9% performance loss. Across multiple benchmarks, this 4-billion-parameter model even outperformed RAG systems built on models 58 times its size.
The core concept is straightforward: instead of searching separate databases and hoping to find the correct information (as RAG algorithms do), MSA integrates memory directly into the model's cognitive framework. It learns end-to-end which information to retain and which to ignore, eliminating the need for a separate retrieval process.
Following the paper's release, the industry response was significant, with many calling for the source code to be released for research and collective progress. Yesterday, EverMind fulfilled that promise and open-sourced the entire project. In less than a day it garnered 2.5K stars, underscoring the intense interest in this technology.
Project Overview
Long-term memory is the foundation of general intelligence, yet the computational bottlenecks of full attention limit the effective context length of most large language models to 128K–1M tokens. Existing solutions—hybrid linear attention, fixed-size state memory (e.g., RNNs), and external storage like RAG/Agents—either suffer from rapid accuracy degradation and increasing latency at extreme lengths, lack end-to-end differentiability or dynamic memory management, or require complex engineering pipelines. We propose Memory Sparse Attention (MSA): an end-to-end trainable, scalable sparse implicit state memory framework. Key concepts include:
- Scalable Sparse Attention + Document-level RoPE (Parallel/Global), achieving near-linear complexity in both training and inference;
- KV Cache Compression paired with a Memory Parallel inference engine, enabling inference over 100-million-token contexts on 2×A800 GPUs;
- A Memory Interleaving mechanism supporting multi-turn, multi-hop reasoning across dispersed memory fragments.
In long-context Q&A and NIAH (Needle In A Haystack) benchmarks, MSA outperforms same-backbone RAG, top-tier RAG solutions, and leading long-context models. Across an unprecedented range of 16K to 100 million tokens, MSA exhibits less than 9% performance degradation, offering a viable path to decoupling memory capacity from reasoning capabilities.
Scaling from 16K to 100 Million Tokens: MSA fuses top-k selection with sparse attention, allowing document decoupling during inference while maintaining end-to-end differentiability. On MS MARCO, MSA shows less than 9% performance drop and demonstrates strong extrapolation capabilities. Some baseline curves end prematurely due to context length limits.
Figure 1: Scalability of MSA under ultra-long contexts
Core Contributions
- Memory Sparse Attention (MSA): An end-to-end trainable, scalable sparse attention layer combined with document-level RoPE, achieving O(L) complexity with less than 9% performance drop from 16K to 100 million tokens.
- KV Cache Compression + Memory Parallelism: Hierarchical storage (routing keys on GPU, content K/V on CPU) with distributed scoring and on-demand transmission, enabling 100 million token inference on 2×A800 GPUs.
- Memory Interleaving: Adaptively alternates between "generative retrieval → context expansion → generation," significantly enhancing multi-hop reasoning across documents.
- Comprehensive Evaluation: MSA surpasses same-backbone RAG, top-tier RAG schemes, and leading long-context models on long-context Q&A and NIAH benchmarks, demonstrating superior stability and accuracy at scale.
Overall Design
Architecture
MSA integrates retrieval and generation into a single differentiable closed loop. Document implicit states (K/V/Kᵣ) are compressed via chunked mean pooling. A routing projector calculates relevance using cosine similarity (averaging over attention heads, then taking the max over tokens) to select the Top‑k documents. Their compressed K/V is then concatenated with the query's local K/V for autoregressive decoding. Routing applies only to the upper layers; lower layers maintain independent document processing to achieve hierarchical alignment.
- Parallel (Document-level) RoPE: Position indices reset to 0 for each document, preventing position drift between short-context training and long-context inference, which allows a model trained at 64K to extrapolate to 100 million tokens.
- Global RoPE (Active Context): The query's start index is offset by k (number of Top‑k retrieved blocks) to maintain causal order: Background → Query → Generation.
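As a concrete illustration, the routing rule above (cosine similarity, averaged over attention heads, then max over query tokens, followed by Top‑k selection) can be sketched in a few lines of NumPy. All shapes and names here are illustrative assumptions, not the released implementation:

```python
import numpy as np

def route_top_k(q_r, doc_keys, k=2):
    """Score documents against the query and pick the top-k.

    q_r      : query routing states, shape (heads, q_tokens, dim)
    doc_keys : list of pooled routing keys, each (heads, dim)
    Per the paper's recipe: cosine similarity, averaged over
    attention heads, then max over query tokens.
    """
    qn = q_r / np.linalg.norm(q_r, axis=-1, keepdims=True)
    scores = []
    for kr in doc_keys:
        kn = kr / np.linalg.norm(kr, axis=-1, keepdims=True)
        sim = np.einsum("htd,hd->ht", qn, kn)   # (heads, q_tokens)
        scores.append(sim.mean(axis=0).max())   # mean heads, max tokens
    order = np.argsort(scores)[::-1][:k]
    return sorted(order.tolist())

# toy example: 4 documents, 2 heads, 3 query tokens, dim 8
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 8))
docs = [rng.normal(size=(2, 8)) for _ in range(4)]
docs[2] = q[:, 0, :] + 0.01 * rng.normal(size=(2, 8))  # plant a match
print(route_top_k(q, docs, k=2))
```

The selected documents' compressed K/V would then be concatenated ahead of the query's local K/V for decoding; because the scoring is built from differentiable ops, gradients can flow through it during training.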
Figure 2: Memory Sparse Attention layer (sparse attention + document-level Parallel/Global RoPE)
Inference Pipeline
MSA employs a three-stage pipeline (Figure 3):
- Global Memory Encoding (Offline): Forward pass over the corpus to cache chunked pooled (K̄, V̄, K̄ᵣ).
- Online Routing & Context Assembly: Project the query to Qᵣ, match with K̄ᵣ to select Top‑k, then load only the selected K̄/V̄ and concatenate with local context.
- Sparse Generation: Perform autoregressive generation on the sparse context.
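The offline encoding stage can be illustrated with a toy sketch of chunked mean pooling; chunk size, shapes, and function names are assumptions for illustration, not the released code:

```python
import numpy as np

def chunk_mean_pool(x, chunk=4):
    """Mean-pool per-token states into fixed-size memory slots.
    x: (tokens, dim) -> (ceil(tokens/chunk), dim)."""
    return np.stack([x[i:i + chunk].mean(axis=0)
                     for i in range(0, len(x), chunk)])

def encode_corpus(docs_kv, chunk=4):
    """Offline pass: cache pooled (K, V) per document.
    docs_kv: list of (K, V) pairs, each of shape (tokens, dim)."""
    return [(chunk_mean_pool(k, chunk), chunk_mean_pool(v, chunk))
            for k, v in docs_kv]

rng = np.random.default_rng(1)
corpus = [(rng.normal(size=(10, 8)), rng.normal(size=(10, 8)))
          for _ in range(3)]
cache = encode_corpus(corpus, chunk=4)
print(cache[0][0].shape)  # 10 tokens / chunk 4 -> 3 slots: (3, 8)
```

The routing keys K̄ᵣ would be produced the same way from the routing projection, so the online stage only ever touches these compressed states.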
Memory Parallelism shards K̄ᵣ across multiple GPUs (broadcast query → local scoring → global reduction). Content K̄/V̄ resides in host memory and is asynchronously fetched when selected—balancing VRAM and throughput for 100 million token deployments.
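The broadcast → local scoring → global reduction pattern can be simulated on a single host, with plain arrays standing in for per-GPU shards (a sketch under assumed shapes; the real engine additionally overlaps CPU→GPU transfers of the selected content K̄/V̄):

```python
import numpy as np

def sharded_top_k(q, key_shards, k=2):
    """Simulate Memory Parallel routing on one host.

    q          : pooled query routing vector, shape (dim,)
    key_shards : per-"GPU" routing-key arrays, each (n_i, dim)
    Each shard scores its own keys against the broadcast query
    (local scoring); the per-shard winners are then merged into a
    single global top-k (reduction).
    """
    qn = q / np.linalg.norm(q)
    candidates, offset = [], 0
    for shard in key_shards:                    # "broadcast" q
        kn = shard / np.linalg.norm(shard, axis=1, keepdims=True)
        scores = kn @ qn                        # local cosine scores
        for i in np.argsort(scores)[::-1][:k]:  # keep local top-k only
            candidates.append((scores[i], offset + int(i)))
        offset += len(shard)
    candidates.sort(reverse=True)               # global reduction
    return [idx for _, idx in candidates[:k]]

rng = np.random.default_rng(2)
q = rng.normal(size=8)
keys = rng.normal(size=(9, 8))
keys[5] = q                      # plant an exact match in shard 2
shards = [keys[:3], keys[3:6], keys[6:]]
best = sharded_top_k(q, shards, k=2)
print(best[0])  # -> 5 (the planted match wins the reduction)
```

Because each shard only forwards its local top-k, the reduction traffic is tiny compared to the sharded key storage, which is what makes the 100-million-token deployment tractable.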
Figure 3: Three-stage inference (offline encoding → online routing → sparse generation), with optional multi-turn memory interleaving for multi-hop reasoning
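The optional interleaving loop alternates generative retrieval, context expansion, and generation until the model commits to an answer. A toy control-flow sketch, in which all functions and the two-entry memory are hypothetical stand-ins:

```python
def interleaved_answer(question, memory, route, generate, max_hops=3):
    """Illustrative Memory Interleaving loop.

    route(query, memory) returns retrieved fragments; generate(...)
    returns ("retrieve", follow_up_query) when more evidence is
    needed, or ("answer", text) when the context suffices.
    """
    context, query = [], question
    for _ in range(max_hops):
        context += route(query, memory)          # context expansion
        kind, out = generate(question, context)  # sparse generation
        if kind == "answer":
            return out
        query = out                              # next-hop sub-query
    return generate(question, context)[1]        # forced final answer

# Toy two-hop chain: question -> capital -> population.
memory = {"capital_of_X": "CityY", "population_of_CityY": "1M"}
def route(q, mem):
    return [mem[q]] if q in mem else []
def generate(question, ctx):
    if "1M" in ctx:
        return ("answer", "1M")
    if "CityY" in ctx:
        return ("retrieve", "population_of_CityY")
    return ("retrieve", "capital_of_X")

print(interleaved_answer("What is the population of X's capital?",
                         memory, route, generate))  # -> 1M
```

In MSA the "retrieve" branch corresponds to the model emitting a generative-retrieval query that re-enters the routing stage, so multi-hop evidence accumulates across dispersed memory fragments.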
Experimental Results
Experimental Setup
- Q&A: 9 datasets (MS MARCO v1, NQ, DuReader, TriviaQA (10M), NarrativeQA, PopQA, 2WikiMultiHopQA, HotpotQA, MuSiQue); memory bank sizes from 277K to 10M tokens; metric: LLM Score (0–5).
- NIAH (RULER): 8 subtasks, 32K→1M tokens, reporting average accuracy.
- Backbone model: Qwen3‑4B‑Instruct‑2507, compared against same-backbone RAG and top-tier RAG solutions (KaLMv2 retriever + large LLM generator, with optional re-ranking).
Table 2: MSA vs. Same-Backbone RAG (Qwen3‑4B)
Summary: Average score 3.760, improving by +16.0% over standard RAG, +11.5% over RAG+Re-ranking, and +14.8% over HippoRAG2 (best @k for each); MSA leads on all datasets except NarrativeQA within the same-backbone group.
| Dataset | Tokens | Qwen3‑4B RAG @1 | @5 | @10 | Qwen3‑4B RAG+RR @1 | @5 | @10 | HippoRAG2 @1 | @5 | @10 | MSA (Adaptive) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MS MARCO v1 | 7.34M | 2.893 | 3.011 | 3.005 | 2.934 | 3.032 | 3.017 | 2.676 | 3.005 | 3.019 | 4.141 |
| Natural Questions | 1.47M | 3.452 | 3.374 | 3.297 | 3.494 | 3.408 | 3.385 | 3.338 | 3.389 | 3.374 | 3.545 |
| DuReader | 277K | 3.726 | 3.579 | 3.594 | 3.848 | 3.618 | 3.607 | 2.941 | 3.485 | 3.415 | 4.155 |
| TriviaQA (10M) | 10M | 4.133 | 4.414 | 4.273 | 4.313 | 4.375 | 4.391 | 4.188 | 4.430 | 4.367 | 4.621 |
| NarrativeQA | 538K | 1.611 | 2.567 | 2.860 | 3.638 | 3.492 | 3.536 | 1.959 | 2.628 | 2.655 | 3.395 |
| PopQA | 1.18M | 2.959 | 3.273 | 3.299 | 3.315 | 3.264 | 3.266 | 3.111 | 3.249 | 3.249 | 3.433 |
| 2WikiMultiHopQA | 722K | 1.065 | 3.055 | 3.136 | 1.187 | 3.057 | 3.159 | 1.045 | 3.180 | 3.330 | 4.280 |
| HotpotQA | 1.35M | 2.252 | 3.582 | 3.787 | 2.642 | 3.990 | 4.022 | 3.230 | 3.770 | 3.970 | 4.061 |
| MuSiQue | 1.41M | 0.936 | 1.752 | 1.928 | 1.144 | 1.960 | 1.965 | 1.020 | 1.907 | 2.095 | 2.211 |
| Average | — | 2.559 | 3.179 | 3.242 | 2.946 | 3.355 | 3.372 | 2.612 | 3.227 | 3.275 | 3.760 |
Table 2: Same-backbone RAG vs. MSA (@1/@5/@10 vs. MSA @Adaptive)
Table 3: MSA vs. Top-Tier RAG (Large Backbone Models)
Summary: Compared to KaLMv2+Qwen3‑235B and KaLMv2+Llama‑3.3‑70B (with/without re-ranking), MSA achieved the highest scores on 4/9 datasets. With an average score of 3.760, it improved upon the strongest configurations by +7.2%, +5.0%, +10.7%, and +5.4% respectively. Gaps on a few datasets (e.g., MuSiQue) are primarily due to differences in parameter count and inherent reasoning capabilities.
| Dataset | KaLMv2 + Qwen3‑235B @1 | @5 | @10 | Qwen3‑235B (RR) @1 | @5 | @10 | KaLMv2 + Llama‑3.3‑70B @1 | @5 | @10 | Llama‑3.3 (RR) @1 | @5 | @10 | MSA (Adaptive) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MS MARCO v1 | 2.846 | 3.028 | 3.027 | 2.886 | 3.020 | 2.995 | 2.649 | 2.904 | 2.919 | 2.881 | 2.955 | 2.952 | 4.141 |
| Natural Questions | 3.711 | 3.670 | 3.694 | 3.621 | 3.610 | 3.645 | 3.675 | 3.674 | 3.662 | 3.756 | 3.665 | 3.647 | 3.545 |
| DuReader | 4.044 | 3.991 | 3.978 | 3.973 | 3.932 | 3.891 | 4.051 | 3.846 | 3.742 | 3.967 | 3.776 | 3.780 | 4.155 |
| TriviaQA (10M) | 4.367 | 4.656 | 4.578 | 4.492 | 4.320 | 4.555 | 4.273 | 4.740 | 4.719 | 4.547 | 4.703 | 4.695 | 4.621 |
| NarrativeQA | 1.413 | 2.130 | 2.427 | 3.212 | 3.427 | 3.375 | 1.290 | 2.123 | 2.382 | 3.150 | 3.263 | 3.317 | 3.395 |
| PopQA | 2.810 | 3.347 | 3.396 | 3.268 | 3.380 | 3.376 | 2.787 | 3.298 | 3.305 | 3.337 | 3.384 | 3.362 | 3.433 |
| 2WikiMultiHopQA | 2.646 | 3.579 | 3.582 | 1.855 | 3.381 | 3.583 | 1.339 | 3.263 | 3.445 | 1.651 | 3.332 | 3.541 | 4.280 |
| HotpotQA | 3.497 | 4.090 | 4.225 | 3.341 | 4.141 | 4.194 | 3.070 | 3.896 | 4.127 | 3.428 | 4.145 | 4.203 | 4.061 |
| MuSiQue | 1.988 | 2.462 | 2.647 | 1.801 | 2.522 | 2.605 | 1.704 | 2.317 | 2.258 | 1.895 | 2.462 | 2.614 | 2.211 |
| Average | 3.036 | 3.439 | 3.506 | 3.161 | 3.526 | 3.580 | 2.760 | 3.340 | 3.396 | 3.179 | 3.521 | 3.568 | 3.760 |
Table 3: Top-tier RAG solutions (Strong Retriever + Large Generator + Optional Re-ranking) vs. MSA
Figure 4: RULER NIAH Stability (32K→1M)
Summary: MSA maintains 94.84% accuracy even at 1 million tokens. The unmodified backbone model drops sharply after 128K (only 24.69% at 1M). Hybrid linear attention long-context models degrade significantly at ≥128K/256K. External memory agents (e.g., RL‑MemoryAgent‑14B) are relatively stable but have lower absolute accuracy and greater decay than MSA.
Figure 4: Accuracy vs. Context Length (Higher is better)
Implementation Details
- Training: continued pre-training on 158.95 billion tokens with an auxiliary routing loss, followed by two-stage SFT with an 8k→64k curriculum.
- Ablation Studies (Paper Table 4): Curriculum expansion, memory interleaving, continuous pre-training, and raw text injection all contribute significantly; removing them causes performance drops ranging from 5% to 37%.
Quick Start
For full instructions (project structure, supported benchmarks, etc.), please refer to QUICK_START.md.
1. Installation

```shell
conda create -n msa python=3.12 -y && conda activate msa
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
```

2. Download Model

```shell
mkdir ckpt
huggingface-cli download --resume-download EverMind-AI/MSA-4B --local-dir ckpt/MSA-4B
```

3. Download Benchmark Data
Benchmark data is hosted at EverMind-AI/MSA-RAG-BENCHMARKS and will automatically download to the data/ directory upon first run.
4. Run

```shell
# Run inference on benchmarks
bash scripts/run_benchmarks.sh eval_benchmark

# Calculate LLM scores
bash scripts/calculate_llm_score.sh eval_benchmark
```