Have you ever noticed a problem? The more you use AI, the more it tends to forget.
For instance, information gets lost once the context grows too long, memory doesn't carry over across conversations, and carefully designed multi-turn dialogue logic starts spouting nonsense by the eighth turn.
This is because the memory capacity of current underlying models has hit a ceiling.
Even the most powerful large models today have an effective context window of around 1M tokens. Researchers estimate that the amount of information a human can store and recall over a lifetime is approximately 200-300 million tokens, a gap of more than two orders of magnitude.
The industry has tried two paths: extending the context window (where computational cost grows quadratically with length and has hit a wall) and bolting on RAG (Retrieval-Augmented Generation), where retrieval and generation are disjointed, limiting accuracy.
Both paths have now reached a bottleneck.
Just as everyone was pondering the next breakthrough, the EverMind team quietly open-sourced an MSA (Memory Sparse Attention) project on GitHub.
It offers a completely different approach: instead of extending context or attaching external retrieval, it embeds memory directly into the attention mechanism itself.
Shortly after the project went live on GitHub, it attracted significant attention from developers, surging by over 2,600 Stars in just a few days.
GitHub: https://github.com/EverMind-AI/MSA
Understanding MSA in One Sentence
Compared to traditional RAG, MSA features a fundamentally different memory mechanism.
Traditional RAG is like equipping a model with an external hard drive, forcing it to search for information when needed. MSA is different; it’s like installing a native memory chip, making memory an intrinsic part of the model's capabilities.
This means retrieval and invocation are no longer two separate steps but are integrated within the same neural network, completed end-to-end.
The model learns for itself what to remember, how to find it, and how to use it. There is no manual rule intervention and no need for pipeline splicing or adaptation.
Furthermore, MSA is plug-and-play; developers only need to replace the Self-Attention layer of a standard Transformer without altering the overall model architecture.
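To make the "plug-and-play" claim concrete, here is a toy sketch of what replacing a Self-Attention layer can look like: the two classes below share the same call signature, so one can be swapped for the other without touching the rest of the model. Everything here is illustrative only; the class names, the top-k routing heuristic, and the plain-Python math are my own assumptions, not the actual MSA API.

```python
# Hypothetical sketch: neither class name comes from the MSA codebase.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

class SelfAttention:
    """Toy dense attention: every query attends to every key (O(L^2))."""
    def __call__(self, queries, keys, values):
        out = []
        for q in queries:
            scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
            weights = softmax(scores)
            out.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
        return out

class MemorySparseAttention:
    """Same call signature, but each query first routes to its top-k
    most relevant keys and attends only to those (sparse)."""
    def __init__(self, top_k=2):
        self.top_k = top_k

    def __call__(self, queries, keys, values):
        out = []
        for q in queries:
            scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
            idx = sorted(range(len(keys)),
                         key=lambda i: -scores[i])[:self.top_k]
            weights = softmax([scores[i] for i in idx])
            out.append([sum(w * values[i][d] for w, i in zip(weights, idx))
                        for d in range(len(values[0]))])
        return out

# Because the call signatures match, the swap is a one-line change:
layer = MemorySparseAttention(top_k=2)  # instead of SelfAttention()
```

The point of the sketch is the interface, not the math: when the sparse layer answers the same (queries, keys, values) call, nothing upstream or downstream of the attention layer needs to change.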
The team has written a comprehensive technical article regarding the details, so we won't elaborate further here.
The key technical highlights are as follows:
Compression brings the storage footprint of 100 million tokens down to an acceptable range.
GPU handles routing indexes while CPU stores content details, making total capacity dependent on system memory (RAM) rather than VRAM.
Sparse routing reduces complexity from O(L²) to O(L).
Each document's positional encoding is numbered independently, so a model trained at 64K context can extrapolate to 100M tokens.
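The highlights above can be sketched together as a conceptual toy: a compact routing index that is scanned once per query (linear in the number of stored blocks), a separate content store that holds the bulky payloads, and per-document position numbering. All names (`MemoryStore`, `per_document_positions`) are invented for illustration; in the real system the compact keys would live in VRAM and the payloads in system RAM.

```python
# Conceptual toy, not the MSA implementation.

class MemoryStore:
    """Split storage: small routing keys in a fast index ("VRAM"),
    full token blocks in a large key-value store ("RAM")."""
    def __init__(self, top_k=2):
        self.top_k = top_k
        self.routing_keys = []   # compact vectors, scanned on every read
        self.contents = {}       # bulky payloads, touched only on a hit

    def write(self, key_vec, block):
        self.contents[len(self.routing_keys)] = block
        self.routing_keys.append(key_vec)

    def read(self, query_vec):
        # One pass over the compact keys: cost grows linearly with the
        # number of stored blocks, and only top_k payloads are fetched.
        scores = [sum(a * b for a, b in zip(query_vec, k))
                  for k in self.routing_keys]
        best = sorted(range(len(scores)),
                      key=lambda i: -scores[i])[:self.top_k]
        return [self.contents[i] for i in best]

def per_document_positions(doc_lengths):
    # Positions restart at 0 for each document, so no absolute position
    # ever exceeds the longest single document, no matter how large the
    # total corpus grows.
    return [list(range(n)) for n in doc_lengths]

store = MemoryStore(top_k=1)
store.write([1.0, 0.0], "notes about chapter 47")
store.write([0.0, 1.0], "unrelated shopping list")
hits = store.read([0.9, 0.1])
```

The design point the split illustrates: total memory capacity is bounded by the size of the cheap content store, while the expensive fast index only has to hold one small key per block.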
How Powerful Is It?
No matter how much we discuss architectural design, data ultimately provides the most intuitive proof.
Built on Qwen3-4B and continually pre-trained on 159B tokens, MSA has several core strengths:
It doesn’t just remember more; it remembers accurately.
Scaling from roughly 10,000 tokens to 100 million tokens spans four orders of magnitude, yet MSA's response quality dropped by less than 9%.
To put it in perspective: while others start forgetting the beginning after finishing one book, MSA can read 100 epic novels and still accurately tell you details from Chapter 47 of the 3rd volume.
Small Model Beats Large Model.
In 9 standard Q&A tests, a 4-billion parameter MSA model outperformed traditional RAG solutions by an average of 16%.
Even more impressive, when pitted against a top-tier industry retriever combined with a massive 235-billion parameter model, MSA still won in multiple tests.
With a parameter difference of nearly 60x, the results were actually better. This makes it clear: when it comes to AI memory, choosing the right architecture is far more critical than simply scaling up the model size.
Extremely Low Hardware Barrier.
The project can run directly on a single machine equipped with two A800 GPUs. No cluster is needed, nor is any specialized hardware.
This means that starting now, small teams and even individual developers have the opportunity to utilize long-term memory capabilities on the scale of hundreds of millions of tokens.
Team Background and Development History
MSA comes from EverMind (under Shanda Group). The team previously created Omne, a multi-agent framework that achieved SOTA on the GAIA leaderboard, as well as the open-source memory platform EverOS.
While deploying Omne into real-world business scenarios, they discovered that an Agent's memory loss isn't a problem solvable at the framework level—it requires intervention at the model's foundation.
From project inception to paper completion, the process took over nine months and was far from smooth sailing.
The first version of the model performed poorly on tasks the team thought would be simple, leading to doubts about whether the direction itself was wrong.
The turning point came from a key insight: the information a model needs when “finding materials” differs from what it needs when “writing answers.”
Finding materials requires macro judgment: which part of this pile of documents relates to my question?
Writing answers requires micro details: which specific sentence answers my question?
Early versions forced a single mechanism to handle both tasks, resulting in poor performance on both fronts.
Once these two functions were separated, each handled by a dedicated module and paired with a more suitable training strategy, performance saw a qualitative leap.
The paper also candidly acknowledges current limitations: in scenarios requiring complex associations across multiple documents for deep reasoning, the pure internal memory solution still faces challenges.
This honesty and respect for technical boundaries actually inspires more confidence in the team's judgment and the project's long-term prospects.
Final Thoughts
If EverMind's technical theory can be truly implemented, many of the problems facing the AI industry today may be properly resolved.
From the moment true long-term memory is achieved, AI assistants will finally begin to truly know you.
It will remember your dietary preferences mentioned three months ago, the project progress you discussed last week, your child's personality, and your weekend travel preferences.
You won't need to repeat yourself every time; it will simply remember.
If this direction matures, AI partners with lifelong memory will no longer be science fiction.
Moreover, the product possibilities unlocked by long-term memory capabilities are vast.
Truly personalized AI education, medical assistants that track a patient's complete medical history, and enterprise knowledge bases that remember a decade of project accumulation.
Product forms that struggle today because models can't remember could become reality thanks to breakthroughs in the memory layer.
Finally, the MSA direction naturally leads to an imaginative possibility: Memory as a Service.
The memory layer can serve as an independent, pluggable module combined freely with various large models.
This means a user's memory assets are no longer locked into any single model or vendor.
In other words, you can switch models at any time, but your memory will always follow you.
I believe this may well become the next critical infrastructure direction for the AI industry.
Currently, the paper has been released, the code is open-sourced, and a model release is forthcoming.
Those interested in this cutting-edge technology are encouraged to give it a Star on GitHub and follow the latest open-source progress.
That concludes today's sharing. Thank you for taking the time to read. See you in the next issue. Respect!