Editor | Listening to Rain
Waking up to find DeepSeek has released yet another new paper!
A close look at the author list shows Liang Wenfeng's name among the contributors.
This paper is titled "Conditional Memory via Scalable Lookups: A New Dimension of Sparsity for Large Language Models", and it focuses on proposing Engram — a conditional memory module designed to enhance the Transformer backbone by structurally separating static pattern storage from dynamic computation.
The experimental data provided in the paper is also quite stunning:
1. Engram brings significant performance improvements on knowledge, reasoning, code, and math tasks, all outperforming pure MoE models.
2. There is a U-shaped scaling law: Pure MoE performance is suboptimal, while allocating 20–25% of sparse parameters to Engram yields the best results.
3. Long context capabilities improve significantly, freeing up attention for global patterns and complex reasoning.
Both code and the full paper have been open-sourced:
Paper Address: https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf
Code Address: https://github.com/deepseek-ai/Engram
Why do Large Language Models need Engram?
Sparsity has long been a core design principle for intelligent systems, from the neural circuits of the biological brain to modern large language models, which use it to conserve resources.
In AI, this idea is most commonly embodied in Mixture of Experts (MoE) models — which use "conditional computation" to activate only a portion of parameters, thereby multiplying model capacity without significantly increasing computation. MoE is currently one of the key technologies driving parameter scale and capability expansion, and DeepSeek's own series of models (such as DeepSeek V2, DeepSeek V3, etc.) also employ advanced MoE methods for scaling training.
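The "conditional computation" idea behind MoE can be sketched in a few lines: a router scores every expert but activates only the top-K per token, so compute stays roughly constant while total parameters scale with the expert count. This is an illustrative toy, not DeepSeek's actual router; all names and sizes here are assumptions.

```python
import math
import random

random.seed(0)

E, K, DIM = 8, 2, 4  # illustrative: 8 experts, top-2 routing, 4-dim input
# Toy "experts": each is just a weight vector used for routing scores.
experts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(E)]

def route(x):
    """Score every expert, keep only the top-K, softmax-normalize their
    scores. Only these K experts would run their forward pass."""
    scores = [sum(w * xi for w, xi in zip(expert, x)) for expert in experts]
    top = sorted(range(E), key=lambda e: scores[e], reverse=True)[:K]
    z = [math.exp(scores[e]) for e in top]
    s = sum(z)
    return [(e, w / s) for e, w in zip(top, z)]

chosen = route([1.0, 0.5, -0.5, 0.0])
print(len(chosen))  # only K of the E experts are activated for this token
```

The key property: the parameter count grows with `E`, but per-token compute grows only with `K`.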
But MoE has its limitations too. Language itself is very complex, involving at least two distinct types of tasks:
1. Compositional Reasoning: Requires deep, dynamic neural computation, such as understanding complex sentence structures or reasoning through problems;
2. Knowledge Retrieval: A vast amount of text consists of highly fixed, repetitive content, such as named entities, fixed expressions, and stylized patterns.
The paper proposes that classic N-gram models have already proven that for handling this kind of local, repetitive linguistic regularity, using a "lookup table" is the most efficient method, requiring almost no deep neural network involvement.
However, current Transformers do not have this native "lookup capability". Therefore, every time the model identifies a common multi-token entity, it consumes several layers of attention and feed-forward networks. This is like rebuilding a static dictionary repeatedly at runtime, which wastes computation and occupies the model's "sequence depth" that could otherwise be used for higher-level reasoning.
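The lookup-table argument from classic N-gram models can be made concrete in a few lines: once a table is built, retrieving the continuation of a fixed multi-token pattern is a single O(1) dictionary access, with no deep network involved. This is a hypothetical toy illustration, not the paper's implementation.

```python
from collections import Counter, defaultdict

def make_ngram_table(corpus_tokens, n=3):
    """Map each n-gram key to the token that most often follows it."""
    counts = defaultdict(Counter)
    for i in range(len(corpus_tokens) - n):
        key = tuple(corpus_tokens[i:i + n])
        counts[key][corpus_tokens[i + n]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# Toy corpus where the entity "new york city" recurs as a fixed pattern.
tokens = ["new", "york", "city", "is", "big", ",", "new", "york", "city", "has"]
table = make_ngram_table(tokens, n=2)
# Completing a common entity is a constant-time hash lookup:
print(table[("new", "york")])  # "city"
```

A Transformer without such a table re-derives the same completion through attention and feed-forward layers on every occurrence, which is the waste the paper points at.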
How is Engram implemented?
To address the above issues, DeepSeek proposed a new direction of sparsity — conditional memory, specifically designed to store and retrieve fixed knowledge. It is completely complementary to MoE's conditional computation:
- MoE is responsible for dynamic reasoning and compositional logic;
- Engram is responsible for static knowledge, which can be looked up directly.
"Engram" is a term borrowed from neuroscience, where it means "memory trace". Here it names a scalable, searchable memory module that lets a language model retrieve, during inference, patterns or fragments it may have seen before.
In terms of implementation, the Engram module separates static pattern storage from dynamic computation via O(1) lookups, built on four core techniques: modernized hashed N-gram embeddings, tokenizer compression, context-aware gating, and multi-branch fusion.
Specifically:
1. Tokenizer Compression: Pre-compute a mapping function to collapse semantically equivalent but ID-different tokens (such as "Apple" and "apple") into a unified identifier, reducing the effective vocabulary size by 23%.
2. Hash Retrieval: Use local context (N-grams) as keys to retrieve static vectors from a huge embedding table via hash functions.
3. Context-Aware Gating: This is Engram's key innovation. It uses the hidden state of the current layer as a Query to semantically match with the retrieved memory. If the retrieved content contradicts the context, the gating value approaches zero, thereby suppressing noise from hash collisions.
4. Hybrid Branch Integration: Optimized specifically for multi-branch architectures (like mHC), balancing expression capability and computational efficiency through parameter sharing strategies (sharing Embedding tables and Value projections while keeping independent Key projections).
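Steps 2 and 3 can be sketched together: hash an N-gram key into a fixed-size embedding table, then let the current hidden state act as a query whose agreement with the retrieved vector opens or closes a sigmoid gate. Everything below (table size, dimensions, the md5 hash, the dot-product gate) is an illustrative assumption, not the paper's actual design.

```python
import hashlib
import math
import random

random.seed(0)

DIM, TABLE_SIZE = 8, 1024  # illustrative sizes only
# Fixed-size embedding table; hash collisions are possible by design.
emb_table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def ngram_slot(tokens):
    """Step 2: hash an n-gram key into a table slot (O(1) retrieval)."""
    digest = hashlib.md5(" ".join(tokens).encode()).hexdigest()
    return int(digest, 16) % TABLE_SIZE

def gated_memory(hidden, tokens):
    """Step 3: context-aware gating. The hidden state acts as a query;
    if the retrieved memory disagrees with the context, the sigmoid
    gate heads toward 0 and the (possibly collided) memory is muted."""
    mem = emb_table[ngram_slot(tokens)]
    score = sum(h * m for h, m in zip(hidden, mem)) / math.sqrt(DIM)
    gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in (0, 1)
    return [gate * m for m in mem], gate

# A context aligned with the memory opens the gate; a clashing one closes it.
hidden = emb_table[ngram_slot(("new", "york"))]
_, gate_match = gated_memory(hidden, ("new", "york"))
_, gate_clash = gated_memory([-h for h in hidden], ("new", "york"))
print(gate_match > 0.5 > gate_clash)  # True
```

The gate is what makes cheap hashing viable: collisions land on vectors that rarely match the live context, so their contribution is suppressed rather than injected as noise.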
Engram is usually inserted into the earlier layers of the Transformer, such as Layer 2 or Layer 6. The benefit of doing this is: on one hand, it offloads the work of reconstructing static patterns, reducing the burden on the backbone network; on the other hand, it retains enough context information, allowing the gating mechanism to more intelligently determine which memories to use and which to ignore.
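Conceptually, the placement looks like a residual branch after selected early layers. The forward pass below is a hypothetical sketch with toy stand-ins (`embed`, `make_layer`, `toy_engram` are all invented for illustration), not the paper's code.

```python
def embed(token_ids):
    # Toy embedding: one scalar "hidden state" per token.
    return [float(t) for t in token_ids]

def make_layer(scale):
    # Toy Transformer layer: a simple elementwise transform.
    return lambda h: [scale * x for x in h]

def toy_engram(h, token_ids):
    # Stand-in for the gated memory readout: a small residual vector.
    return [0.1 for _ in h]

def forward(token_ids, layers, engram, engram_layers=(2,)):
    """Hypothetical forward pass: after selected early layers, the
    hidden state queries the Engram memory and the gated result is
    added residually, so deeper layers are freed from reconstructing
    static patterns."""
    h = embed(token_ids)
    for idx, layer in enumerate(layers):
        h = layer(h)
        if idx in engram_layers:
            h = [a + b for a, b in zip(h, engram(h, token_ids))]
    return h

out = forward([1, 2], [make_layer(1.0)] * 4, toy_engram)
```

Placing the branch early (rather than at the output) is what lets the gating see enough context while still relieving the bulk of the network above it.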
Engram's memory capacity is not simply "the bigger the better"; it must be carefully matched against the MoE expert capacity. Following a sparsity-allocation rule, the parameter budget is split between the two so that parameter utilization stays high and computation stays efficient — simply put, so that every bit of memory and every expert contributes as much as possible.
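The allocation rule reduces to simple arithmetic. The 20–25% band comes from the paper's reported U-shaped optimum; treating the 27B scale as the total sparse-parameter budget is an assumption made purely for illustration.

```python
# Toy sparsity-allocation arithmetic. The 20-25% band is the paper's
# reported optimum; the 27B total budget is assumed for illustration.
total_sparse = 27_000_000_000   # assumed total sparse-parameter budget
engram_frac = 0.225             # midpoint of the reported 20-25% band

engram_params = int(total_sparse * engram_frac)
moe_params = total_sparse - engram_params
print(f"Engram: {engram_params / 1e9:.2f}B, MoE experts: {moe_params / 1e9:.2f}B")
```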
Experimental results are stunning:
Significant improvements in reasoning, code, and long context capabilities
The paper scaled Engram to 27 billion parameters, strictly aligning parameter counts and FLOPs with the MoE baseline. The results show:
- Knowledge-intensive tasks (MMLU, CMMLU, MMLU-Pro): Performance improved by 1.8–4.0 points;
- General reasoning tasks (BBH, ARC-Challenge, DROP): Improvements are more pronounced, up to +5 points;
- Code and math capabilities (HumanEval, MATH, GSM8K): Average improvement of 2–3 points.
Notably, Engram significantly outperforms pure MoE models in knowledge-intensive tasks. The reason is intuitive: it delegates the memory of static patterns to an efficient lookup mechanism instead of "recalculating" with the neural network every time, reducing repetitive computation in the shallow layers.
More importantly, Engram also significantly extends long context capabilities, performing outstandingly in long-text tasks (such as LongPPL, RULER), especially in scenarios like multi-hop retrieval and chain-of-thought reasoning. For instance, the Multi-Query NIAH metric improved from 84.2 to 97.0, and Variable Tracking improved from 77.0 to 89.0.
The reason is that Engram handles a large amount of local and static dependencies, freeing up the attention mechanism to process global context, resulting in greater stability and accuracy in long sequences.
In addition, the team also discovered a U-shaped scaling law in the capacity allocation between MoE and Engram:
- When Engram memory capacity is too small or too large, performance is not ideal;
- Allocating 20–25% of sparse parameters to Engram yields the best results.
Netizens: Engram might be the foundational technology for the DeepSeek-V4 model!
On platforms like Reddit and X, DeepSeek's new paper immediately sparked heated discussions among netizens.
The most widespread guess is: Engram might be the foundational technology for the upcoming DeepSeek-V4.
Many netizens believe that Engram is an interesting method, characterized by separating the responsibilities of "memory pattern lookup" and "neural computational reasoning" within the model architecture, thereby opening up a new direction of sparsity.
Some netizens also stated that this method is much better than linear attention mechanisms.
DeepSeek's late-night release also led some netizens to exclaim: the inventiveness of Chinese large model teams is truly astonishing.
So, what do you all think about DeepSeek's new technology?
Welcome to leave your views in the comments section.