In recent years, the real bottleneck constraining the large model industry has shifted from parameter scale to inference efficiency.
As models grow larger, the pressure on compute and memory increasingly resembles a wall standing before deployment.
Especially as applications evolve from simple conversations to long document understanding, codebase-level analysis, and long-horizon agent tasks, the bottlenecks of traditional architectures are becoming increasingly exposed.
Many teams are optimizing training techniques, compressing weights, and performing distillation, yet few dare to modify the underlying attention structure directly, as that means re-examining the entire Transformer paradigm.
ModelBest's recent unveiling of the Sparse-Linear hybrid Attention architecture (SALA), and the corresponding model MiniCPM-SALA, offers a different answer at the core architectural level.
Performance gains from a new model are no longer surprising in themselves; what makes this release notable is that it amounts to a recalibration at the level of the architectural route itself.
With the New Year approaching, while many companies are busy with red envelope promotions and marketing tactics, I believe ModelBest's solid technical progress deserves more attention.
01. The Explosion of Long-Context Demands Is Forcing Attention Mechanisms to Evolve
Early commercial scenarios for large models focused on Q&A, writing, and summarization, with context lengths typically ranging from thousands to tens of thousands of tokens—manageable for traditional full attention mechanisms.
As application patterns evolve, models are now tasked with codebase analysis, contract review, scientific literature synthesis, and long-chain agent task planning...
Input scales have jumped directly to hundreds of thousands or even millions of tokens, with such demands growing simultaneously in enterprise and edge-side environments.
Enterprises want models that can read an entire knowledge base or codebase in one pass while maintaining a consistent understanding. Edge devices, meanwhile, store vast amounts of private user data, such as chat histories, behavioral logs, and location trajectories; this information must be processed locally to meet privacy requirements, making on-device long-context capability an increasingly essential metric.
The problems of traditional Transformers are magnified enormously in such scenarios.
The computational complexity of full attention mechanisms grows quadratically with sequence length—ten times the sequence length means roughly a hundredfold increase in compute requirements, while the KV Cache expands proportionally.
Memory usage during inference continues to climb; even with constant model parameters, longer contexts will directly cause out-of-memory errors—this is the KV Cache dilemma that many engineering teams speak of.
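To make the scaling concrete, here is a back-of-envelope KV Cache calculation for a hypothetical dense full-attention model. The layer count, KV-head count, and head dimension below are illustrative assumptions, not any particular model's configuration:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each layer caches K and V: seq_len * n_kv_heads * head_dim values apiece,
    # at dtype_bytes per value (2 bytes for fp16/bf16).
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# The cache grows linearly with context length
for tokens in (32_000, 256_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {kv_cache_bytes(tokens) / 1024**3:.1f} GB")
```

Under these assumed dimensions, a 1M-token context needs over 100 GB of cache before a single weight is loaded, which is exactly the out-of-memory cliff described above.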
Together, these dilemmas have long formed an impossible triangle: compute cost, memory footprint, and model quality could not all be satisfied at once.
The industry has tried multiple paths to solve this, such as linear attention, state-space models, and sparse attention structures—each performing well in certain dimensions yet suffering from obvious shortcomings.
Linear attention and SSM-based methods reduce complexity to linear levels and perform well on speed and resource consumption. However, these structures must compress all history into a fixed-capacity state: the longer the sequence, the more early information is diluted, leading to memory decay in complex reasoning or long-chain logical scenarios.
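A minimal sketch shows why the fixed-capacity state causes this. In (unnormalized) linear attention, the entire history is folded into a single d_k-by-d_v matrix, so per-token cost is constant but early tokens are never stored individually. This is a generic linear-attention recurrence for illustration, not Lightning Attention's actual formulation:

```python
import numpy as np

d_k, d_v = 4, 4
state = np.zeros((d_k, d_v))  # fixed size, no matter how long the sequence gets

def write(state, k, v):
    # Fold a new (key, value) pair into the running state: S <- S + k v^T.
    # Every past token is mixed into the same matrix, so individual early
    # tokens cannot be recovered exactly -- the root of "memory decay".
    return state + np.outer(k, v)

def read(state, q):
    # Per-token readout costs O(d_k * d_v), independent of sequence length
    return q @ state

rng = np.random.default_rng(0)
for _ in range(1000):  # a "long" sequence; the state stays 4x4 throughout
    state = write(state, rng.standard_normal(d_k), rng.standard_normal(d_v))
print(state.shape)  # -> (4, 4)
```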
Sparse attention takes a different route.
By computing attention only at key positions to reduce computational demands, inference speed improves significantly. However, historical KV must be preserved completely; otherwise, long-range dependencies cannot be traced back.
This causes memory to still grow linearly with sequence length, failing to fundamentally resolve storage pressure.
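The trade-off is visible in a toy block-sparse routine: per-query compute drops because exact attention runs only inside a few selected blocks, yet K and V for the whole history must stay resident for the selection step. Scoring blocks by their mean key is a deliberate simplification here; InfLLM v2's actual selection rule differs:

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=64, top_k=4):
    # Cheap stage: score each KV block by its mean key, keep only the top_k blocks.
    n_blocks = len(K) // block_size
    block_means = K[: n_blocks * block_size].reshape(n_blocks, block_size, -1).mean(axis=1)
    chosen = np.argsort(block_means @ q)[-top_k:]
    # Expensive stage: exact softmax attention inside the chosen blocks only.
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in chosen])
    scores = K[idx] @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ V[idx]

rng = np.random.default_rng(0)
K, V = rng.standard_normal((4096, 64)), rng.standard_normal((4096, 64))
out = block_sparse_attention(rng.standard_normal(64), K, V)
print(out.shape)  # -> (64,)
```

Note that `K` and `V` are the full 4096-token history even though only 256 positions are attended to, which is precisely the linear memory growth described above.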
The industry has been searching for a structure that retains the efficiency of linear architectures while possessing the precise perception capabilities of sparse structures.
The Transformer-to-Hybrid low-cost construction method (HALO) used by MiniCPM-SALA
02. SALA's Hybrid Attention Architecture
SALA's core design concept is straightforward: splice the advantages of both attention types into a single architecture, letting different modules handle different tasks.
Approximately 75% of the overall structure uses Lightning Attention linear modules, responsible for capturing local key semantics;
The remaining 25% uses InfLLM v2 sparse attention modules, responsible for global information modeling.
This ratio was determined after multiple rounds of experimental tuning, aiming to find a stable balance between efficiency and accuracy.
The linear portion ensures stable growth in inference complexity, while the sparse portion handles fine-grained modeling of high-value information; together they form a complete context understanding pathway.
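As a rough mental model of the layout (the exact interleaving pattern is not public; the 1-in-4 spacing below is purely an assumption that matches the stated 75/25 ratio):

```python
def sala_layer_plan(n_layers=32, sparse_every=4):
    # ~25% sparse (InfLLM v2) layers for global modeling, interleaved with
    # ~75% linear (Lightning Attention) layers for efficient local modeling.
    return ["sparse" if (i + 1) % sparse_every == 0 else "linear"
            for i in range(n_layers)]

plan = sala_layer_plan()
print(plan.count("sparse") / len(plan))  # -> 0.25
```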
This design truly addresses the structural contradictions in long-sequence tasks.
Pure linear models tend to lose details when sequences become extremely long; pure sparse models see continuously expanding memory pressure as sequences grow. SALA separates computational density from information density, concentrating computational resources on important regions while ensuring global state traceability.
In other words, when processing million-token inputs, the model does not apply equal computational intensity to all tokens simultaneously; instead, it automatically allocates compute weights, significantly improving resource utilization efficiency.
Another key innovation comes from the training method HALO.
Training traditional hybrid architectures from scratch would be prohibitively expensive, as new structures need to relearn all linguistic knowledge and reasoning capabilities.
HALO's strategy is to perform structural transformation based on existing full-attention models, followed by continued training. This approach inherits the original model's capabilities while allowing the new architecture to gradually adapt to new attention patterns.
From an engineering perspective, this approach reduces computational investment to acceptable ranges, making hybrid attention models feasible for large-scale training—a practical reference value for the entire industry.
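In spirit, a HALO-style conversion looks like the sketch below: keep every weight whose module survives the transformation, reinitialize only the replaced attention blocks, then continue training so the new modules adapt. The weight-dict format, parameter names, and layer split here are all hypothetical; see the paper for the actual recipe:

```python
def halo_convert(full_attn_weights, layer_plan):
    """Partition a full-attention checkpoint for hybrid continued training.

    full_attn_weights: {layer_idx: {param_name: tensor}} (assumed format)
    layer_plan: per-layer list of "linear" / "sparse" (assumed format)
    """
    inherited, to_reinit = {}, []
    for layer_idx, kind in enumerate(layer_plan):
        for name, w in full_attn_weights.get(layer_idx, {}).items():
            if kind == "linear" and name.startswith("attn."):
                # Full attention replaced by a linear module: fresh init needed
                to_reinit.append((layer_idx, name))
            else:
                # MLP / norm weights (and sparse-layer attention, which stays
                # structurally close to full attention) carry over directly
                inherited[(layer_idx, name)] = w
    return inherited, to_reinit
```

Because most parameters land in `inherited`, continued training only has to teach the reinitialized modules, which is what keeps the compute budget far below a from-scratch run.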
If we observe current mainstream architectures in the same coordinate system, we can see a clear evolutionary path:
Full-attention models provide stable intelligence levels; linear models provide ultimate efficiency; sparse models provide long-sequence capabilities; SALA attempts to merge all three routes into a unified solution.
Solutions like Kimi KDA, DeepSeek NSA, and InfLLM v2 are each advancing long-context capability in a different direction; SALA's emergence lets the industry see, for the first time, the possibility of a single unified structure combining the advantages of all of them.
Related paper references:
Sparse-Linear Hybrid Attention: https://arxiv.org/pdf/2601.22156
InfLLM v2: https://arxiv.org/pdf/2509.24663
03. MiniCPM-SALA Benchmark Results Validate Architectural Feasibility
Any architectural innovation must ultimately return to model performance.
As the first text model based on this structure to complete large-scale training, MiniCPM-SALA provides relatively clear data results across multiple dimensions.
With a parameter scale of 9B—placing it in the lightweight range—it can support million-token context inference.
The key is that KV Cache is controlled below 6GB, meaning ordinary consumer-grade GPUs can complete inference tasks. For developers, this directly lowers deployment barriers, moving long-context models from data-center-exclusive capabilities to the realm of personal hardware.
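One intuition for why the cache stays bounded: only the roughly 25% of layers using sparse attention keep a token-proportional KV cache, while each linear layer carries a small fixed-size state. The dimensions below are illustrative assumptions (a GQA-style configuration with 2 KV heads), not MiniCPM-SALA's published figures, and the remaining gap to the reported sub-6GB number would come from the model's actual head configuration and cache precision, neither of which is assumed here:

```python
# All numbers are illustrative assumptions, not MiniCPM-SALA's real dimensions
n_layers, sparse_share = 32, 0.25
n_kv_heads, head_dim, dtype_bytes = 2, 128, 2
tokens = 1_000_000

full_kv = 2 * tokens * n_layers * n_kv_heads * head_dim * dtype_bytes
hybrid_kv = full_kv * sparse_share  # only the sparse layers keep per-token K/V
print(f"all full attention: {full_kv / 1024**3:.1f} GB")
print(f"hybrid, 25% sparse: {hybrid_kv / 1024**3:.1f} GB")
```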
In long-text benchmarks, the model demonstrates stable advantages, particularly excelling in tasks such as cross-passage information integration, long-chain reasoning, and code structure understanding.
More notably, it maintains performance levels comparable to same-scale full-attention models in conventional capability tests such as knowledge Q&A, mathematical reasoning, and code generation, showing no performance degradation—indicating that the hybrid structure does not sacrifice general intelligence.
The model also introduces the HyPE hybrid positional encoding mechanism, ensuring consistent processing capabilities for both short and long texts without performance degradation on short inputs.
Data regarding inference speed holds significant value for engineering deployment.
In cloud inference chip testing, when context length reaches 256K tokens, MiniCPM-SALA achieves inference speeds approximately 3.5 times faster than same-scale full-attention models, without using speculative sampling or additional acceleration tricks—the results come entirely from the architecture itself.
For enterprise deployment, such performance improvements mean direct cost reductions, as the number of requests processed per unit time increases significantly.
Edge-side performance is similarly impressive.
Currently, many 8B-scale models hit memory limits when running 256K contexts, while MiniCPM-SALA completes million-token context inference on consumer-grade GPUs, opening space for general models to run on phones, automotive systems, robots, and other terminal devices.
Once edge-side models possess long-context capability, personal assistant products can continuously read a user's historical data and maintain long-term memory, a qualitative shift in user experience; the industry regards this as a key metric in next-generation smart terminal competition.
04. The Battle of Architectural Routes Is Becoming a Core Variable in Large Model Competition
Early competition in the large model industry focused on parameter scale and training data volume, then shifted to inference costs and deployment efficiency; now the focus is gradually falling on underlying structural design.
Whoever can find better solutions at the architectural level can achieve higher performance under equivalent compute conditions.
The emergence of SALA also points to a new possibility: future model competition will depend not only on who has the larger model, but on who has the better-designed structure.
I think this change may produce domino-like chain reactions for the industry landscape.
Changes in hardware adaptation logic:
When a model's memory requirements fall, the range of viable deployment environments widens: high-end data-center GPUs are no longer the only option, and edge computing devices can take on more of the workload.
Furthermore, this will also bring about a reshuffling of the application ecosystem:
Once long-context capabilities mature, many applications originally dependent on databases or retrieval systems may be handled directly by models, as models can read all materials at once and generate results.
Additionally, training strategies will change; migration training methods like HALO reduce the cost of experimenting with new architectures, enabling more teams to attempt underlying innovations.
Observing from technical trends, attention mechanisms are likely entering a hybridization phase.
Single routes struggle to satisfy efficiency, accuracy, and scalability simultaneously; multi-structure collaboration will become the mainstream design direction.
Future models may dynamically switch attention modes based on tasks—using high-precision modules for complex reasoning and high-efficiency modules for large-scale scanning—with such adaptive structures becoming research priorities.
Finally, some good news: ModelBest, OpenBMB, SGLang, and NVIDIA have jointly launched a competition.
The competition is called SOAR 2026 Sparse Operator Acceleration Grand Prix, and registration is now officially open.
The event focuses on optimizing inference performance for hybrid attention architectures, with key directions including operator fusion, compilation optimization, and hardware co-scheduling, aiming to further compress resource usage and improve million-token inference speeds on consumer-grade GPUs.
It is also open to global developers; the official registration portal: https://soar.openbmb.cn/
For engineering teams focused on inference efficiency, system optimization, and model architecture, this is an opportunity to directly participate in defining the next-generation inference baseline.
The special bounty prize even reaches 280,000 RMB!
Friends who are interested should really give it a try.
Reference Reading:
GitHub: https://github.com/openbmb/minicpm
HuggingFace: https://huggingface.co/openbmb/MiniCPM-SALA
Model Scope: https://www.modelscope.cn/models/OpenBMB/MiniCPM-SALA
GitCode: https://ai.gitcode.com/OpenBMB/MiniCPM-SALA
MiniCPM-SALA Technical Report: https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf