In the AI community, the Scaling Law is almost regarded as gospel: the larger the model, the stronger the performance. However, this strength comes at a significant cost. Models with tens or even hundreds of billions of parameters not only drive inference costs, such as VRAM requirements and latency, into the stratosphere; they also make energy consumption and environmental footprint impossible to ignore.
For a long time, researchers have been fixated on the most resource-heavy component of the Transformer architecture: the Feed-Forward Network, or FFN. In modern Large Language Models, the FFN accounts for over two-thirds of the parameters and more than 80% of the total compute.
Interestingly, the biological brain is highly energy-efficient, with only a small fraction of neurons active at any specific moment. Large models actually possess similar potential. In models utilizing the ReLU activation function, only a tiny subset of neurons is triggered for any given input.
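This effect is easy to see in a toy sketch. The snippet below is illustrative only, not the paper's code: the dimensions are made up, and the negative bias stands in for what training actually learns, shifting most pre-activations below zero so the ReLU silences them.

```python
import numpy as np

# Toy sketch: count how many FFN neurons a ReLU actually fires.
# The "- 2.0" bias mimics the learned shift toward sparsity; all
# dimensions and values here are invented for illustration.
rng = np.random.default_rng(0)
d_model, d_ff = 64, 256

x = rng.normal(size=(8, d_model))                      # a batch of token vectors
W_in = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
h = np.maximum(x @ W_in - 2.0, 0.0)                    # ReLU with a negative shift

active_fraction = (h > 0).mean()
print(f"fraction of active neurons: {active_fraction:.3f}")
```

With the shift in place, only a few percent of neurons fire per input, mirroring the brain-like sparsity the paragraph describes.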
This leads to a dilemma that has plagued the industry for years, often referred to as the Sparse Paradox. If theory dictates that the vast majority of computations are effectively zero, why do sparse operators frequently run slower on GPUs than dense ones?
Recently, Sakana AI and NVIDIA addressed this challenge head-on in their latest research paper, Sparser, Faster, Lighter Transformer Language Models.
The team not only demonstrated that models can achieve over 99% sparsity without performance loss, but they also designed a new data format called TwELL at the underlying CUDA kernel level. This finally translates theoretical sparsity into tangible acceleration.
1. Why is Sparsity Slow on GPUs?
To appreciate the value of this innovation, one must first understand where traditional methods fail.
The core advantage of GPUs lies in extreme parallel computing, an architecture designed for regular, dense tasks like matrix multiplication, commonly known as GEMM. Traditional sparse formats, such as ELLPACK, require recording the indices and positions of non-zero elements when processing sparse matrices.
Because hardware and software stacks have been heavily optimized for dense computation patterns, irregular workloads, together with the overhead of materializing and managing sparse indices, have long been the key obstacles preventing theoretical sparsity from translating into real compute savings.
In gated FFNs, the structure adopted by models like Llama, the sparsity pattern is dictated by the activation values of the Gate layer. To use traditional sparse operators, the system must first run the Gate, identify non-zero elements, rearrange indices, and only then execute the subsequent matrix multiplications. This rearrangement introduces conversion overhead that often exceeds the compute time it saves. In essence, GPUs waste too much time waiting for instructions and shuffling fragmented data.
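The gather-then-compute pattern described above can be sketched in a few lines. This is a conceptual NumPy illustration, not the paper's kernel: the dimensions and the activation threshold are invented, and the point is simply that the sparse path must materialize an index list (`np.nonzero`) before any savings can be realized.

```python
import numpy as np

# Conceptual sketch of a gated FFN's sparse path: find the active
# neurons first, then restrict the Up and Down projections to them.
rng = np.random.default_rng(1)
d_model, d_ff = 32, 128
x = rng.normal(size=(d_model,))
W_gate = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
W_up   = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
W_down = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)

gate = np.maximum(x @ W_gate - 1.5, 0.0)     # ReLU gate: mostly zeros
idx = np.nonzero(gate)[0]                    # step 1: materialize indices (the overhead)
up = x @ W_up[:, idx]                        # step 2: compute only active columns
y_sparse = (gate[idx] * up) @ W_down[idx]    # step 3: sparse down-projection

# dense reference: computes every column, zeros included
y_dense = (gate * (x @ W_up)) @ W_down
```

The two results are numerically identical; the question the paper tackles is making step 1 cheap enough on a GPU that steps 2 and 3 actually win.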
2. TwELL Format: A Puzzle Designed for GPU Tiles
To break this paradox, the authors introduced TwELL, or Tile-wise ELLPACK.
This represents a highly sophisticated engineering adaptation. Since GPUs prefer processing tasks in blocks or tiles, the approach restricts sparsity management within each tile. Instead of attempting global compression and index rearrangement for the entire matrix, TwELL independently gathers non-zero elements within each individual tile.
The core advantage of this design lies in Operator Fusion:
The sparse activations can be materialized directly in TwELL format at the end of the same CUDA kernel that executes the Gate matrix multiplication.
No global synchronization or intermediate memory reads and writes are required.
Subsequent Up and Down projection operators can be fused into the same pipeline, reading these locally aligned sparse data directly.
Put simply, TwELL sorts and packs parts directly on the production line instead of shutting down the assembly line to sort everything after production. This tile-level local operation aligns perfectly with the hardware architecture of modern NVIDIA GPUs.
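The tile-local packing idea can be illustrated with a small NumPy sketch. This is inspired by, not taken from, TwELL: the tile width and matrix sizes are made up, and a real ELLPACK-style layout would additionally pad each tile to a fixed capacity so the GPU sees uniform work. The key point survives, though: each tile compacts its own non-zeros independently, so no global index pass is needed.

```python
import numpy as np

# Toy tile-wise packing: gather non-zeros within each tile of each row,
# keeping (values, column-indices) per tile instead of per matrix.
rng = np.random.default_rng(2)
TILE = 4
acts = rng.normal(size=(2, 16))
acts[acts < 1.0] = 0.0                       # sparsify: keep only large values

values, indices = [], []
for row in acts:
    for t in range(0, row.size, TILE):
        tile = row[t:t + TILE]
        nz = np.nonzero(tile)[0]
        values.append(tile[nz])              # compact values for this tile
        indices.append(nz + t)               # global column indices for this tile

# round-trip check: scatter the packed values back into a dense matrix
recon = np.zeros_like(acts)
tile_rows = np.repeat(np.arange(acts.shape[0]), acts.shape[1] // TILE)
for r, vals, cols in zip(tile_rows, values, indices):
    recon[r, cols] = vals
```

Because each tile's gather touches only its own few elements, it can run at the tail of the producing kernel, which is exactly what enables the operator fusion described above.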
3. Dual Evolution in Inference and Training
Beyond the TwELL fused operators for inference, the paper also brings major advancements to training.
During LLM training, VRAM is the primary bottleneck: uncompressed intermediate activations consume massive space. Sparse training, however, presents a major pitfall known as non-uniformity. Some tokens might activate 500 neurons while others activate only 5. Reserving space for the maximum wastes memory, while reserving for the average causes overflow during high-activation events.
The authors designed a Hybrid format to solve this:
Most rows, which follow the expected sparsity pattern, are stored in a compact sparse format.
The extremely rare long-tail rows with abnormally high activations are routed to a dense backup buffer.
This scheme leverages Tensor Core dense compute for heavy lifting while using custom sparse kernels for lighter tasks, achieving a win-win for memory efficiency and processing speed.
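A minimal sketch of this routing decision, under assumptions of ours rather than the paper's exact scheme: the per-row budget below is invented, and a real implementation would preallocate both buffers on the GPU rather than use Python dictionaries.

```python
import numpy as np

# Hybrid storage sketch: rows whose non-zero count fits a fixed budget go
# to a compact sparse buffer; rare dense outlier rows fall back to an
# uncompressed buffer. BUDGET is a made-up illustrative value.
rng = np.random.default_rng(3)
acts = rng.normal(size=(6, 32))
acts[rng.random(acts.shape) < 0.9] = 0.0     # most rows are very sparse
acts[0] = rng.normal(size=32)                # one dense long-tail outlier

BUDGET = 8                                   # max non-zeros per sparse row
sparse_rows, dense_rows = {}, {}
for i, row in enumerate(acts):
    nz = np.nonzero(row)[0]
    if nz.size <= BUDGET:
        sparse_rows[i] = (nz, row[nz])       # compact (indices, values) storage
    else:
        dense_rows[i] = row.copy()           # dense fallback buffer
```

The dense buffer stays tiny because long-tail rows are rare, while the common case enjoys the compact layout, which is the memory/speed trade the section describes.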
4. Experimental Results: The Miracle of 99% Sparsity
The authors used regularization to induce sparsity in the models. The results are impressive.
From the data, several core conclusions emerge.
Scale Effect: The larger the model, the more pronounced the benefits of sparsification. The 2B model saw an inference speedup of 20.5% and a training speedup of 21.9%.
Minimal Memory Footprint: The peak training memory for the 1B model dropped from 44.5GB to 33.1GB, a reduction of 25.5%.
Performance Fidelity: After introducing slight regularization, the model's average task accuracy barely dropped. The 1B model, for instance, actually improved marginally from 44.6% to 44.7%.
As the authors summarize: "We provide a quantitative study on LLM sparsity, demonstrating that simple regularization can induce over 99% sparsity with negligible impact on downstream performance."
This means developers can not only make models run faster but also train architectures on more affordable, lower-VRAM GPUs that previously could not handle them.
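To make the recipe concrete: a common way to induce activation sparsity, and our assumption of the general shape here rather than the paper's exact objective, is to add an L1 penalty on the gate activations to the training loss. Under ReLU, the penalty pushes most activations to exactly zero.

```python
import numpy as np

# Illustrative sparsity-inducing objective (an assumed L1 formulation,
# not necessarily the paper's exact regularizer).
def loss_with_sparsity(task_loss, gate_acts, lam=1e-3):
    """Total loss = task loss + lam * mean absolute gate activation."""
    return task_loss + lam * np.abs(gate_acts).mean()

# toy ReLU gate activations for one token: mostly exact zeros already
acts = np.array([0.0, 0.0, 2.5, 0.0, 0.1])
total = loss_with_sparsity(1.0, acts, lam=0.01)
```

The coefficient `lam` is the tuning knob the limitations section warns about: too large, and neurons die off entirely; too small, and sparsity never materializes.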
5. The Logic Behind Sparsity: Models Learning to Focus
Interestingly, the paper also reveals exactly where LLMs become sparse.
The authors found that sparsity is highly correlated with the information entropy of the input. For highly predictable tokens, such as common abbreviations or structural words, the model allocates very few active neurons. For tokens carrying critical context, like specific geographic locations or technical terminology, activation levels increase significantly.
Additionally, sequence position plays a role. The initial tokens typically require the most neurons to establish context; as the sequence grows, sparsity rises sharply. This indicates that sparse LLMs effectively learn to dynamically allocate compute resources, focusing processing power where it matters most.
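The position effect is straightforward to measure on any model that exposes its gate activations: count the active neurons at each sequence position. The sketch below does this on synthetic data, where a position-dependent shift stands in for the trend the authors observed; real measurements would of course use actual model activations.

```python
import numpy as np

# Measure per-position sparsity: count active gate neurons at each
# sequence position. The activation tensor here is synthetic, with a
# growing negative shift standing in for the learned position effect.
rng = np.random.default_rng(4)
seq_len, d_ff = 16, 64

shift = np.linspace(0.0, 2.0, seq_len)[:, None]        # later positions fire less
acts = np.maximum(rng.normal(size=(seq_len, d_ff)) - shift, 0.0)

active_per_pos = (acts > 0).sum(axis=1)                # neurons used per position
```

Plotting `active_per_pos` against position on a real model is exactly the kind of diagnostic that surfaces the "early tokens work hardest" pattern described above.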
6. Limitations and Future Directions
Of course, this technology is not without trade-offs. Currently, the kernel is highly optimized for the NVIDIA Hopper architecture, specifically leveraging new features like the Tensor Memory Accelerator. Gains may diminish on older hardware or non-NVIDIA chips. Furthermore, the choice of regularization coefficients requires careful tuning, as excessive regularization can lead to dead neuron issues.
Nevertheless, this work points the way forward. The future of Large Language Models is not just about endlessly piling on hardware power; it is moving toward more granular and efficient compute allocation.
By open-sourcing the code and kernels, the authors hope sparsity will become a standard dimension in modern model design. When we can achieve identical results with less energy, smaller memory footprints, and faster speeds, the Scaling Law will truly evolve to the next stage.