In 2017, "Attention Is All You Need" put the Transformer at the center of deep learning. Today almost every mainstream large model is built on this architecture, and the costs of inference, training, memory, and energy keep rising with model scale.
When a large model runs, the FFN hidden activations inside the Transformer are not all equally important. For any given token, only a small fraction contribute meaningfully, and most of the rest are close to zero.
With a light L1 regularization term, this sparsity can exceed 99%.
If so few activations are non-zero, why is the model not faster? And why can naively skipping the zeros to save computation actually slow things down on a GPU?
This work, published at ICML 2026, comes from Sakana AI and NVIDIA. One of its authors, Llion Jones, is also one of the original authors of "Attention Is All You Need."
The paper does not introduce complex architectural modifications but focuses on FFN activation sparsity. It uses simple L1 regularization to induce high sparsity in activations, then combines a new sparse packing format and CUDA kernels to truly skip over a large number of zero activations.
Paper Title:
Sparser, Faster, Lighter Transformer Language Models
Paper Link:
http://arxiv.org/abs/2603.23198
Code Link:
https://github.com/SakanaAI/sparser-faster-llms
With downstream task performance essentially unchanged, the approach delivers up to 20.5% faster forward passes and 21.9% faster training steps on billion-parameter models. Inference energy consumption falls along with it, and peak memory drops significantly in the sparse training experiments.
This turns sparsity that previously existed only as a theoretical FLOP reduction into measurable, practical gains on modern GPUs.
Inference, training speedup, and downstream performance at different sparsity levels
Sparse Does Not Equal Faster
In larger-scale modern LLMs, the FFN often accounts for more than two-thirds of the parameters and contributes over 80% of the total FLOPs.
Up, Gate, and Down Projection of the Gated FFN
The computational flow of a standard Gated FFN is typically expressed as:
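In common notation, with symbol names that are assumptions here rather than necessarily the paper's own, the gated FFN for an input vector $x$ can be written as follows, where $\sigma$ is the activation function (ReLU in this setting) and $\odot$ is element-wise multiplication:

```latex
\[
h = \sigma\!\left(W_{\text{gate}}\, x\right) \odot \left(W_{\text{up}}\, x\right),
\qquad
y = W_{\text{down}}\, h
\]
```

Whenever an entry of $\sigma(W_{\text{gate}}\, x)$ is exactly zero, the matching entry of $h$ is zero too, so the corresponding row of $W_{\text{up}}$ and column of $W_{\text{down}}$ contribute nothing to $y$.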
ReLU, as the activation function, can naturally produce unstructured sparsity. However, the software and hardware stack of modern GPUs has long been optimized around regular, contiguous dense computation.
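To make the source of the sparsity concrete, here is a minimal pure-Python sketch of a ReLU-gated FFN on a toy input. The weights, the dimensions, and the helper names (`relu`, `matvec`, `gated_ffn`) are illustrative assumptions, not taken from the paper or its code.

```python
def relu(v):
    # Negative pre-activations become exact zeros.
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def gated_ffn(x, W_gate, W_up, W_down):
    g = relu(matvec(W_gate, x))            # gate activations
    u = matvec(W_up, x)                    # up-projection
    h = [gi * ui for gi, ui in zip(g, u)]  # gated hidden state
    return matvec(W_down, h), g

# Toy sizes: d_model = 2, d_ff = 4.
x = [1.0, -2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
W_up = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 2.0]]

y, g = gated_ffn(x, W_gate, W_up, W_down)
sparsity = sum(1 for a in g if a == 0.0) / len(g)
print(g, y, sparsity)  # [1.0, 0.0, 0.0, 1.0] [2.0, 6.0] 0.5
```

Which positions are zero depends on the input, so the pattern changes from token to token. That is what makes the sparsity unstructured and hard to map onto dense GPU kernels.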
Traditional ELLPACK relies on whole-row packing and padding, which does not match the tiled matmul commonly used by modern GPUs.
Whole-Row Aligned Storage of Traditional ELLPACK
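The padding cost of whole-row packing can be seen in a small sketch of classic ELLPACK. The function name `to_ellpack` and the use of `-1` as a padding marker are choices made for this illustration only.

```python
def to_ellpack(dense):
    """Pack a dense matrix (list of rows) into ELLPACK value and column arrays."""
    # Every row is padded to the width of the densest row.
    width = max(sum(1 for v in row if v != 0) for row in dense)
    values, cols = [], []
    for row in dense:
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        pad = width - len(nz)
        values.append([v for _, v in nz] + [0.0] * pad)
        cols.append([j for j, _ in nz] + [-1] * pad)  # -1 marks a padding slot
    return values, cols, width

A = [
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 4.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 7.0],
]
values, cols, width = to_ellpack(A)
padded_slots = sum(1 for row in cols for c in row if c == -1)
print(width, padded_slots)  # 4 6
```

A single row with four non-zeros forces every row to width four, so half of the twelve stored slots are padding. The width is also only known after the whole row has been inspected, which fits poorly with kernels that produce the output one column tile at a time.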
If the complete gate activation is materialized first and only then converted to a sparse format, the conversion adds extra kernel launches, global memory reads and writes, and synchronization. The theoretical amount of computation falls, but the cost of format conversion, index management, and memory access can easily cancel out the gain.
TwELL Reduces Conversion Overhead
For the inference phase, the research team designed the TwELL (Tile-wise ELLPACK) format. This format abandons global row alignment and instead splits matrix columns into local 1D data blocks (Tiles) that fit well with dense computation.
TwELL slices the column direction into tiles, making it more suitable for fusion with matrix multiplication kernels
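The idea of packing per column tile rather than per whole row can be sketched as follows. This is only an illustration of tile-local packing under assumed conventions (a per-tile width, local column offsets, `-1` for padding). The actual TwELL memory layout in the paper and repository may differ, and the names `to_tilewise_ell` and `from_tilewise_ell` are hypothetical.

```python
def to_tilewise_ell(dense, tile):
    """Pack each column tile independently, using offsets local to the tile."""
    n_cols = len(dense[0])
    tiles = []
    for start in range(0, n_cols, tile):
        block = [row[start:start + tile] for row in dense]
        # The width is decided per tile, not across the whole row.
        width = max(sum(1 for v in r if v != 0) for r in block)
        vals, idx = [], []
        for r in block:
            nz = [(j, v) for j, v in enumerate(r) if v != 0]
            pad = width - len(nz)
            vals.append([v for _, v in nz] + [0.0] * pad)
            idx.append([j for j, _ in nz] + [-1] * pad)
        tiles.append({"start": start, "width": width,
                      "values": vals, "local_cols": idx})
    return tiles

def from_tilewise_ell(tiles, n_rows, n_cols):
    """Rebuild the dense matrix, to check that the packing is lossless."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for t in tiles:
        for r in range(n_rows):
            for v, j in zip(t["values"][r], t["local_cols"][r]):
                if j >= 0:
                    dense[r][t["start"] + j] = v
    return dense

A = [
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 4.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 7.0],
]
tiles = to_tilewise_ell(A, tile=3)
restored = from_tilewise_ell(tiles, n_rows=3, n_cols=6)
```

Because each tile depends only on its own columns, the code that computes one output tile of the gate projection has everything it needs to pack that tile without seeing the rest of the row.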
When computing gated activations, the TwELL format can be generated directly in the operator epilogue, avoiding the launch of a separate format conversion kernel and reducing additional global memory reads and writes.
Core logic of gated projection with TwELL storage generation
In the subsequent computation, a custom CUDA kernel performs the up-projection and the down-projection together in a single pass.
The core idea is to fuse the two multiplications so that the intermediate state h never has to be written to global memory and read back.
This fusion reduces the global memory reads and writes of intermediate activations, making the theoretical benefits of sparsity land more easily on actual speed.
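The fusion can be illustrated in plain Python, assuming the gate activations are already available as a list of non-zero values and their indices. This is a sketch of the arithmetic only, not of the CUDA kernel, and the function names are hypothetical.

```python
def dense_ffn(x, W_gate, W_up, W_down):
    # Reference path: materializes the full hidden vector h.
    g = [max(0.0, sum(w * a for w, a in zip(row, x))) for row in W_gate]
    u = [sum(w * a for w, a in zip(row, x)) for row in W_up]
    h = [gi * ui for gi, ui in zip(g, u)]
    return [sum(w * hj for w, hj in zip(row, h)) for row in W_down]

def fused_sparse_ffn(x, gate_vals, gate_idx, W_up, W_down):
    # Visits only hidden units whose gate is non-zero.
    y = [0.0] * len(W_down)
    for g, j in zip(gate_vals, gate_idx):
        u_j = sum(w * a for w, a in zip(W_up[j], x))  # one row of the up-projection
        h_j = g * u_j                                 # kept in a local variable only
        for i in range(len(y)):
            y[i] += h_j * W_down[i][j]                # column j of the down-projection
    return y

x = [1.0, -2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
W_up = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 2.0]]

# For this input the ReLU gate is non-zero only at hidden units 0 and 3.
y_sparse = fused_sparse_ffn(x, [1.0, 1.0], [0, 3], W_up, W_down)
y_dense = dense_ffn(x, W_gate, W_up, W_down)
print(y_sparse, y_dense)  # [2.0, 6.0] [2.0, 6.0]
```

The fused path touches two of the four rows of `W_up` and two of the four columns of `W_down`, and each `h_j` lives only as a scalar inside the loop.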
Hybrid Approach for Non-Uniform Sparsity
During training, memory capacity becomes a key bottleneck. The number of non-zero activations varies widely from token to token, so a single compact format is easily dragged down by the few rows with high non-zero counts.
The team developed a hybrid routing mechanism. Most low-activation tokens enter the highly compressed ELL matrix, while occasional high-activity tokens are dynamically diverted to a dense fallback channel and processed by Tensor Cores.
Sparse matrix operator routing computation logic based on a hybrid format
This design reduces dense computation and intermediate activation storage overhead during training, and also lessens the pressure of sparse training on peak memory.
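The routing decision can be sketched as a simple threshold on the per-token non-zero count. The threshold `ell_width`, the tuple layout, and the name `route_rows` are assumptions for illustration. How the paper's kernels actually pick and apply the cutoff is not described here.

```python
def route_rows(gate, ell_width):
    """Split token rows into a fixed-width ELL part and a dense fallback part."""
    ell_rows, dense_rows = [], []
    for r, row in enumerate(gate):
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        if len(nz) <= ell_width:
            pad = ell_width - len(nz)
            ell_rows.append((r,
                             [v for _, v in nz] + [0.0] * pad,
                             [j for j, _ in nz] + [-1] * pad))  # -1 marks padding
        else:
            dense_rows.append((r, row))  # too many non-zeros: keep the row dense
    return ell_rows, dense_rows

# Four tokens, six hidden units; token 1 is unusually active.
gate = [
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.0, 0.3, 0.4, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.5, 0.0, 0.0, 0.0, 2.0],
]
ell_rows, dense_rows = route_rows(gate, ell_width=2)
print([r for r, _, _ in ell_rows], [r for r, _ in dense_rows])  # [0, 2, 3] [1]
```

With a fixed ELL width of 2, three tokens fit the compact format and only the one busy token falls back to the dense path. Without the fallback, that single row would force every row to width 5.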
Measured Gains on Billions of Tokens
In the scaling experiments, the authors trained models from 0.5B to 2B parameters on 10B to 40B tokens. The core regularization term used for sparse training is as follows:
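A generic form of an L1 activation penalty is given below as an assumption-laden sketch. The choice of which activation $h$ is penalized, the averaging over $N$ tokens and $d_{\text{ff}}$ hidden units, and the symbol $\lambda$ for the coefficient are all illustrative and may not match the paper's exact definition.

```latex
\[
\mathcal{L} \;=\; \mathcal{L}_{\text{LM}}
\;+\; \lambda \cdot \frac{1}{N\, d_{\text{ff}}}
\sum_{t=1}^{N} \sum_{j=1}^{d_{\text{ff}}} \left| h_{t,j} \right|
\]
```

Here $\mathcal{L}_{\text{LM}}$ is the usual language modeling loss. A larger $\lambda$ pushes more activations to exactly zero, trading some modeling capacity for sparsity.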
Experiments show that moderate L1 regularization can reduce the average number of non-zero activations by several orders of magnitude. Under relatively conservative settings, downstream task performance stays essentially on par with the dense baseline.
Task accuracy and non-zero activation counts under different L1 regularization coefficients
Across multiple evaluations, measured inference speed increases by up to 30%, and memory requirements drop by over 24%.
Inference forward speedup ratio and energy savings statistics
Training step speedup ratio and peak memory reduction statistics
The experiments also indicate that the larger the model, the greater the throughput and memory gains from this sparse acceleration mechanism.
Execution efficiency and memory consumption comparison at different parameter scales
Computational Allocation from a Sparse Perspective
Activation sparsity also offers a window into how a model allocates its computation. Across network depth, the first two layers are relatively quiet, while the middle layers are the most active, handling core reasoning and knowledge retrieval.
Distribution of non-zero activation counts across different network layers
In terms of token characteristics, low-activity tokens are mostly fragments of common web links or highly predictable morphological segments. High-activity tokens include verbs, nouns, place names, and substance names that carry more contextual information.
Statistics on non-zero activation counts for specific tokens and their positions in sequences
This work does not attempt to replace the Transformer, nor does it rely on complex architectural modifications.
Its value lies in bringing FFN activation sparsity into the actual GPU execution path, using sparse formats and CUDA kernels to turn part of the theoretical compute savings into measurable gains in speed, energy, and memory.