Attention Is All You Need Author Returns: Can a 99% Sparse Transformer Be Even Faster?

In 2017, "Attention Is All You Need" put the Transformer at the center of deep learning. Today almost every mainstream large model is built on this architecture, and the costs of inference, training, memory, and energy keep rising with model scale.

When a large model runs, the hidden activations inside each Transformer FFN are not equally important. For any given token, only a small fraction contributes meaningfully, and most activations are close to zero.

After adding lightweight L1 regularization, this sparsity can even reach over 99%.

If so few activations are non-zero, why is model speed still limited? And why can directly skipping the zero activations to save computation actually make things slower on a GPU?

This work, published at ICML 2026, comes from Sakana AI and NVIDIA. One of its authors, Llion Jones, is also one of the original authors of "Attention Is All You Need."

The paper introduces no complex architectural modifications and focuses instead on FFN activation sparsity. It uses simple L1 regularization to induce highly sparse activations, then combines a new sparse packing format with custom CUDA kernels to actually skip the large number of zero activations.

Illustration of the paper's core concepts

Paper Title:

Sparser, Faster, Lighter Transformer Language Models

Paper Link:

http://arxiv.org/abs/2603.23198

Code Link:

https://github.com/SakanaAI/sparser-faster-llms

With downstream task performance essentially unaffected, the approach achieves up to 20.5% faster forward passes and 21.9% faster training steps on billion-parameter models. Inference energy consumption drops as well, and peak memory also decreases significantly in the sparse training experiments.

This turns sparsity that previously existed only as a theoretical FLOPs reduction into measurable, practical gains on modern GPUs.

Inference, training speedup, and downstream performance at different sparsity levels

Sparse Does Not Equal Faster

In larger-scale modern LLMs, the FFN often accounts for more than two-thirds of the parameters and contributes over 80% of the total FLOPs.

Up, Gate, and Down Projection of the Gated FFN

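A standard gated FFN computes a gate projection and an up projection of the input, applies the activation function to the gate, multiplies the two elementwise, and then applies the down projection.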
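
In the usual notation (the symbols below are assumed for illustration and are not taken from the paper), a gated FFN with a ReLU gate can be written as follows, where x is the input to the layer and h is the intermediate hidden activation:

```latex
\[
\begin{aligned}
g &= \mathrm{ReLU}\!\left(x W_{\text{gate}}\right), \\
h &= g \odot \left(x W_{\text{up}}\right), \\
y &= h W_{\text{down}}.
\end{aligned}
\]
```

Every position where g is zero makes the matching entry of h zero as well, so the corresponding column of W_up and row of W_down contribute nothing to y.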

ReLU, as the activation function, can naturally produce unstructured sparsity. However, the software and hardware stack of modern GPUs has long been optimized around regular, contiguous dense computation.

Traditional ELLPACK packs whole rows and pads every row to the length of the longest one, which does not match the tiled matmul commonly used by modern GPUs.

Whole-Row Aligned Storage of Traditional ELLPACK
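
To make the padding cost concrete, here is a minimal pure-Python sketch of classic ELLPACK packing. The helper name `to_ellpack`, the `-1` padding index, and the toy activation matrix are all illustrative assumptions rather than anything from the paper's code.

```python
def to_ellpack(dense, pad_index=-1):
    """Pack each row's non-zeros to the left, then pad to the widest row."""
    rows = [[(j, v) for j, v in enumerate(row) if v != 0.0] for row in dense]
    width = max((len(r) for r in rows), default=0)
    values = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    indices = [[j for j, _ in r] + [pad_index] * (width - len(r)) for r in rows]
    return values, indices, width


# Toy activations (hypothetical): three quiet rows and one busy row.
activations = [
    [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.1, 0.0, 0.3, 0.0, 0.9, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6],
]

values, indices, width = to_ellpack(activations)
nonzeros = sum(1 for row in activations for v in row if v != 0.0)
padded_slots = width * len(activations) - nonzeros
print(width, nonzeros, padded_slots)
```

In this toy case a single busy row sets the width for every row, so more than half of the stored slots are padding.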

If the complete gate activation is materialized first and then converted to a sparse format, the conversion adds extra kernel launches, global memory reads and writes, and synchronization overhead. The theoretical amount of computation goes down, but the cost of format conversion, index management, and memory access can easily cancel out the gains.

TwELL Reduces Conversion Overhead

For inference, the research team designed the TwELL (Tile-wise ELLPACK) format. It abandons global row alignment and instead splits the matrix columns into local 1D tiles that fit well with dense tiled computation.

TwELL slices the column direction into tiles, making it more suitable for fusion with matrix multiplication kernels

When computing gated activations, the TwELL format can be generated directly in the operator epilogue, avoiding the launch of a separate format conversion kernel and reducing additional global memory reads and writes.

Core logic of gated projection with TwELL storage generation
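
The idea of tile-local packing can be illustrated with a small pure-Python sketch. This is only one plausible reading of the description above, not the paper's actual memory layout: the function names, the dictionary fields, and the choice to give each tile as many slots as it has columns are all assumptions made for clarity.

```python
def to_tilewise_ell(dense, tile_width):
    """Pack the non-zeros of every row independently inside each column tile."""
    n_cols = len(dense[0])
    tiles = []
    for start in range(0, n_cols, tile_width):
        values, offsets, counts = [], [], []
        for row in dense:
            chunk = row[start:start + tile_width]
            nz = [(k, v) for k, v in enumerate(chunk) if v != 0.0]
            pad = len(chunk) - len(nz)
            values.append([v for _, v in nz] + [0.0] * pad)
            offsets.append([k for k, _ in nz] + [-1] * pad)  # tile-local column
            counts.append(len(nz))
        tiles.append({"start": start, "values": values,
                      "offsets": offsets, "counts": counts})
    return tiles


def from_tilewise_ell(tiles, n_rows, n_cols):
    """Rebuild the dense matrix, used here only to check the packing."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for tile in tiles:
        for r in range(n_rows):
            for slot in range(tile["counts"][r]):
                col = tile["start"] + tile["offsets"][r][slot]
                dense[r][col] = tile["values"][r][slot]
    return dense


# Hypothetical gate activations: 2 tokens, 8 hidden units, tiles of width 4.
gate = [
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 1.2],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.0, 0.8, 0.0],
]
tiles = to_tilewise_ell(gate, tile_width=4)
restored = from_tilewise_ell(tiles, n_rows=2, n_cols=8)
print([t["counts"] for t in tiles])
```

Because each tile is packed using only the values that belong to that tile, the packing step needs no information from other tiles, which is the property that would let it run in the epilogue of a tiled matmul kernel.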

In the subsequent computation, a custom CUDA kernel performs the up projection and the down projection in a single pass.

The core idea is to fuse the two matrix multiplications, so that the intermediate state h never has to be written to and read back from global memory.
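
One way to write this fusion down, using assumed notation (x is the layer input, g the gate activation after ReLU, and S(x) the set of hidden units whose gate is non-zero), is:

```latex
\[
y \;=\; \sum_{j \in \mathcal{S}(x)} g_j \,\bigl(x\, W_{\text{up}}[:, j]\bigr)\, W_{\text{down}}[j, :],
\qquad
\mathcal{S}(x) = \{\, j : g_j \neq 0 \,\}.
\]
```

Each term of the sum needs only one column of W_up and one row of W_down, and the scalar h_j = g_j (x W_up[:, j]) is consumed as soon as it is produced.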

This fusion reduces the global memory reads and writes of intermediate activations, making the theoretical benefits of sparsity land more easily on actual speed.
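
The following pure-Python sketch checks, on a tiny hypothetical example, that a fused loop over the non-zero gate entries gives the same result as the dense gated FFN. It illustrates the arithmetic only; the function names and weights are made up, and nothing here reflects how the actual CUDA kernel is organized.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def dense_ffn(x, w_gate, w_up, w_down):
    gate = [relu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    h = [g * u for g, u in zip(gate, up)]  # full intermediate vector
    return matvec(h, w_down)


def fused_sparse_ffn(x, gate_values, gate_indices, w_up, w_down):
    """Visit only non-zero gate entries; h_j stays in a local variable."""
    y = [0.0] * len(w_down[0])
    for g, j in zip(gate_values, gate_indices):
        u = sum(x[i] * w_up[i][j] for i in range(len(x)))  # column j of W_up
        h_j = g * u
        for k in range(len(y)):
            y[k] += h_j * w_down[j][k]                     # row j of W_down
    return y


# Hypothetical weights: d_model = 2, d_ff = 4.
x = [1.0, -2.0]
w_gate = [[1.0, -1.0, 0.5, 2.0], [1.0, 0.5, -1.0, 0.5]]
w_up = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

gate = [relu(g) for g in matvec(x, w_gate)]
gate_indices = [j for j, g in enumerate(gate) if g != 0.0]
gate_values = [gate[j] for j in gate_indices]

y_dense = dense_ffn(x, w_gate, w_up, w_down)
y_fused = fused_sparse_ffn(x, gate_values, gate_indices, w_up, w_down)
print(gate_indices, y_dense, y_fused)
```

Here two of the four hidden units are gated off, so the fused loop touches only two columns of `w_up` and two rows of `w_down`.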

Hybrid Approach for Non-Uniform Sparsity

During training, memory capacity becomes a key bottleneck. The number of non-zero activations varies greatly from token to token, and a single compact format can easily be dragged down by a few rows with high non-zero counts.

The team developed a hybrid routing mechanism. Most low-activation tokens go into a highly compressed ELL matrix, while the occasional high-activity tokens are dynamically diverted to a dense fallback path and processed by Tensor Cores.

Sparse matrix operator routing computation logic based on a hybrid format
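
A minimal pure-Python sketch of such routing is shown below. The threshold `ell_width`, the function name `route_rows`, and the toy activations are hypothetical; how the real system chooses the threshold and lays out the two paths is not described here.

```python
def route_rows(activations, ell_width):
    """Send rows with few non-zeros to a fixed-width ELL part, others to dense."""
    ell_rows, ell_values, ell_indices = [], [], []
    dense_rows, dense_values = [], []
    for r, row in enumerate(activations):
        nz = [(j, v) for j, v in enumerate(row) if v != 0.0]
        if len(nz) <= ell_width:
            pad = ell_width - len(nz)
            ell_rows.append(r)
            ell_values.append([v for _, v in nz] + [0.0] * pad)
            ell_indices.append([j for j, _ in nz] + [-1] * pad)
        else:
            dense_rows.append(r)             # fallback: keep the row dense
            dense_values.append(list(row))
    return (ell_rows, ell_values, ell_indices), (dense_rows, dense_values)


# Toy activations (hypothetical): row 2 is the occasional busy token.
activations = [
    [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.1, 0.0, 0.3, 0.0, 0.9, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6],
]

(ell_rows, ell_values, ell_indices), (dense_rows, dense_values) = route_rows(
    activations, ell_width=2
)
ell_slots = len(ell_rows) * 2
print(ell_rows, dense_rows, ell_slots)
```

With the busy row routed to the dense path, the ELL part needs only two slots per row, whereas packing all four rows into one ELLPACK matrix would force a width of five for every row.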

This design reduces dense computation and intermediate activation storage overhead during training, and also lessens the pressure of sparse training on peak memory.

Measured Gains on Billions of Tokens

In the scaling experiments, the authors trained models from 0.5B to 2B parameters on 10B to 40B tokens. Sparse training adds an L1 regularization term on the FFN activations to the training loss.
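
A generic form of an L1-regularized objective of this kind is sketched below. The notation is assumed, and the exact normalization and the choice of which activation is penalized may differ in the paper: here the first term is the usual language-modeling loss, h^(l) is the FFN hidden activation in layer l, L is the number of layers, and lambda is the regularization coefficient.

```latex
\[
\mathcal{L}_{\text{total}}
\;=\;
\mathcal{L}_{\text{LM}}
\;+\;
\lambda \sum_{\ell=1}^{L} \bigl\lVert h^{(\ell)} \bigr\rVert_{1}
\]
```

Increasing lambda pushes more activations to exactly zero after ReLU, at the risk of hurting task accuracy if it is set too high.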

Experiments show that moderate L1 regularization can reduce the average number of non-zero activations by several orders of magnitude. Under relatively conservative settings, downstream task performance essentially remains level with the dense baseline.

Task accuracy and non-zero activation counts under different L1 regularization coefficients

Across multiple downstream evaluations, measured inference speed increases by up to 30%, and memory requirements drop by over 24%.

Inference forward speedup ratio and energy savings statistics

Training step speedup ratio and peak memory reduction statistics

The experimental data also indicate that the larger the model, the more pronounced the throughput improvement and memory savings from this sparse acceleration mechanism.

Execution efficiency and memory consumption comparison at different parameter scales

Computational Allocation from a Sparse Perspective

Activation sparsity also offers a window into how a model allocates its computation. Across network depth, the first two layers are relatively quiet, while the middle layers are the most active, handling core reasoning and knowledge retrieval tasks.

Distribution of non-zero activation counts across different network layers

In terms of token characteristics, low-activity tokens are mostly fragments of common web links or highly predictable morphological segments. High-activity tokens include verbs, nouns, place names, and substance names that carry more contextual information.

Statistics on non-zero activation counts for specific tokens and their positions in sequences

This work does not attempt to replace the Transformer, nor does it rely on complex architectural modifications.

Its value lies in bringing FFN activation sparsity into the real GPU execution path, using sparse formats and CUDA kernels to turn part of the theoretical computational savings into measurable gains in speed, energy, and memory.
