MLNLP Community is a well-known machine learning and natural language processing community in China and abroad, whose audience includes NLP doctoral students, university faculty, and industry researchers.
The community's vision is to promote exchange and progress between academia, industry, and enthusiasts in NLP and machine learning, especially for beginners.
Source | PaperWeekly
Since 2017, Self-Attention has almost become the absolute cornerstone of modern sequence modeling.
We have long been accustomed to lifting sequence hidden states into a massive pairwise interaction space by computing QK^T. This paradigm is brute-force but effective, and it comes at a high cost: not only quadratic computational overhead, but also a dense, high-dimensional tensor cloud that is difficult to interpret.
Recently, a paper titled "Attention Is Not What You Need" directly challenged this assumption.
The authors did not follow the established paths of Mamba or RWKV (i.e., approaching the problem from the RNN/SSM time-series perspective), but instead blazed a new trail, proposing a novel perspective based on differential geometry.
If we view inference as a geometric evolution on a semantic manifold, then what we truly need is not attention weights, but an evolution mechanism capable of capturing local geometric structures (such as subspace changes).
This is the Causal Grassmann Transformer. It does not compute global Attention; instead, it maps token pairs to points on the Grassmann manifold Gr(k, d) (i.e., subspaces), utilizes Plücker coordinates for feature encoding, and achieves sequence mixing completely free of attention.
Paper Title: Attention Is Not What You Need: Grassmann Flows as an Attention-Free Alternative for Sequence Modeling
Paper Link: https://arxiv.org/pdf/2512.19428
Research Background
To understand the innovation of Grassmann Flow, one must first understand what the core operator of Transformer means mathematically.
In the standard Transformer, the multi-head attention mechanism computes Q, K, V via linear projections and then builds the attention matrix: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V. The authors characterize this process as Tensor Lifting: to study the relationships among N points, the model brute-forces its way into an interaction tensor space of dimension N^2 x d.
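For reference, the operator being replaced can be sketched in a few lines of NumPy (an illustrative single-head sketch, not the authors' code):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d)) V: the "tensor lifting" step builds an N x N matrix
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (N, N) pairwise interactions
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # rows are probability weights
    return w @ V                                  # weighted sum of values
```

The O(N^2) cost is visible in the `(N, N)` score matrix: every token interacts with every other token before any value is mixed.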
Although this lifting grants the model extremely high degrees of freedom, it brings two fatal drawbacks:
Opacity: after stacking multiple layers and heads, the model is effectively manipulating an extremely complex tensor cloud. Because of the high degrees of freedom, no concise set of mathematical invariants can describe the model's global behavior.
Complexity shackles: The O(N^2) computational cost is unsustainable for long sequences.
The authors proposed a highly philosophical hypothesis: the unexplainability of large models stems not only from parameter counts but from establishing core operators on untraceable high-dimensional tensor liftings. If we restrict the mixing mechanism to a finite-dimensional manifold with a clear structure, we might achieve both expressiveness and interpretability.
Grassmann Manifold and Plücker Embedding
The core idea of Causal Grassmann Transformer is very elegant: replacing weighted summation with subspace evolution. The model no longer computes global token similarities but captures the geometric features of the linear subspaces formed by tokens within local windows.
The architecture mainly consists of the following four steps:
1. Linear Dimension Reduction
First, to control computational load and extract core semantic directions, the model projects the high-dimensional hidden state h_i into a low-dimensional space R^k (k=7 in experiments): h_i^proj = W_down * h_i.
This step not only reduces the overhead of subsequent geometric calculations but also implicitly approximates the local tangent space of the semantic manifold.
2. Local Pairing and Grassmann Manifold
This is the technical core of the paper. The model defines a set of multi-scale windows W (e.g., [2, 3, 4]). Note that to preserve the autoregressive property, strictly causal pairing is used: position i is paired only with positions in its past, never peeking at the future.
The model examines the 2-dimensional linear subspace spanned by h_i^proj and h_j^proj. Mathematically, the set of all 2-dimensional subspaces of the reduced space R^k constitutes the Grassmann manifold Gr(2, k). This means the model treats each token pair as a single point on the manifold, rather than as two independent vectors.
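One plausible reading of the causal pairing rule can be sketched as follows (the exact offset convention, here j = i - (w - 1) so that a window of size w spans w tokens, is an assumption; the paper may index windows differently):

```python
def causal_pairs(seq_len, windows=(2, 3, 4)):
    """Enumerate causal (i, j) pairs: for each window size w, position i
    is paired with the past position j = i - (w - 1), if it exists."""
    pairs = []
    for i in range(seq_len):
        for w in windows:
            j = i - (w - 1)
            if j >= 0:           # skip positions without enough history
                pairs.append((i, j))
    return pairs
```

Because every pair looks strictly backwards, the mixing step never leaks future information, and the number of pairs grows as N * |W| rather than N^2.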
3. Plücker Coordinate Embedding
How to handle points on a manifold in a neural network? The authors utilized Plücker Embedding from algebraic geometry. For a pair of vectors u and v, the Plücker coordinates p consist of all possible 2x2 sub-determinants: p_{kl} = u_k v_l - u_l v_k.
This vector uniquely determines the subspace (up to scalar multiplication). The geometric intuition is elegant: Plücker coordinates encode the areas of the projections of the parallelogram formed by the two vectors onto each coordinate plane. The representation no longer cares how far apart the two vectors are (distance); it focuses on their relative pose, the way the pair "opens up". This is a more essential geometric feature than the dot product.
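The Plücker map itself is just the collection of all 2x2 minors, and its key invariance properties can be checked directly (a minimal sketch; the function name `plucker` is ours):

```python
import numpy as np
from itertools import combinations

def plucker(u, v):
    """Plücker coordinates of span{u, v}: all 2x2 minors u_a v_b - u_b v_a."""
    k = len(u)
    return np.array([u[a] * v[b] - u[b] * v[a]
                     for a, b in combinations(range(k), 2)])
```

Two properties make this a subspace (not vector-pair) feature: `plucker(u, u)` is identically zero, and replacing v with v + c*u, which leaves the spanned plane unchanged, leaves the coordinates unchanged as well.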
4. Projection and Gating Fusion
Finally, the extracted geometric features are projected back to the model dimension d and injected into the backbone flow via a gating mechanism: h_i = h_i + Gating(W_up * Geom(h_i, h_j)).
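Putting the four steps together, one mixing block might look like the following end-to-end sketch (the weight shapes, the single-window loop, and the scalar sigmoid gate are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grassmann_block(H, W_down, W_up, w_gate, window=2):
    """One hypothetical Grassmann mixing block: project down, form a causal
    pair, take Plücker coordinates, project up, add through a gate."""
    N, d = H.shape
    k = W_down.shape[1]
    idx = list(combinations(range(k), 2))   # index pairs for the 2x2 minors
    Hp = H @ W_down                         # (N, k) reduced states
    out = H.copy()
    for i in range(window - 1, N):
        j = i - (window - 1)                # causal partner in the past
        u, v = Hp[i], Hp[j]
        p = np.array([u[a] * v[b] - u[b] * v[a] for a, b in idx])  # Plücker
        geom = p @ W_up                     # back up to model dimension d
        out[i] = out[i] + sigmoid(w_gate @ geom) * geom  # gated residual
    return out
```

Note that position 0 has no past partner under this convention and passes through unchanged; in the full model, multiple windows and stacked layers would supply the global mixing.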
5. Complexity Analysis
The computational complexity of the entire process is O(N * k^2 * |W|), which is linear in the sequence length (k and |W| are constants).
In contrast, standard Attention costs O(N^2 * d). And although the number of Plücker coordinates grows quadratically with the ambient dimension, they are computed in the reduced space rather than the full model dimension: even at a model dimension of d = 4096, the paper reports a feature dimension of only 496, which is well within acceptable limits.
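The dimension counts above can be sanity-checked directly (assuming, as in steps 1-3, that Plücker coordinates are computed in the reduced k-dimensional space, giving comb(k, 2) minors per pair):

```python
from math import comb

k = 7                        # projection dimension used in the experiments
plucker_dim = comb(k, 2)     # number of 2x2 minors: k*(k-1)/2 = 21

# Total pair evaluations grow as N * |W|: linear in sequence length.
N, windows = 1024, (2, 3, 4)
pair_evals = N * len(windows)

# The 496 feature dimension quoted above equals comb(32, 2); whether 32 is
# the paper's exact reduced dimension at d = 4096 is our assumption.
assert comb(32, 2) == 496
```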
Experimental Results
The authors evaluated model performance on Wikitext-2 (Language Modeling) and SNLI (Natural Language Inference). The experimental design was very honest, directly comparing Transformer baselines with similar parameter counts.
1. Language Modeling (Wikitext-2)
In tasks heavily relying on global context, such as language modeling, GrassmannLM showed competitiveness but failed to surpass Transformer.
Table 1-2. Comparison of PPL for TransformerLM and GrassmannLM at different layer counts.
Results show that GrassmannLM's perplexity (PPL) is about 10-15% higher than the Transformer's. In practice, a 15% PPL gap usually renders a model unusable. However, this is a prototype that discards attention entirely. As the number of layers increases (from 6 to 12), the gap narrows slightly, suggesting that stacking geometric flows can indeed approximate complex global interactions.
2. Natural Language Inference (SNLI)
However, in the SNLI task, which requires logical reasoning, the result reversed. With DistilBERT fixed as the backbone and only the classification head replaced:
Table 3. Grassmann-Plücker head slightly outperforms in inference tasks.
The Grassmann-Plücker head slightly outperformed the Transformer head in accuracy. This suggests that when dealing with logical relationships like entailment and contradiction, explicit subspace geometric features may contain richer semantic structural information than simple attention weights.
3. Actual Running Speed
Although the theoretical complexity is linear, the authors concede that the current implementation relies on basic PyTorch operations (the Plücker coordinate calculation in particular involves extensive slicing and rearrangement) and lacks a heavily optimized CUDA kernel in the style of FlashAttention, so its actual training speed is slower than an optimized Transformer. This confirms once again that in deep learning, systems engineering is as important as algorithmic theory.
Conclusion
This paper does not announce the end of the Transformer; it is an illuminating attention-free experiment. It shows that, given sufficiently rich geometric evolution rules, a model can achieve competitive sequence modeling performance even after completely discarding attention weights.
A deeper implication lies in inductive bias. The Transformer is an architecture with extremely weak inductive bias (a fully connected graph); it relies on massive data and compute to brute-force its way to intelligence. Grassmann Flow goes the opposite way, introducing an explicit geometric inductive bias. While the field races to scale parameter counts and context lengths, it is worth pausing to ask whether the essence of intelligence is brute-force statistics or elegant evolution on a manifold.
This might be a signal—don't forget the infinite possibilities inherent in mathematics itself.