Not design, but evolution: when cross-entropy meets SGD, Bayesian reasoning becomes a mathematical inevitability.
For a long time, the reasoning abilities of LLMs have been viewed as an inexplicable "emergence." We watch the loss fall, yet what happens inside the parameter space remains opaque.
Recently, a research team from Columbia University and Dream Sports released a trilogy of papers.
This work did not stop at experimental observation: it builds a complete physical picture connecting the optimization objective (loss), the internal geometry, and the inference behavior.
It tells a complete story of how LLMs work. Its core ambition, as the titles suggest, is to end the black-box era of Transformers with mathematics.
They prove that the Attention mechanism is not an approximate feature extractor, but a precise Bayesian inference machine that emerges spontaneously under the drive of gradient descent.
Theoretical Anchor: The Bayesian Endgame of Cross-Entropy
Transformer training is usually based on minimizing cross-entropy loss. Paper I first clarifies the mathematical endgame of this optimization process.
Paper Title:
The Bayesian Geometry of Transformer Attention
Paper Link:
https://arxiv.org/abs/2512.22471
In the limit of infinite data and model capacity, the minimizer of cross-entropy is mathematically equivalent to the analytical Bayesian posterior predictive distribution:
p*(x_{t+1} | x_{1:t}) = ∫ p(x_{t+1} | θ) p(θ | x_{1:t}) dθ
In other words, the optimal next-token predictor averages the likelihood over the posterior of the latent hypothesis θ given the context.
To verify whether finite-capacity Transformers truly approach this limit, the authors constructed "Bayesian Wind Tunnels".
This is a completely controlled mathematical environment where the analytical posterior for every step is precisely known.
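To give a concrete flavor of such a wind tunnel, here is a minimal sketch (my own toy construction, not the authors' code): the latent hypothesis is a random bijection over a small alphabet, each observed (x, π(x)) pair eliminates the inconsistent hypotheses, and the exact posterior entropy is tracked in closed form.

```python
import numpy as np
from itertools import permutations

# Toy "Bayesian wind tunnel": the hidden hypothesis is a random
# bijection (permutation) over a small alphabet.  Every observed
# (x, y = pi(x)) pair zeroes out the inconsistent permutations, so
# the exact posterior is known at every step and any model's
# predictions can be scored against it.
rng = np.random.default_rng(0)
K = 4                                   # alphabet size
hyps = list(permutations(range(K)))     # all K! candidate bijections
true_pi = hyps[rng.integers(len(hyps))]

posterior = np.ones(len(hyps)) / len(hyps)   # uniform prior
entropies = []
for _ in range(5):
    x = int(rng.integers(K))
    y = true_pi[x]
    # Likelihood of (x, y) under hypothesis h: 1 if h(x) == y, else 0.
    lik = np.array([1.0 if h[x] == y else 0.0 for h in hyps])
    posterior = posterior * lik
    posterior /= posterior.sum()
    # Entropy of the posterior over hypotheses, in bits.
    ent = -sum(p * np.log2(p) for p in posterior if p > 0)
    entropies.append(ent)

print(entropies)   # non-increasing as hypotheses are eliminated
```

Because the posterior stays uniform over the surviving permutations, its entropy is just log2 of their count, which can only shrink as evidence accumulates. This analytically known trajectory is what the Transformer's prediction entropy is compared against.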
Figure 1. Conceptual diagram of the "Bayesian Wind Tunnel." Unlike natural language, which lacks a ground-truth posterior, the authors built a controllable environment that can be measured precisely.
Experimental results show that Transformers achieve extremely high precision on bijection-learning and HMM state-tracking tasks.
Figure 2. The Transformer's prediction entropy precisely fits the theoretical Bayesian posterior, with a mean absolute error (MAE) as low as 10^-3 bits; in contrast, an MLP cannot effectively use context for hypothesis elimination.
More microscopic evidence comes from single-sequence analysis, which distinguishes genuine per-sequence inference from memorization of averages:
Figure 3. For each specific sequence, the Transformer's entropy value (solid line) can precisely track the zigzag changes of the theoretical posterior (dashed line), proving that the model is performing token-by-token real-time inference.
And on HMM tasks, the model even demonstrated strong length generalization, evidence that it learned a general recursive algorithm:
Figure 4. The model fits perfectly within the training length K=20. When tested at lengths K=30 and K=50, the error grows smoothly with no cliff-like collapse, indicating the model did not memorize by rote.
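The "general recursive algorithm" in question is the classic HMM forward recursion, which computes the exact posterior predictive at every step. A self-contained reference implementation (with arbitrary small matrices of my own choosing, not the paper's setup):

```python
import numpy as np

# Exact posterior predictive for an HMM via the forward recursion --
# the reference computation the Transformer is compared against.
rng = np.random.default_rng(1)
S, V = 3, 4                          # hidden states, observation symbols
A = rng.dirichlet(np.ones(S), S)     # transition matrix, rows sum to 1
B = rng.dirichlet(np.ones(V), S)     # emission matrix, rows sum to 1

def predictive_entropies(obs):
    """Run the forward recursion over `obs`; return the entropy
    (in bits) of p(x_{t+1} | x_{1:t}) after each observation."""
    belief = np.ones(S) / S          # uniform prior over hidden states
    out = []
    for x in obs:
        belief = belief * B[:, x]    # condition on the current symbol
        belief /= belief.sum()
        belief = belief @ A          # propagate one step forward
        pred = belief @ B            # p(next symbol | history)
        out.append(-np.sum(pred * np.log2(pred)))
    return out

obs = rng.integers(V, size=10)
print(predictive_entropies(obs))
```

Crucially, the recursion has constant per-step state (the belief vector), which is why a model that truly implements it generalizes to sequence lengths beyond those seen in training.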
Geometric Representation: Three-Stage Evolution of Reasoning
Probe experiments further reveal how the Transformer internally implements this reasoning process. The authors describe it as a three-stage geometric evolution mechanism.
1. Hypothesis Framework Construction (Layer 0)
Reasoning starts with the establishment of a coordinate system. Key vectors in Layer 0 form an approximately orthogonal basis, mapping all possible hypotheses into independent geometric subspaces.
Figure 5. Cosine similarity matrix of Key vectors in Layer 0. Off-diagonal elements are close to 0, indicating the model constructed an orthogonal hypothesis space framework.
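A probe of this kind reduces to a cosine-similarity matrix. The sketch below uses random Gaussian vectors as a stand-in for extracted Layer-0 Keys (an assumption of mine; in high dimension such vectors are themselves near-orthogonal, which is partly why this basis is cheap for a model to construct):

```python
import numpy as np

# Cosine-similarity probe for an (approximately) orthogonal Key basis.
# `keys` stands in for Layer-0 Key vectors extracted from a model.
rng = np.random.default_rng(2)
H, d = 8, 256                        # number of hypotheses, key dimension
keys = rng.standard_normal((H, d))

normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
cos = normed @ normed.T              # cosine similarity matrix

off_diag = cos[~np.eye(H, dtype=bool)]
print(np.abs(off_diag).mean())       # small: near-orthogonal basis
```

An orthogonal basis means each hypothesis occupies its own subspace, so evidence for one can be accumulated without interfering with the others.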
2. Progressive Hypothesis Elimination (Middle Layers)
As depth increases, attention's routing function emerges: the alignment between Query and Key vectors sharpens markedly.
This process is mathematically equivalent to the multiplication of likelihood functions in Bayesian updating, layer-by-layer suppressing hypotheses inconsistent with current evidence.
Figure 6. From the divergent attention of Layer 0 (left) to the highly focused attention of Layer 5 (right), demonstrating the model's gradual elimination of wrong hypotheses.
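The equivalence is easiest to see in log space: multiplying likelihoods layer by layer is the same as summing log-scores before a softmax, and the softmax of the running sum sharpens step by step. A toy sketch (the evidence values are hypothetical numbers of my own, not model outputs):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in bits."""
    return -np.sum(p * np.log2(p + 1e-12))

# Hypothetical per-layer log-likelihood evidence over 3 hypotheses.
log_lik = [np.array([0.0, -1.0, -3.0]),
           np.array([0.0, -2.0, -1.0]),
           np.array([0.0, -1.5, -2.0])]

logits = np.zeros(3)                 # uniform prior
for ll in log_lik:
    logits = logits + ll             # multiplying likelihoods = adding logs
    print(entropy(softmax(logits)))  # entropy shrinks layer by layer
```

Each addition is one Bayesian update, posterior ∝ prior × likelihood; the "sharpening" seen in attention maps is this product concentrating on the surviving hypothesis.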
3. Entropy-Ordered Manifold (Late Layers)
When the routing structure stabilizes, Value vectors in the representation space do not collapse into discrete points, but unfold into a smooth 1D Manifold. The parametric coordinates of this manifold correspond precisely to Posterior Entropy.
Figure 7. In the late stages of training, the PCA projection of Value vectors forms a smooth curve, with low-entropy (high-confidence) states and high-entropy states geometrically ordered.
Dynamical Tracing: The Induction Mechanism of Gradient Descent
Why does standard gradient descent spontaneously produce these geometric structures? Paper II derives the full first-order gradient dynamics and finds that cross-entropy loss induces an ingenious positive-feedback mechanism.
Paper Title:
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Paper Link:
https://arxiv.org/abs/2512.22473
1. Advantage Routing Rule (E-step)
The gradient of the attention logit s_ij obeys
∂L/∂s_ij = p_ij (c_ij − c̄_i), with c_ij = ⟨δ_i, v_j⟩ and c̄_i = Σ_k p_ik c_ik,
where δ_i is the upstream error gradient at query position i, v_j is the Value at position j, and p_ij is the attention weight. Define the Advantage A_ij = c̄_i − c_ij, so a gradient-descent step changes the logit by Δs_ij ∝ p_ij A_ij.
Physical meaning: c_ij measures how strongly v_j aligns with the error direction δ_i. When v_j points against the error direction (c_ij more negative than the baseline c̄_i, i.e., attending there helps reduce the loss), the Advantage is positive.
Conclusion: gradient descent increases attention weights precisely at the positions that most effectively reduce the loss.
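The routing gradient can be verified numerically. In the sketch below (a toy setup of my own: a single query, squared loss standing in for the upstream objective), the analytic form p_j (c_j − c̄) is checked against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 4
s = rng.standard_normal(n)          # attention logits for one query
V = rng.standard_normal((n, d))     # Value vectors
target = rng.standard_normal(d)     # hypothetical regression target

def loss(s_vec):
    """Squared loss of the attention readout for logits s_vec."""
    p = np.exp(s_vec - s_vec.max())
    p /= p.sum()
    return 0.5 * np.sum((p @ V - target) ** 2)

p = np.exp(s - s.max())
p /= p.sum()
delta = p @ V - target              # upstream error dL/d(output)
c = V @ delta                       # compatibilities c_j = <delta, v_j>
analytic = p * (c - p @ c)          # dL/ds_j = p_j (c_j - c_bar)

# Central finite differences along each logit coordinate.
h = 1e-6
numeric = np.array([
    (loss(s + h * e) - loss(s - h * e)) / (2 * h)
    for e in np.eye(n)
])
print(np.max(np.abs(analytic - numeric)))   # tiny: formulas agree
```

Note that the analytic gradient sums to zero across positions: attention can only redistribute weight, raising logits with positive Advantage at the expense of the rest.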
2. Responsibility-Weighted Update Rule (M-step)
The update of a Value vector follows
Δv_j ∝ −Σ_i p_ij δ_i,
the negative responsibility-weighted sum of the upstream errors of all queries attending to position j.
Physical meaning: the Value vector is pulled against the weighted average of the error signals of the queries that attend to it, gradually evolving into the "prototype" of that cluster of queries.
Figure 8. Geometric interpretation of the dynamics. The Value moves against the error signal, improving the context and driving the compatibility more negative, forming a closed loop in which routing and content co-evolve.
This dynamic process is structurally equivalent to an implicit EM algorithm (Expectation-Maximization). Attention weights act as "soft responsibilities" in the E-step, while Value vectors act as "prototypes" in the M-step.
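The M-step side can be checked the same way: with several queries attending to a shared set of Values, the gradient of each Value is the responsibility-weighted sum of upstream errors, ∂L/∂v_j = Σ_i p_ij δ_i. A sketch under the same toy assumptions (squared loss per query, dimensions of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d = 3, 5, 4                   # queries, values, feature dim
S = rng.standard_normal((m, n))     # attention logits per query
V = rng.standard_normal((n, d))     # shared Value vectors
T = rng.standard_normal((m, d))     # per-query targets

P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)   # responsibilities p_ij
delta = P @ V - T                   # upstream errors delta_i

analytic = P.T @ delta              # sum_i p_ij delta_i, shape (n, d)

def loss(Vm):
    return 0.5 * np.sum((P @ Vm - T) ** 2)

# Finite-difference check on one entry of one Value vector.
eps = 1e-6
E = np.zeros_like(V)
E[2, 1] = eps
numeric = (loss(V + E) - loss(V - E)) / (2 * eps)
print(abs(analytic[2, 1] - numeric))   # agreement to numerical precision
```

The descent step −η P.T @ delta is exactly an EM-style M-step: each Value moves toward the responsibility-weighted mean of what its attending queries need, with P playing the role of soft cluster assignments.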
This also explains the Frame-Precision Dissociation phenomenon: the attention structure stabilizes rapidly early in training, while the Value content keeps being refined on the manifold for the rest of training.
Real-World Mapping: From Superposition to Chain-of-Thought
Although the above conclusions are based on controlled environments, the authors point out in their blog [3] that similar geometric features have also been observed in production-grade models such as Pythia, Llama, and Mistral.
The key lies in Superposition: In mixed tasks, the manifold structure is often obscured by high-dimensional noise; but through Domain Restriction (e.g., focusing only on math tasks), high-dimensional representations collapse into clear entropy-ordered manifolds.
Figure 9. Conceptual diagram showing similar manifold structures emerging within Pythia, Llama, and Mistral under domain-restricted tasks.
This discovery provides a clear geometric explanation for Chain-of-Thought (CoT).
For complex reasoning tasks, a Transformer risks running out of depth: it cannot complete all the necessary hypothesis eliminations within its fixed number of computation steps.
CoT essentially acts as a Geometric Extender.
By generating intermediate reasoning steps, the model effectively gains extra computation rounds. It can then perform a series of short, robust state transitions along the high-confidence "entropy-ordered manifold," instead of a single long-distance jump through low-confidence regions, where hallucinations arise.
Conclusion
This research provides a unified perspective to understand the essence of Transformer intelligence. Optimization gives rise to geometry. Geometry gives rise to inference.
The parameter matrix is not a random statistical approximation, but a Bayesian reasoning machine "sculpted" by the gradient flow on the cross-entropy potential energy surface.
From the perspective of geometric dynamics, the Attention mechanism is precisely the physical carrier of this reasoning process.
References
[1] Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra. The Bayesian Geometry of Transformer Attention. arXiv preprint arXiv:2512.22471 (2025).
[2] Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra. Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds. arXiv preprint arXiv:2512.22473 (2025).
[3] Vishal Misra. Attention Is Bayesian Inference. Medium (Dec 2025). https://medium.com/@vishalmisra/attention-is-bayesian-inference-578c25db4501