Compression Is All You Need — A Letter on Mathematics and AI from Fields Medalist Michael Freedman

In March 2026, Fields Medalist Michael Freedman published a paper just over 30 pages long, titled "Compression is all you need." Using an elegant algebraic model, he answers three ancient questions: How do humans construct mathematics? What is the essential difference between human mathematics and formal mathematics? And how will human mathematicians collaborate with AI in the future? The answer lies in a single word—compression.

In June 2017, eight people at Google Brain uploaded a paper to arXiv.

The title was as audacious as a paper can get: "Attention Is All You Need".

Nine years later, this title has become the most famous five-word phrase in AI history. The Transformer architecture it introduced underpins ChatGPT, Gemini, Claude, DeepSeek, trillions in market value, and the anxiety of a generation.

On March 27, 2026, another paper quietly appeared on arXiv. Its title was just seven words, built on the same formula:

Compression is all you need: Modeling Mathematics

Seeing this title, anyone in AI would instinctively smirk—"another bandwagon jumper." Until their eyes land on the author line, and the smirk vanishes.

Michael Freedman.

This is not some ML engineer. This is a 1986 Fields Medalist, the man who proved the four-dimensional Poincaré conjecture, the soul of Microsoft's Station Q for the past two decades, one of the most consequential living mathematicians.

Is he writing about AI? Not exactly. He is telling everyone in AI: That word you've been using, "compression," is far deeper than you ever imagined.

This article is not an engineering breakthrough like "Attention Is All You Need." It is a letter—a mathematician, using a lifetime of trained intuition, answering three questions that have puzzled humanity for millennia:

How exactly do humans construct mathematical knowledge?
What is the essential difference between the mathematics humans do and formalized, "purely logical" mathematics?
How should future human mathematicians collaborate with AI?

His answer, distilled to one word: compression.

Today's piece translates that letter for you.

Chapter 1: Who is Freedman?

First, let's clarify why when this man speaks, the AI world must listen.

In 1981, at age thirty, Freedman, then at the University of California, San Diego, solved the four-dimensional Poincaré conjecture, a problem that had stood open for 77 years. The three-dimensional version earned Perelman a Fields Medal in 2006 (which he declined); dimensions five and above were solved back in the '60s. The fourth dimension, wedged between them, was Freedman's conquest.

In 1986, at the International Congress of Mathematicians in Berkeley, Freedman received his Fields Medal.

In 1997, Freedman did something few mathematicians do: he left academia for Microsoft. The company later built a group almost tailor-made for him, Station Q (founded in 2005), with a single goal: to build a topological quantum computer through a mathematician's lens. He stayed at Microsoft for roughly twenty-five years.

In 2023, he returned to Harvard's CMSA (Center of Mathematical Sciences and Applications), adopting a new role: pondering the relationship between AI and mathematics.

So, when Freedman dropped a paper titled "Compression is all you need" in March 2026, this wasn't some trend-chasing researcher. This was a man who has spent a lifetime looking at the world from within mathematics, suddenly turning around to tell everyone:

"I've figured something out. Do you want to hear it?"

Chapter 2: An Awkward Fact for Everyone

Freedman’s paper starts with an awkward fact that is known to all in mathematics but explained by almost no one.

First, establish two concepts:

Formal Mathematics (FM): All deductions that conform to logical rules.
Human Mathematics (HM): The subset of mathematics that humans actually write down, compile, and cite.

How vast is the space of FM? Given n basic symbols, the combinations of "legitimate deductions" are exponential—once n exceeds a few hundred, it surpasses the number of atoms in the universe.

And HM? From Euclid to today, all theorems written by all mathematicians amount to roughly a few million. MathLib in Lean 4 contains about 140,000 of them.

Let's write the two numbers side-by-side.

FM: > 10⁸⁰
HM: ~ 10⁵ (taking MathLib as the yardstick)
There are at least 75 orders of magnitude between them.

Human mathematics is a tiny corner, not even a speck of dust, within the universe of formal mathematics.

And more importantly—why this speck?

FM contains endless "legitimate but boring" theorems. For example: "For any integer n, n + 0 = n," "For any integer n, n + 0 + 0 = n," "For any integer n, n + 0 + 0 + 0 = n"... each one is legitimate, each one is meaningless. Human mathematicians never write these down.

For a century, there have been countless philosophical answers: "beauty," "simplicity," "usefulness," "depth"—all wordplay. None of them is a mathematical answer.

Until Freedman, in 2026, offered the first computable answer:

Because HM is the "compressible" subset of FM.

Chapter 3: Compression—Standing on Everyday Ground First

What does Freedman mean by "compression"? Don't think about mathematics yet. Think about some examples you already understand.

Example 1: Huffman Coding

Your cat is named Whiskers. The most frequent action in photos is "sleeping" (4000 times), followed by "eating" (3000), "scratching the sofa" (2000), and "spacing out" (1000).

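Fixed 8-bit encoding: 10,000 photos × 8 bits = 80,000 bits. Huffman coding assigns "sleeping→0; eating→10; scratching the sofa→110; spacing out→111", which costs 4,000×1 + 3,000×2 + 2,000×3 + 1,000×3 = 19,000 bits. That is a compression ratio of about 4.2x, with no information lost.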

Compression exists wherever the distribution of things is uneven.
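
The arithmetic in the cat example can be checked with a short script. This is a minimal sketch of textbook Huffman coding using Python's standard heapq module; the function name huffman_code_lengths and the shortened action labels are illustrative choices, not anything from Freedman's paper.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside gets one bit deeper.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Photo counts from the Whiskers example.
freqs = {"sleeping": 4000, "eating": 3000, "scratching": 2000, "spacing out": 1000}
lengths = huffman_code_lengths(freqs)
huffman_bits = sum(freqs[s] * lengths[s] for s in freqs)
fixed_bits = 8 * sum(freqs.values())

print(lengths)
print(huffman_bits, fixed_bits, round(fixed_bits / huffman_bits, 2))
```

The script reproduces the code lengths 1, 2, 3, 3 and the totals of 19,000 versus 80,000 bits. Note that the 4.2x figure depends on the 8-bit baseline; against a 2-bit fixed code (enough for four actions) the saving would be much smaller.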

Example 2: Newton's Three Laws

Every second, the universe witnesses countless motions: an apple falling, the moon orbiting, a spring vibrating, a bullet firing, tides rising and falling... How much information would you need to record all these movements?

You don't.

You just need to remember F = m·a, plus two other laws (inertia, action-reaction), and you can regenerate all the motions above.

Newton's three laws are a program only a few dozen characters long, yet it encodes the entirety of classical mechanics.

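Example 3: Zip Files

"To be or not to be, that is the question; to be": take the recurring "to be" and "the," name them A and B, and then just write the names. This is the idea behind the LZ77 algorithm (1977), which underlies zip, gzip, and PNG. Strictly speaking, LZ77 stores a back-reference (how far back, how many characters) rather than a name, but the effect is the same: a repeat costs far less than the original.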

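The effect of the zip example is easy to observe with Python's standard zlib module, which implements DEFLATE (LZ77-style back-references followed by Huffman coding). This is a small sketch; repeating the phrase 50 times is an arbitrary choice made here to exaggerate the redundancy.

```python
import zlib

# Highly repetitive input: the same phrase 50 times (an arbitrary choice).
text = b"to be or not to be, that is the question; " * 50

packed = zlib.compress(text, 9)      # level 9 = strongest compression
restored = zlib.decompress(packed)   # lossless: we get the input back

print(len(text), len(packed))
print(restored == text)
```

The compressed size is a small fraction of the original because almost every phrase can be expressed as "copy what appeared earlier."
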
Example 4: Large Language Models

Feed the entire internet to an LLM: trillions of words of text. After training, you get a model with hundreds of billions of parameters (hundreds of gigabytes). It can generate content similar to anything in its training set.

In the language of information theory, this means: An LLM is a lossy compression of the internet.

In 2023, DeepMind did something that raised eyebrows: they treated Chinchilla 70B as a general-purpose compressor, using it to compress raw byte streams—not just text, but also images and audio it had never been trained on. The result:

Text compression rate: Far superior to gzip
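Image compression rate: Better than PNG (ImageNet patches shrink to 43.4% of raw size, versus 58.5% for PNG)
Audio compression rate: Better than FLAC (LibriSpeech samples shrink to 16.4% of raw size, versus 30.3% for FLAC)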

A model trained primarily on text could actually compress images and audio it had never seen, because it had learned a "universal world structure."

From Huffman's character encoding to an LLM's hundreds of billions of parameters—the granularity of compression gets coarser, but the essence is the same.

Any act of "understanding" is essentially finding a shorter description.
This is not a metaphor. This is the starting point of Freedman's paper.

Chapter 4: Freedman's Model—Strings and "Macros"

The first thing Freedman says: treat mathematical deduction as a string of characters. When you write a proof on a blackboard, it is fundamentally a string. All "legitimate proof strings" lined up—that is FM.

But mathematicians never write like that. They will say: "Suppose f is continuous on [a, b], then f is uniformly continuous."

"Continuous" is a definition, expanding to about three lines of symbols. "Uniformly continuous" is another, about five lines. What appears as 20 characters fully expands to over 100. Dig further—every "short sentence" rests atop a deep tree of definitions.

Freedman gives a name to this "name → a long string" convention: macro.

"Continuous" = a macro
"Uniformly continuous" = a macro
"Integral" = a macro (calling macros for "limit," "partition," "Riemann sum")
"Lebesgue integral" = a macro (calling macros for "measure," "measurable function")
"Riemann-Lebesgue lemma" = a macro (calling all of the above)

The "full expansion" of a modern theorem often runs to hundreds of millions of characters. Yet a mathematician only ever looks at the outermost layer.
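
The gap between the outermost layer and the full expansion can be illustrated with a toy macro table. This is only a sketch: the macro names echo the list above, but the bodies are hypothetical stand-ins, far shorter than any real definition, and the expand function is an illustration rather than anything from Freedman's paper.

```python
# Hypothetical macro table: each name maps to a body of tokens.
# Tokens that are themselves keys get expanded; all others are raw symbols.
MACROS = {
    "limit": ["forall", "eps", "exists", "delta", "bound"],
    "partition": ["a", "<", "x_i", "<", "b"],
    "riemann_sum": ["sum", "f", "partition", "width"],
    "integral": ["limit", "riemann_sum", "partition"],
}

def expand(token):
    """Recursively replace macro names by their bodies until only raw symbols remain."""
    if token not in MACROS:
        return [token]
    out = []
    for t in MACROS[token]:
        out.extend(expand(t))
    return out

wrapped = MACROS["integral"]      # what the mathematician writes
unwrapped = expand("integral")    # what the definition "really" says

print(len(wrapped), len(unwrapped))  # 3 tokens versus 18 raw symbols
```

Even in this tiny table, three tokens stand for eighteen symbols; each additional layer of naming multiplies the gap.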

A mathematician's job is to constantly create macros.
A mathematician's entire life might just be doing one thing—spotting a pattern that no one had compressed before, and giving it a name.

Gauss named the "normal distribution." Riemann named "manifold." Galois named "group." Cantor named "set." Turing named "computability." Shannon named "entropy."

All the mathematics you learn today is built on top of the macros created by predecessors. Without layered compression, humans simply could not learn mathematics.

Chapter 5: A_n vs F_n—Two Universes

So far, this is all intuition. What Freedman does next is turn this intuition into mathematics.

He introduces two algebraic objects (don't panic, we'll use intuition):

A_n is like building with Lego

You have a pile of Lego bricks—red, blue, green. Red on blue plus green, or green first then blue then red—the final model is the same. Order doesn't matter; you only care about which bricks.

F_n is like braiding hair

Pressing the left strand first then the right, versus right then left—the resulting braid is completely different. Order dictates everything.

Freedman's theorem states something "as beautiful as magic":

Freedman's Core Algebraic Discovery

In A_n, just O(log n) macros (logarithmic sparsity) are enough to make expressiveness expand exponentially.
In F_n, even O(n^k) macros (polynomial density) can only make expressiveness expand linearly.

The same strategy of "creating macros" yields vastly different results in two universes—compressibility is structural.

Translated into plain English:

In the "Lego universe," a few macros do the work of ten thousand—bricks combine freely, and macros also combine freely with each other.
In the "braid universe," no amount of macros can save you—the order is fixed, and each combination must be learned individually.
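
One way to get a feel for the Lego/braid contrast is to count how many distinct objects exist at a given length. The sketch below is an illustration of the analogy only, not Freedman's construction: it assumes the "Lego" case means order is ignored (so a word is just a multiset of letters) and the "braid" case means every ordering is a different word. The function names are hypothetical.

```python
from math import comb

def ordered_words(n, k):
    """Distinct length-k words on n letters when order matters."""
    return n ** k

def unordered_words(n, k):
    """Distinct length-k words on n letters when order is ignored (multisets)."""
    return comb(n + k - 1, k)

# Three letters (red, blue, green), words of growing length.
for k in (5, 10, 20):
    print(k, ordered_words(3, k), unordered_words(3, k))
```

With three letters and length 20, there are 3,486,784,401 ordered words but only 231 unordered ones. When order does not matter, far fewer genuinely different things exist, so a handful of names can cover them.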

Why is this contrast important? Because it tells us that "compressibility" is not universal; it only exists within specific structures.

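In mathematics, many core operations, such as addition, multiplication of numbers, and union of sets, are commutative. (Function composition is a notable exception: it is associative but generally not commutative.) On this account, that is why so much of mathematics is compressible.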

What about human language? The order of subject-verb-object is critical—"dog bites man" is not the same as "man bites dog." Thus, language is far less compressible than mathematics.

What about biology? DNA sequence order is paramount—which is why biology has long been descriptive, lacking succinct laws on the level of "F = m·a."

What about the parameter space of an LLM? More on that in Chapter 8.

Chapter 6: MathLib Empirical Proof—Let the Data Speak

Theory alone is not enough. Freedman did something that elevates this paper from a "philosophical essay" to "hard science": he validated the model against real human mathematics.

Test subject: MathLib—the formalized mathematical library of Lean 4, containing 140,000 theorems, covering algebra, analysis, topology, number theory, category theory...

For each theorem, three quantities were measured:

depth: nesting depth
wrapped length: number of tokens in the definition
unwrapped length: number of raw symbols when fully expanded

Result 1: unwrapped length explodes exponentially with depth.

The deeper the nesting, the larger the character count after full expansion, and the growth is exponential. At depth 10+, expanding a single theorem requires tens of millions of characters.

Result 2: wrapped length is almost constant.

But the definitions written by mathematicians, whether the depth is 2 or 12, maintain a nearly constant length—always just a few dozen tokens.

Mathematicians never write very long definitions.
Whenever something becomes complex, a mathematician's first instinct is to give it a name, and then continue using the name.

Unwrapped length explodes exponentially, while wrapped length remains unmoved—at each layer, mathematicians create a macro, pushing the complexity back down.
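
The three quantities can be computed on a synthetic example to see how constant wrapped length and exponential unwrapped length coexist. This is a sketch on made-up data, not MathLib: the tower below is hypothetical, with each level's definition mentioning the previous level twice plus one raw symbol, and the function names are illustrative.

```python
from functools import lru_cache

def make_tower(levels):
    """Synthetic definition tower (hypothetical, not MathLib data):
    each level's body mentions the previous level twice plus one raw symbol."""
    defs = {}
    for d in range(1, levels + 1):
        prev = f"level{d - 1}"
        defs[f"level{d}"] = [prev, "op", prev]
    return defs

DEFS = make_tower(12)

@lru_cache(maxsize=None)
def depth(name):
    """Nesting depth: raw symbols have depth 0."""
    if name not in DEFS:
        return 0
    return 1 + max(depth(t) for t in DEFS[name])

@lru_cache(maxsize=None)
def unwrapped_length(name):
    """Number of raw symbols after full expansion, computed without expanding."""
    if name not in DEFS:
        return 1
    return sum(unwrapped_length(t) for t in DEFS[name])

for d in (2, 6, 12):
    name = f"level{d}"
    print(depth(name), len(DEFS[name]), unwrapped_length(name))
```

The wrapped length stays at 3 tokens at every level, while the unwrapped length goes 7, 127, 8191, roughly doubling with each layer. Memoizing the length avoids ever building the expanded string, which is the only practical way to measure deep definitions.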

Result 3: The data perfectly fits A_n and severely violates F_n.

Freedman plotted the theoretical curves of both models on the same graph. A_n's exponential expansion curve fits seamlessly over the measured data. F_n's linear curve is off by several orders of magnitude.

Human mathematics lives in the compressible subspace predicted by the A_n model. This is not a metaphor; it is a measurable fact.

Chapter 7: Answers to Three Ancient Questions

Now we can return to the three questions from the beginning. Freedman's answers are each stunningly short.

Question 1: How exactly do humans construct mathematical knowledge?

Layer by layer compression.
Each generation of mathematicians looks at the achievements of the previous one, identifies the parts that "can be named," creates new macros, and then continues to deduce on top of these new macros. The entire history of mathematics is a history of macro accumulation.

Euclid named "point, line, plane" → Descartes named "coordinates" → Newton named "derivative" → Cauchy named "limit" → Cantor named "set" → Hilbert named "space" → Grothendieck named "scheme"... Each layer compressed more than the one before.

Question 2: What is the essential difference between human mathematics and formal mathematics?

Compressible vs. Incompressible.
Most theorems in FM are "legitimate but boring"—they have no structure, cannot be named, and cannot be used further. HM is that tiny corner of FM that happens to live in an A_n-like subspace.

Human mathematics is "human" precisely because our cognitive bandwidth is extremely limited—we can only operate within that compressible subspace. And the existence of that subspace is a gift from the universe—if it didn't exist, humanity would never have developed mathematics at all.

Question 3: How should future human mathematicians collaborate with AI?

AI's strength is parallel search within the vast space of FM—because it has the bandwidth we lack.
Human strength lies in judging which areas "deserve a name"—because we have fifty thousand years of training in language and abstraction.

This is not about AI replacing mathematicians, nor mathematicians training AI. It is a division of labor between two different cognitive bandwidths.

Freedman also offers a specific suggestion: run PageRank + compressibility analysis on MathLib's dependency graph. A theorem that is cited by many downstream theorems (high PageRank) and greatly compresses downstream content (high compressibility) is a core theorem—worthy of human mathematicians' investment and AI's prioritized search.
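
The PageRank half of that suggestion can be sketched in a few lines. This is a minimal power-iteration implementation on a hypothetical five-node dependency graph; the node names, the damping factor of 0.85, and the iteration count are assumptions made here, and the compressibility score is not modeled at all.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank. graph maps each node to the nodes it cites."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes that cite nothing is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in graph[v]:
                new[target] += damping * rank[v] / len(graph[v])
        rank = new
    return rank

# Hypothetical dependency graph: an edge u -> v means "u's proof cites v".
deps = {
    "lemma_core": [],
    "thm_a": ["lemma_core"],
    "thm_b": ["lemma_core"],
    "thm_c": ["lemma_core", "thm_a"],
    "corollary": ["thm_c"],
}

ranks = pagerank(deps)
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.3f}")
```

In this toy graph the lemma that everything else leans on comes out on top. A real analysis would build the graph from MathLib's actual dependency data and combine the rank with a measure of how much each result shortens the statements downstream.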

This turns "what is important mathematics" from a subjective judgment into a computable quantity.

Chapter 8: What This Means for AI

First Implication: The roadmap for AI doing mathematics becomes clear.

The recent milestones: DeepMind's AlphaProof, together with AlphaGeometry 2, reached silver-medal standard at the 2024 IMO; Terence Tao has publicly made Lean 4 part of his workflow; DeepMind's FunSearch found new constructions for the cap set problem in combinatorics; specialized mathematical LLMs have emerged.

Freedman's framework gives a unified explanation for all these: they search within the vast space of FM, but the areas where they succeed are precisely those already compressed by HM.

AI's mathematical capability stands on the shoulders of humanity's two-thousand-year history of "creating macros."
Without the 140,000 theorems in MathLib, AI searching in pure FM is like looking for a grain of rice in the Sahara Desert.

The next breakthrough will not come from making AI search FM faster—but from teaching AI to "create its own macros."

Second Implication: The answer to what an LLM is, becomes clearer.

DeepMind's 2023 paper "Language Modeling Is Compression" gave the first layer of the answer: next-token prediction is equivalent to maximizing compression rate under arithmetic coding. The cross-entropy loss during training is, up to a small constant overhead, the average number of bits per token that an arithmetic coder driven by the model would spend on the training data. Lower loss means a shorter code, and that part is a mathematical identity rather than a metaphor; reading a shorter code as deeper understanding is the interpretive step.
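
The link between loss and code length can be made concrete. The sketch below assumes two hypothetical models that assign the listed probabilities to the tokens that actually occurred in a four-token text; the probabilities are invented for illustration. Summing -log2 of those probabilities gives the ideal code length, which an arithmetic coder matches to within a couple of bits for the whole sequence.

```python
import math

def code_length_bits(probs_of_actual_tokens):
    """Ideal total code length, in bits, for a sequence whose actual next
    tokens were assigned these probabilities by a predictive model."""
    return sum(-math.log2(p) for p in probs_of_actual_tokens)

# Hypothetical per-token probabilities from two models on the same 4-token text.
weak_model = [0.25, 0.25, 0.25, 0.25]
strong_model = [0.5, 0.5, 0.25, 0.5]

for name, probs in [("weak", weak_model), ("strong", strong_model)]:
    total = code_length_bits(probs)
    # total bits for the text, then bits per token (the cross-entropy in bits)
    print(name, total, total / len(probs))
```

The weak model needs 8 bits (2.0 per token) and the strong model 5 bits (1.25 per token): the better predictor is, by the same arithmetic, the better compressor.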

But Freedman gives the second layer: LLMs use macros, but they don't create them.

During training, an LLM consumes the entire internet—which is filled with humanity's two thousand years of created macros ("calculus," "evolution," "democracy," "entropy," "attention"...). The LLM learns to move fluidly between these macros—hence its stunning performance on "single-step reasoning."

But on "long proofs"—it collapses. A proof that requires creating a new macro is very difficult for an LLM to complete reliably. Because it has never seen this macro in training, it cannot define a new concept from scratch and then continue to deduce on top of it.

This is exactly the "layer" in Freedman's "layered compression"—each layer is a new act of naming. LLMs are stunning within a single layer, but break across layers.

Third Implication: Why LLM scaling might have an upper bound.

If intelligence is essentially "layered compression"—creating macros, and then macros upon macros—then simply making the model larger increases the bandwidth within a single layer, not the number of layers.

A larger LLM can use finer macros, a larger vocabulary, and a longer context. But its ability to create new macros does not undergo a qualitative change simply by becoming larger.

An LLM is a macro-user. True intelligence is a macro-generator.

This echoes the debate in the "World Model Wars" between LeCun and Li Fei-Fei on one side and Ilya Sutskever on the other. Sutskever argues that the ceiling for "using macros" hasn't been reached, while LeCun argues that the ability to "create macros" hasn't even begun.

Chapter 8.5: Beyond Mathematics—Poetry, Painting, and Music are Also Compression

Freedman's paper talks about mathematics from beginning to end. But if "compression is understanding" is truly a cosmic-level fact, it should not only hold true in mathematics.

As I wrote this, a line by Wang Wei jumped into my head.

A vast desert, a single column of smoke rises straight; A long river, the setting sun is round.

Ten characters in the original Chinese (大漠孤烟直，长河落日圆). No ornaments, no adjectives, not a single "emotional" word. Yet after reading these ten characters, an image instantly surfaces in your mind: vast, empty, a lone straight wisp of smoke, a perfectly round sun pressing against the horizon. Immediately after, a feeling of desolate bleakness and solitude arises, one you can't articulate but definitely feel.

How much information is hidden behind these ten characters? Visually, a complete panorama of the northwestern frontier; geometrically, the minimalist compositional contrast of "straight" and "round," one vertical and one circle holding up the entire space; in time, the instant of sunset, the day's end; in mood, the loneliness of an envoy traveling far from home, the melancholy of leaving familiar land; and as background, the entire symbolic system of High Tang frontier poetry. To recount this in prose would take thousands of words, yet still be insufficient. Wang Wei used ten characters to compress it all into a seed that can unfurl again in your mind.

This is exactly the same "macro" discussed in Freedman's paper. "Vast desert," "solitary smoke," "long river," "setting sun"—each one is a macro—it calls upon two millennia of accumulated imagery, visuals, and emotions from Chinese literature. Wang Wei's genius isn't "writing beautifully," but selecting those four macros that maximize information content when unfurled, and placing them together.

Music is another face. The opening of Beethoven's Fifth Symphony has just four notes: ta-ta-ta-tum. But these four notes are transformed, recombined, transposed up and down, and inverted hundreds of times throughout the symphony. A symphony lasting roughly half an hour is essentially pressed out of a four-note motif, a technique musicians call motivic development. In Freedman's terms: create a macro, and then freely expand within the macro's space.

Painting, too. Qi Baishi paints shrimp: no water, no weeds, just the shrimp—what you see are the shrimp, but what you feel is the entire pond. The blank space is not "unpainted"; it allows the viewer to unfurl that vast information within their own mind. A single bird rolling its eyes by Bada Shanren, and you read the entire mindset of a Ming dynasty loyalist living under a new dynasty.

Why do all arts point to the same thing? My conjecture is as follows:

The human brain can only hold a limited number of "dimensions" at once. An attention span composed of a few thousand brain cells can, at any given moment, only make associations within a relatively low-dimensional space.

So we specialize—some focus on finding compressible structures in the dimension of mathematics (geometry, groups, manifolds), others in the dimension of language (imagery, rhythm, puns), others in the dimension of sound (harmony, tonality, motifs), and others in the visual dimension (composition, proportion, negative space). Not because these fields are unrelated, but because one person cannot shoulder all the dimensions. We use the channel to which we are innately most sensitive to compress the world, so that disciplines remain mutually incomprehensible—the barrier is not between the disciplines, but within our own cognitive bandwidth.

And the LLM, for the first time, provides a physical foundation for "connecting these dimensions."

A model with hundreds of billions of parameters has an internal representation space with a dimensionality far exceeding what any single human individual can call upon simultaneously. Thus, many things that seem "unrelated" to us—a Song dynasty poem, a Bach fugue, a partial differential equation, an ink wash painting—begin to exhibit directions of alignment within that high-dimensional space.

Emergence in LLMs is not mystical metaphysics; it is this: when the dimensionality of compression becomes large enough, macros originally scattered across different disciplines begin to call upon each other. The macro "entropy" suddenly becomes the same thing in physics, information theory, economics, and psychology. The macro "symmetry" suddenly becomes the same thing in group theory, crystals, music, and poetry. This is probably what cross-domain generalization is: the rudiment of the so-called "world model."

So, mathematics, poetry, painting, and music are not four different things. They are projections of the same thing onto four different media.

Wang Wei was not "just a poet"; he was someone finding compressible structures in the dimension of language. Euler was not "just a mathematician"; he was someone finding compressible structures in the dimension of symbols. Beethoven was not "just a composer"; he was someone finding compressible structures in the dimension of time. Qi Baishi was not "just a painter"; he was someone finding compressible structures in the visual dimension.

Different paths, same destination. All things are one.

Every ordinary person among us, in our own most sensitive channel, is doing the same thing—compressing the complex world into a short description we can hold onto, and then living by that short description.

Freedman used an algebraic model to argue: mathematics exists because it lives in an A_n-like compressible subspace. I want to add something he didn't say: Human civilization exists because it lives in the union of countless compressible subspaces. Mathematics is just the cleanest one among them, but not the only one.

Chapter 9: The Convergence of Four Views of Probability

Writing up to this point, I can't help but look back at the path this blog has taken over the past year.

A main thread runs through four articles—each looking at the same mathematical object P(x) from a different angle:

Perspective	What P(x) is	Core Argument	Key Figures
Bayesian	Belief	Update upon evidence	Bayes / Jaynes
Entropy	Ignorance	Entropy is a measure of ignorance	Boltzmann / Shannon
Quantum QBism	Reality	Probability is the state of the world itself	Born / Fuchs
Compression (this article)	Understanding	-log P is the description length	Shannon / Freedman

All four perspectives point to the same formula:

L(x) = − log P(x)

Bayesian camp: L(x) is "surprise," driving belief updates.
Statistical mechanics camp: L(x) is the contribution of a microstate to entropy.
QBism camp: L(x) is the weight of a measurement outcome on the next bet.
Compression camp: L(x) is the number of characters this event takes up in the optimal encoding.

They are the same mathematical object, viewed from four different philosophical positions.
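
For the compression reading, the formula can be taken one step further as a worked equation. The block below assumes base-2 logarithms (so lengths are in bits), which the formula above leaves unspecified; the bound on the last line is the standard guarantee for an optimal prefix code such as Huffman's.

```latex
\begin{align*}
L(x) &= -\log_2 P(x) \\
\mathbb{E}_{x \sim P}\,[L(x)] &= -\sum_x P(x)\,\log_2 P(x) \;=\; H(P) \\
H(P) &\le \mathbb{E}_{x \sim P}\,[\ell_{\mathrm{Huffman}}(x)] \;<\; H(P) + 1
\end{align*}
```

Applied to the cat photos from Chapter 3, with probabilities 0.4, 0.3, 0.2, and 0.1, the entropy is H(P) ≈ 1.846 bits per photo, and the Huffman code there spends 19,000 / 10,000 = 1.9 bits per photo, just above the floor that L(x) sets.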

The significance of Freedman's paper is this—he upgrades this formula from "an information-theoretic tool" to "the very foundation of mathematics itself." Mathematics can exist because the universe is compressible; humans can do mathematics because we live in a low-description-length structure like A_n.

Chapter 10: Four Intuitions Left by Compression

One: All "understanding" is compression.

You understand a phenomenon when you can regenerate it with a description far shorter than the original data. If you can do this, you understand it; if you can't, you are just memorizing it.

Two: What's unique about mathematics is its capacity for "nested compression."

Not just a single act of compression, but "compression upon compression." Each generation of mathematicians packages the previous generation's results into a name, then continues working on that name. This recursive process is something other disciplines lack (or lack to this degree).

Three: Mathematics, poetry, painting, and music are projections of the same thing onto four different media.

Masters in every field are excavators of compressible subspaces within their own channels. Wang Wei's "Vast desert, a single column of smoke rises straight; A long river, the setting sun is round" and Euler's e^(iπ)+1=0 are essentially isomorphic—both compress massive amounts of information into a seed that can unfold in another person's mind. We specialize not because the world is fragmented, but because a single person's cognitive bandwidth is insufficient. LLMs, for the first time, allow the macros from these specialized fields to begin calling upon each other within the same high-dimensional space—this is what is meant by emergence and generalization.

Four: For AI to do true mathematics (and deep intellectual tasks), it must learn to "create macros" and not just "use macros."

"Using macros" is an engineering problem—expanding context, improving accuracy, stacking more layers. "Creating macros" is a cognitive problem—discerning a pattern from chaos that can be given a name.

All current LLM scaling is happening at the level of "using macros." The real breakthrough—whether it's called AGI, JEPA, a world model, or something else—will occur the day AI starts creating its own macros.

Epilogue: You Reading This Article is Compressing

Freedman probably spent a year writing that paper. I spent about eight hours writing this piece, including researching and making graphics. You read it in about twenty minutes.

One year → eight hours → twenty minutes.

With every compression, there is loss. But with every compression, there is also gain—you can take away a new way of seeing the world in just twenty minutes.

A few days after reading, you'll probably only remember a few key terms: compression, macro, Lego and braids, MathLib, create macros, not use macros.

This is yet another compression.

If these key terms can still be called upon later when you encounter other problems—learning a new field, reading a paper, training your own model, mentoring a student, or even just thinking about something—then it means they have become new macros in your brain.

You are also doing what Freedman described.

Mathematicians, programmers, writers, teachers, students—all who "work with their minds" are doing the same thing every day: compressing the world's complexity into a usable, short name.

The next time someone asks you "What is intelligence?"—you can answer differently.

It's not "processing information." Not "pattern recognition." Not "deep learning." It is:

Finding a shorter description.
— Compression is all you need.

Next time, we return to the final stop of the "Seeing Physics" series—Symmetry. Noether's theorem, Yang Chen-Ning, the skeleton of the universe. Symmetry and compression are twin sisters—where there is symmetry, there is conservation; where there is conservation, there is a compressible description.

— So, we are actually still in the same story.

This article was first published on the "AI Learning Notes" blog: https://Jason-Azure.github.io/ai-blog/posts/compression-is-all-you-need/
WeChat Public Account: AI-lab学习笔记
Reference: Freedman, Compression is all you need: Modeling Mathematics, arXiv 2603.20396 (2026-03)