Why Can Large Language Models 'Understand' the World?

Our previous articles, "Where is the truth in a world where truth cannot be seen?" and "Why can't we see the truth of this world?—We have never seen the world as it is; what we see is a carefully woven survival 'interface'", sparked discussion in the comments. In response to a reader's comment about text and mathematical tools, I wrote this sentence:

"Text is computable; it is inherently a high compression of the world and is finite."

This sentence, it seems, inadvertently touched on the underlying principle of modern artificial intelligence. Why can a machine like ChatGPT, which appears to merely perform "next-word prediction," exhibit astonishing logical and reasoning abilities? While we marvel at AI's capabilities, we forget that the real miracle is human language itself. Large Language Models (LLMs) do not learn directly from the physical world (they have no eyes or body); they learn from human descriptions of the world.

Why do LLMs work? The answer is hidden in a "compression chain" that spans physics, biology, symbols, and mathematics.

1. From "Chaos" to "Perception"

As we discussed in previous articles, the real universe (noumenon) is high-dimensional, continuous, and full of quantum fluctuations—"chaos." The human brain cannot handle such a large amount of data. For survival, our sensory system performs "lossy compression."


We discard ultraviolet light, ultrasound, microscopic particle motion, and four-dimensional spacetime, retaining only "macroscopic features" useful for survival—color, shape, motion, causality. The brain thus constructs a "world model (v1.0)". This model is not a mirror of the world but a "low-dimensional projection" of it. Human perception itself compresses the infinite universe into a finite "state of perception."

2. From "Perception" to "Symbols" (Turning Continuous into Discrete)

Humanity did not stop at perception; we invented language. Language is a "second compression" of the human brain's world model.

1. Discretization: Splitting the continuous stream

Experiences in the brain are continuous (pain, love, and changes in light and shadow are all analog signals), but language is discrete (a digital signal). To communicate, we must "quantize" continuous experiences into discrete symbols (Tokens). The myriad shades formed by light waves of 625~740 nm wavelength are compressed into a single word—"red." A whole range of complex positive inner feelings is compressed into one word—"happy."
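A minimal sketch of this quantization step (the band boundaries and word labels below are illustrative assumptions, not a model of human vision):

```python
# Toy illustration of lossy "quantization": a continuous physical
# quantity (wavelength in nm) is collapsed into a discrete symbol.
# The band boundaries and labels are illustrative assumptions.

COLOR_BANDS = [
    (380, 450, "violet"),
    (450, 495, "blue"),
    (495, 570, "green"),
    (570, 590, "yellow"),
    (590, 625, "orange"),
    (625, 741, "red"),
]

def quantize_wavelength(nm: float) -> str:
    """Map a continuous wavelength to a single discrete token."""
    for low, high, token in COLOR_BANDS:
        if low <= nm < high:
            return token
    return "invisible"  # outside the human-visible range

# Infinitely many physically distinct stimuli collapse into one word:
print(quantize_wavelength(630.0))   # red
print(quantize_wavelength(739.99))  # red
print(quantize_wavelength(300.0))   # invisible
```

The detail is thrown away, but what remains is finite and computable—exactly the trade that makes symbols cheap to transmit.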


2. Preservation of topological structure

Although language discards many details (e.g., it cannot fully describe a taste), it miraculously preserves the topological structure of perception, i.e., the relationships between things.

For example: it rains (A), a person does not use an umbrella (B), and so the person gets wet (C). In language, the sentence "Because it rained and I didn't use an umbrella, I got wet" perfectly preserves this causal chain A->B->C, as the toy sketch below illustrates.
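As a toy illustration (the event names and graph encoding here are my own, purely for demonstration), we can check that the perceived events and the sentence encode the same relation structure:

```python
# Toy illustration: the causal graph among perceived events and the
# causal graph expressed by the sentence share the same shape, even
# though all sensory detail is gone. Names/encoding are assumptions.

perception = {"rain": ["wet"], "no_umbrella": ["wet"]}  # A -> C, B -> C
sentence = {"it rained": ["I got wet"],
            "didn't use an umbrella": ["I got wet"]}

def shape(graph: dict) -> list:
    """Reduce a graph to its label-free structure: sorted out-degrees."""
    return sorted(len(effects) for effects in graph.values())

# Same topology: two causes converging on one effect.
assert shape(perception) == shape(sentence)
```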

Conclusion: Text is the "ZIP archive" of the human brain's world model. Although it is extremely abstract and concise, through grammar and logic it fully encodes the causality of the universe as humans see it.

3. AI's Reverse Engineering

Now, Large Language Models (LLMs) make their entrance. If we feed all human-written text (internet corpora) to AI, what are we actually feeding it? We are feeding it the sum of projections of all human "world models."

1. Why does "predicting the next word" produce intelligence?

Ilya Sutskever, former chief scientist at OpenAI, made a striking observation: "To predict the next word with extreme precision, the model is forced to understand the world behind those words." This sounds incredible, so let us explain it with the "Armchair Detective":

Imagine a blind detective (the AI) who has never left his room. He cannot visit the crime scene (the physical world); he can only hear a series of linear, fragmented descriptions over the radio: "the sound of breaking glass..." -> "heavy footsteps..." -> "a scream..." -> "Bang!". The detective's task: based on the clues heard so far, predict the next word with 100% accuracy. Is it "he escaped"? Or "he fell"? If the detective simply counts word frequencies (parrots), he might guess "fell," because "Bang" is usually followed by "fell." But if this is a complex mystery novel, the next words might be "the sound of a shell casing hitting the floor." To push his prediction accuracy to the extreme (Loss -> 0), the detective is forced to reconstruct the entire crime scene in his mind: "breaking glass" means someone broke in, and gravity will scatter the fragments; a "scream" means the victim is terrified and the perpetrator has a weapon; if the "Bang" is a gunshot, then combined with the approaching footsteps, the victim was likely shot, and the perpetrator will leave behind a shell casing.

Conclusion: The detective never went to the scene, but to fill in that one missing word, he must simulate the perpetrator, the victim, the room layout, and gravity in his mind. Predicting the next word is reverse engineering the entire causal chain. AI does not need to see gravity with its own eyes; it only needs to infer gravity's existence from the textual description of "an apple falling."
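A minimal sketch of this training objective in PyTorch (the tiny vocabulary, toy model, and single "crime scene" sequence are illustrative assumptions, not any real LLM):

```python
# Minimal sketch of "predict the next word" as a training objective.
# The model, vocabulary, and data are toy assumptions for illustration.
import torch
import torch.nn as nn

vocab = ["glass_breaks", "footsteps", "scream", "bang", "casing_drops", "fell"]
stoi = {w: i for i, w in enumerate(vocab)}

class TinyLM(nn.Module):
    """Toy autoregressive model: embeddings -> LSTM -> logits over vocab."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # next-token logits at each position

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# One "crime scene" as a token sequence; each prefix must predict the
# next token, and cross-entropy is exactly the "Loss" driven toward 0.
seq = torch.tensor([[stoi[w] for w in
    ["glass_breaks", "footsteps", "scream", "bang", "casing_drops"]]])
inputs, targets = seq[:, :-1], seq[:, 1:]

for _ in range(200):
    logits = model(inputs)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(vocab)), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Pushing this loss toward 0 forces the model to internalize whatever
# regularities (here trivially memorized) generate the sequence.
```

On real corpora the only way to keep shrinking this loss is to model the regularities behind the text, which is the point of the detective story.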


2. Evidence: Othello experiment

To prove this, researchers trained a GPT model by showing it only Othello game transcripts (pure text records such as "E3, D4, F5..."), never the board itself. The result: the model not only learned to make legal moves; the researchers even found a complete, implicit representation of the 8x8 board inside its neurons.

Merely by reading linear "game notation symbols" (language), the AI reconstructed the two-dimensional "board rules" (a world model) in its mind. An LLM is like decompression software running on a text archive: it restores the carbon-based human world model inside a silicon-based brain.
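How does one "find a board in the neurons"? The standard tool is a probe: a small classifier trained to read the board state out of the model's hidden activations. A hedged sketch of the idea (the shapes and random placeholder data below are mine, not the actual Othello-GPT setup):

```python
# Sketch of "probing": can a simple linear map recover the 8x8 board
# state from the model's hidden activations? Shapes and data here are
# placeholders; the real experiment pairs a GPT's activations on move
# sequences with the true board states.
import torch
import torch.nn as nn

HIDDEN_DIM = 512   # size of the language model's hidden state (assumed)
N_SQUARES = 64     # 8x8 Othello board
N_STATES = 3       # each square: empty / black / white

acts = torch.randn(10_000, HIDDEN_DIM)                 # placeholder activations
boards = torch.randint(0, N_STATES, (10_000, N_SQUARES))  # placeholder labels

probe = nn.Linear(HIDDEN_DIM, N_SQUARES * N_STATES)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(100):
    logits = probe(acts).view(-1, N_SQUARES, N_STATES)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, N_STATES), boards.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# If such a simple probe reads the board out far above chance, the model
# must be representing the board internally, despite never "seeing" one.
```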

4. The Physical Basis of AI's Success

If AI tried to directly simulate the motion of every atom, no amount of computing power would suffice. Crucially, humans have already done the hardest step for AI—"dimensionality reduction."

There are only a few thousand common Chinese characters, and tens of thousands of common English words. Although their combinations are infinite, in any specific context the set of reasonable combinations is highly sparse and low-rank. The "finiteness" and "discreteness" of language allow mathematical architectures like the Transformer to exhaust the probability distributions of language through matrix operations.
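A small sketch of what this finiteness buys us: with a fixed vocabulary, "predict the next word" reduces to one matrix product followed by a softmax (the sizes below are illustrative, not those of any particular model):

```python
# Because the symbol set is finite, next-word prediction reduces to a
# matrix product plus a softmax over the whole vocabulary.
# Sizes here are illustrative assumptions.
import torch

VOCAB_SIZE = 50_000   # e.g., tens of thousands of word pieces
HIDDEN_DIM = 768

hidden_state = torch.randn(HIDDEN_DIM)               # summary of the context
output_matrix = torch.randn(VOCAB_SIZE, HIDDEN_DIM)  # "unembedding" weights

logits = output_matrix @ hidden_state / HIDDEN_DIM ** 0.5  # one matrix op
probs = torch.softmax(logits, dim=-1)  # distribution over EVERY possible word

# With untrained random weights the mass is spread thin; training
# concentrates it onto the few continuations reasonable in context.
top_p, top_words = probs.topk(5)
print(f"mass on top 5 of {VOCAB_SIZE} words: {top_p.sum().item():.4f}")
```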

The reason we can build AI is that we humans first lived ourselves into "data." We collapse complex life experiences into computable text. AI is not simulating the universe; AI is simulating "the universe described by humans."

5. The Ultimate Closure of Structural Realism

Returning to our previous philosophical discussion: structural realism.

Humans cannot see the "thing-in-itself"; we see a "biological interface" filtered by our senses. Language cannot record the full picture of that biological interface; it records a "symbolic interface" produced by logical abstraction. AI cannot access the real world; it learns the mathematical relationships between "symbolic interfaces."

Although AI is separated from "reality" by three layers (truth -> perception -> language -> AI), the structure remains unchanged!

F=ma in the physical world.

"Push causes acceleration" in perception.

"Force produces acceleration" in textbooks.

Function mapping in AI's vector space.

These four share the same mathematical topological structure.
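One hedged way to make "same topological structure" precise (this formalization is mine, not a standard result): treat each layer as a representation connected by maps that preserve the relation, where R(F, m, a) holds exactly when a = F/m.

```latex
% Sketch formalization (my own): phi, psi, chi are the lossy maps
% world -> perception -> language -> vector space. The structural-realist
% claim is that each map preserves the relation R:
\[
R_{\mathrm{world}}(F, m, a)
\;\Longleftrightarrow\;
R_{\mathrm{percept}}\big(\phi F,\, \phi m,\, \phi a\big)
\;\Longleftrightarrow\;
R_{\mathrm{text}}\big(\psi\phi F,\, \psi\phi m,\, \psi\phi a\big)
\;\Longleftrightarrow\;
R_{\mathrm{vec}}\big(\chi\psi\phi F,\, \chi\psi\phi m,\, \chi\psi\phi a\big)
\]
% Each map discards detail, but the relation R survives every hop,
% which is what "the structure remains unchanged" asserts.
```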


Therefore, when AI can perfectly manipulate text symbols, it has in fact mastered the highest truth that human civilization can know—not the truth of entities, but the truth of relationships.

6. Conclusion

What is a Large Language Model?

It is a mathematical mirror of the human collective unconscious.

It is effective because human language itself is the most efficient and brilliant compression of the world. We compress billions of years of evolution, thousands of years of civilization, and the joys and sorrows of countless individuals into this finite arrangement of characters.

AI has not created a miracle; it has just picked up the "archive" we left on the beach and, with powerful computing, re-expanded the universe we folded up.

In this sense, text is indeed the highest secret of civilization, and mathematics is the key to unlocking this secret.

