In 1763, a formula lay hidden in the manuscripts of a deceased English pastor. 263 years later, that formula became the mathematical backbone for training GPT: Prior = Pre-training, Likelihood = Data, Posterior = Fine-tuning. Bayes' Theorem is more than just a formula: it's a way of thinking that combines old knowledge with new evidence. And this is precisely how AI learns.
An Unfinished Paper by a Pastor
In 1761, in the English town of Tunbridge Wells, a 59-year-old Presbyterian minister passed away.
His name was Thomas Bayes.
His life was outwardly unremarkable: he preached in a small chapel, studied mathematics on the side, and published very little during his lifetime. After his death, his friend Richard Price discovered an unfinished manuscript while sorting through his belongings.
After reading it, Price realized: this manuscript could change the way humanity understands the world.
In 1763, Price organized and published this posthumous manuscript in the Royal Society's Philosophical Transactions. The title was modest: "An Essay towards solving a Problem in the Doctrine of Chances".
263 years later, the core idea from this paper became one of the mathematical skeletons for all modern AI, including GPT, BERT, and Stable Diffusion.
What Bayes never imagined: the formula he derived to solve a gambling problem eventually taught machines how to learn.
One: A Counterintuitive Question
Before explaining Bayes' Theorem, let me ask you a question.
The Medical Testing Paradox
Imagine a rare disease affecting only 1 in 1,000 people (a prevalence of 0.1%).
There is a highly accurate test for it:
- If you actually have the disease, the test shows positive with a probability of 99% (sensitivity).
- If you do not have the disease, the test shows negative with a probability of 99% (specificity).
You take the test, and the result is positive.
Question: What is the probability you actually have the disease?
Most people's immediate reaction: "99%! The test is so accurate!"
Intuition tells you that you almost certainly have the disease.
But the correct answer is: roughly 9%.
You didn't misread that. Even with a 99% accurate test, a positive result only means you have about a one in ten chance of actually having the disease.
Why? Let's do the math.
Crunching the Numbers: What Happens in a Population of 10,000
10,000 people get tested
│
├── 10 people have the disease (prevalence 0.1%)
│   ├── 9.9 people → Test positive (True positive, sensitivity 99%)
│   └── 0.1 people → Test negative (Missed diagnosis)
│
└── 9,990 people do not have the disease
    ├── 99.9 people → Test positive (False positive, 1% error rate)
    └── 9,890.1 people → Test negative (Correctly ruled out)
Total positive results = 9.9 + 99.9 = 109.8 people
Of those, who actually have the disease = 9.9 people
Probability of actually having the disease = 9.9 / 109.8 ≈ 9.0%
Key Insight: Although the false positive rate is only 1%, because the number of people without the disease (9,990) is vastly greater than those with it (10), 1% of the 9,990 healthy people (≈100 people) still far exceeds the 10 people who are truly sick.
Where did your intuition go wrong?
You ignored a crucial piece of information—the disease itself is very rare (0.1% prevalence). The probability that you had the disease was already very low before you took the test. A positive test result "upgrades" this low probability, but it doesn't flip it to 99%.
This is the core problem Bayes' Theorem solves: When you get new evidence, how should you update your original beliefs?
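The population count above can be reproduced in a few lines. This is a minimal Python sketch rather than a clinical tool; the function name `posterior_given_positive` and its parameter names are illustrative choices, while the three input numbers come straight from the example.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' Theorem."""
    true_positive = sensitivity * prevalence                # sick and flagged
    false_positive = (1 - specificity) * (1 - prevalence)   # healthy but flagged
    return true_positive / (true_positive + false_positive)


p = posterior_given_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"P(disease | positive) = {p:.1%}")  # prints "P(disease | positive) = 9.0%"
```

Raising `prevalence` to 0.5 in the same call pushes the answer to 99%, which is the number intuition jumps to when it forgets how rare the disease is.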
A Bayesian breakdown of the Medical Testing Paradox: Prior × Likelihood → Posterior
Two: Bayes' Theorem – Summed Up in Three Steps
Bayes' Theorem
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
In simpler terms:
$$\text{New belief} = \frac{\text{Evidence's strength} \times \text{Old belief}}{\text{How common the evidence is}}$$
Explaining the Three Roles with an Everyday Scenario
The formula might look intimidating, but you use it every day—your brain just does the math automatically. Let me use an example to explain the three roles.
Scenario: You wake up in the morning and hear a "whooshing" sound outside your window. Is it raining?
Three roles, one story
① Prior – Before hearing the sound, how likely did you think it was raining?
You checked the weather forecast last night, which said it would be sunny. So you think to yourself: "There's maybe a 10% chance of rain."
This is the prior — your judgment based on existing knowledge, before seeing any evidence.
② Likelihood – If it was really raining, how likely is it that you would hear the "whooshing" sound?
If it's actually raining outside, the probability you'd hear a whooshing sound is high—say, 90% (it's possible the rain is too light to hear).
But note: if it's not raining outside, you might also hear a whooshing sound—a neighbor watering plants, someone washing a car upstairs—with a probability of about 20%.
Likelihood measures: If this thing were true, how reasonable is the evidence I'm seeing?
③ Posterior – After considering everything, what is the probability that it's raining?
$$P(\text{rain} \mid \text{whooshing}) = \frac{P(\text{whooshing} \mid \text{rain}) \times P(\text{rain})}{P(\text{whooshing})} = \frac{0.9 \times 0.1}{0.9 \times 0.1 + 0.2 \times 0.9} = \frac{0.09}{0.27} \approx 33\%$$
The evidence (the whooshing sound) pulled your belief from 10% up to 33%, but not all the way to 90%. That's because your prior (the weather forecast said sunny) is pulling from the other end.
Key Intuition: The posterior is the result of a "tug-of-war" between the prior and the likelihood. If the prior is very strong (the forecast is spot-on), the evidence needs to be very strong to overturn it. If the prior is weak (you have no clue about the weather), a small bit of evidence can dominate your belief.
This is why the medical testing example is so surprising — the prior was so low (0.1%), that even with a very high likelihood (99%), the posterior was only 9%. The prior won the tug-of-war.
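The tug-of-war can be made concrete by holding the evidence fixed and varying the prior. The sketch below is only an illustration: the function name `update` and the list of priors are assumptions, while the 90% and 20% figures are the ones from the rain scenario.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior belief in a hypothesis after seeing one piece of evidence."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)  # P(evidence)
    return numerator / evidence


# Same whooshing sound (90% if raining, 20% if not), different forecasts.
for prior in (0.01, 0.10, 0.50, 0.90):
    posterior = update(prior, 0.9, 0.2)
    print(f"prior {prior:.0%} -> posterior {posterior:.0%}")
```

The 10% prior reproduces the 33% from the rain example. A forecast that made rain a coin toss would let the same sound push the belief past 80%.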
The Four Roles of the Bayes Formula
Let me break down each part formally:
| Symbol | Name | Medical Testing Example | Intuitive Explanation |
|---|---|---|---|
| P(A) | Prior Probability (Prior) | Disease prevalence = 0.1% | Your belief in A before seeing any evidence |
| P(B\|A) | Likelihood | Having disease → testing positive = 99% | If A is true, the probability of seeing evidence B |
| P(B) | Marginal Probability (Evidence) | Overall test positivity rate ≈ 1.1% | The probability of seeing B, regardless of A being true or not |
| P(A\|B) | Posterior Probability (Posterior) | Testing positive → actually having the disease ≈ 9% | Your updated belief in A, after seeing evidence B |
Validating with the medical test:
$$P(\text{disease} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{disease}) \times P(\text{disease})}{P(\text{positive})} = \frac{0.99 \times 0.001}{0.011} \approx 0.09 = 9\%$$
A perfect match.
Bayesian Updating: Each Piece of Evidence "Sharpens the Focus"
The most powerful aspect of Bayes' Theorem is that it can be used repeatedly. The posterior from the last round becomes the prior for the next round—your belief becomes more and more precise, driven forward by each new piece of evidence.
Bayesian Updating: With each new piece of evidence seen, the belief distribution becomes "sharper"
The animation above shows a simple example: you have a coin and don't know if it's fair. At first, you know nothing (a flat prior), then you flip the coin and get new data each time — with every new piece of evidence, your belief distribution goes from "wide and flat" to "narrow and sharp", and you become more and more certain of the coin's true bias.
This process is like focusing a camera — at first, the picture is blurry (high uncertainty). Each piece of evidence twists the focus ring, and the picture gradually becomes clearer.
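The coin example can be sketched with a simple grid over the coin's possible bias. This is one way to implement repeated updating, not the only one; the flip sequence is made-up data, and the names `bayes_update` and `mean_and_width` are illustrative.

```python
def bayes_update(belief, grid, heads):
    """One round of updating: the previous posterior acts as the prior."""
    unnormalized = [b * (p if heads else 1 - p) for b, p in zip(belief, grid)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mean_and_width(belief, grid):
    """Mean and standard deviation of the belief over the coin's bias."""
    mean = sum(b * p for b, p in zip(belief, grid))
    variance = sum(b * (p - mean) ** 2 for b, p in zip(belief, grid))
    return mean, variance ** 0.5


grid = [i / 100 for i in range(101)]        # candidate values for P(heads)
belief = [1 / len(grid)] * len(grid)        # flat prior: every bias equally likely
flips = [True, True, False, True, True, True, False, True, True, True]  # made-up data

initial_width = mean_and_width(belief, grid)[1]
for heads in flips:
    belief = bayes_update(belief, grid, heads)  # posterior becomes the next prior
final_mean, final_width = mean_and_width(belief, grid)
print(f"width {initial_width:.2f} -> {final_width:.2f}, mean {final_mean:.2f}")
```

After these ten flips (8 heads, 2 tails) the belief is centred near 0.75 and is less than half as wide as the flat prior. A single surprising flip can briefly widen the distribution, but more data narrows it overall.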
But the deeper meaning of Bayes' Theorem isn't in this calculation—it describes a way of thinking:
Take your old knowledge (prior), embrace new evidence (likelihood), and update your beliefs (posterior).
That's all Bayes' Theorem really is.
Three: Bayesian vs. Frequentist – A 260-Year War
For over two hundred years after Bayes' paper was published, the world of statistics was split into two camps:
Two Views of Probability
| | Frequentist | Bayesian |
|---|---|---|
| What probability is | The frequency of an event over a large number of trials | The degree of belief in an event |
| "This coin has a 50% chance of heads" means | If you flip it infinitely many times, the proportion of heads approaches 50% | I believe heads and tails are equally possible |
| What is a parameter? | A fixed, unknown constant | A random variable that updates with evidence |
| Core method | Maximum Likelihood Estimation (MLE) | Posterior Inference |
| On prior knowledge | Rejects it — "Subjective things have no place in science" | Embraces it — "Not using prior knowledge is wasteful" |
| Key figures | Fisher, Neyman, Pearson | Bayes, Laplace, Jaynes |
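The "Core method" row can be illustrated with a coin that lands heads three times in a row. The sketch below assumes a uniform Beta(1, 1) prior for the Bayesian estimate; that prior, and the function names, are illustrative choices rather than anything prescribed by either school.

```python
def mle(heads, tails):
    """Frequentist point estimate: the observed frequency of heads."""
    return heads / (heads + tails)


def posterior_mean(heads, tails, prior_heads=1, prior_tails=1):
    """Bayesian estimate: mean of the Beta posterior under a Beta prior."""
    return (heads + prior_heads) / (heads + tails + prior_heads + prior_tails)


for heads, tails in [(3, 0), (30, 0), (300, 0)]:
    print(heads, tails, mle(heads, tails), round(posterior_mean(heads, tails), 3))
```

With three flips, maximum likelihood declares the coin certain to land heads (1.0), while the posterior mean stays at 0.8. As the data grows, the two estimates converge and the prior matters less.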
This debate went on for more than two centuries. The frequentist school dominated for a long time—because a "subjective prior" didn't sound scientific enough.
But starting in the 2010s, the rise of deep learning quietly changed everything.
Because what AI does is, in essence, Bayesian updating.
Four: AI Training = Bayesian Updating
This is the most important section of this article.
Prior = Pre-training
GPT was trained on trillions of tokens of internet text. After training, its billions of weights (parameters) store "world knowledge"—grammar rules, common sense reasoning, literary allusions, scientific facts...
This knowledge is the prior — what the model already "believes" before seeing your specific question.
Post-pretraining weights = P(θ) = The prior distribution
Likelihood = New Data
When you fine-tune a model on a specific domain's data (like medical literature, legal texts, or your company's internal documents), you give the model new evidence.
Domain data = P(D|θ) = The likelihood function
The likelihood function asks: "If the model's parameters are θ, what's the probability it would generate this new data?"
Posterior = The Fine-tuned Model
The goal of fine-tuning is to find a set of parameters that allows the model to retain its general pre-training knowledge while adapting to the new domain:
$$P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)}$$
$$\text{Fine-tuned model} = \frac{\text{New data's requirements for parameters} \times \text{Pre-training knowledge}}{\text{Normalization constant}}$$
The Bayesian Nature of AI Training
| Bayes Formula | AI Training Flow |
|---|---|
| Prior P(θ) | Pre-trained weights (general knowledge from trillions of tokens) |
| Likelihood P(D\|θ) | Fine-tuning data (domain/task-specific data) |
| Posterior P(θ\|D) | The fine-tuned model |
| Prior × Likelihood → Posterior | Pre-training + Fine-tuning → Specialized Model |
This is more than a loose metaphor: under the right assumptions, the correspondence is mathematical.
You might say, "Wait, no one is actually computing Bayes' formula during training. They use SGD (Stochastic Gradient Descent), right?"
That's correct. The actual training algorithm doesn't directly calculate the posterior distribution—the parameter space is too large for exact Bayesian inference to be computationally feasible. SGD is an approximation method. But mathematically, this approximation can be understood as a special case of Bayesian inference.
Especially when training includes regularization (L2 regularization / weight decay) —
$$\text{Loss} = \text{Cross-entropy} + \lambda \sum_i \theta_i^2$$
The probabilistic interpretation of this regularization term is precisely that it places a Gaussian prior on the parameters:
$$P(\theta) = \mathcal{N}(0, \sigma^2) \propto e^{-\theta^2 / (2\sigma^2)}$$
— it's leaning towards the idea that parameters should be close to zero (a simpler model) and shouldn't be too extreme.
Regularization = Prior. When you add a penalty term to the loss function to prevent overfitting, you're essentially saying: "I believe a priori that simpler models are more likely to be correct." This is the mathematical expression of Occam's razor.
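The link between the penalty and the prior can be written out in a few lines. As a sketch, assume the cross-entropy term is the negative log-likelihood summed over the training data, and that each parameter has an independent zero-mean Gaussian prior with variance σ². Both are standard assumptions, but neither is stated in exactly this form above. Maximizing the posterior then gives:

$$
\begin{aligned}
\theta_{\text{MAP}} &= \arg\max_{\theta} \; P(D \mid \theta)\, P(\theta) \\
&= \arg\min_{\theta} \; \big[ -\log P(D \mid \theta) - \log P(\theta) \big] \\
&= \arg\min_{\theta} \; \Big[ \text{Cross-entropy} + \frac{1}{2\sigma^2} \sum_i \theta_i^2 \Big]
\end{aligned}
$$

Under these assumptions λ = 1/(2σ²): a narrow prior (small σ) corresponds to a strong penalty, and a wide prior to a weak one. If the loss is averaged over N examples instead of summed, the constant picks up an extra factor of 1/N.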
Five: In-Context Learning – A Real-Time Version of Bayes' Theorem
The most astonishing finding in the 2020 GPT-3 paper wasn't the model's size, but a phenomenon called In-Context Learning (ICL):
You don't need to fine-tune the model. Just by providing a few examples in the prompt, the model can "learn" a new task.
For example:
Input: happy → joyful
Input: sad → sorrowful
Input: beautiful → ?
Output: gorgeous
You didn't change a single parameter in the model. Yet it "learned" the task: replace each word with a more vivid synonym.
Seen through a Bayesian framework, this becomes startlingly clear:
A Bayesian Explanation of In-Context Learning
Pre-training Knowledge (Prior):
- The model knows English vocabulary.
- It knows that "give a synonym" is one of many possible tasks.
- It has seen many examples of such word pairs.
Examples in Prompt (Likelihood/Evidence):
- happy → joyful ← "This looks like a word-mapping task"
- sad → sorrowful ← "And it maps each word to a more vivid synonym"
Bayesian Update (Posterior):
P(task = synonym substitution | the examples I've seen) → Very high
So, beautiful → gorgeous
In 2022, Xie et al., in their paper "An Explanation of In-context Learning as Implicit Bayesian Inference", showed that in a simplified setting, where pre-training documents are generated from a mixture of latent concepts, In-Context Learning can be understood as implicit Bayesian inference: the model uses the prompt to infer which latent concept the examples share.
With each new example it sees, the model performs an implicit Bayesian update—making the posterior probability of "what is this task?" sharper and more certain.
This is exactly what your brain does. When you walk into an unfamiliar city and see the first street sign is in English, you start to assume you might be in the US or UK. Seeing a second English sign makes the assumption stronger. By the third one—you're already very sure. You didn't "retrain" your brain, but your belief updated.
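The "which task is this?" update can be sketched with a toy example. Everything here is hypothetical: three candidate tasks on numbers stand in for the enormous space of tasks a real model might consider, and the `noise` parameter is an assumed allowance for imperfect examples. It illustrates the idea only and says nothing about how a Transformer computes internally.

```python
# Hypothetical candidate tasks the "model" considers possible a priori.
TASKS = {
    "add_one": lambda x: x + 1,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
}


def update(belief, example, noise=0.05):
    """Reweight each task by how well it explains one (input, output) example."""
    x, y = example
    unnormalized = {
        name: p * ((1 - noise) if TASKS[name](x) == y else noise)
        for name, p in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}


belief = {name: 1 / len(TASKS) for name in TASKS}  # flat prior over tasks
for example in [(2, 4), (3, 6)]:                   # the examples in the "prompt"
    belief = update(belief, example)
    print(example, {name: round(p, 3) for name, p in belief.items()})
```

The first example (2 → 4) fits both "double" and "square", so the belief splits between them. The second (3 → 6) settles it in favour of "double" without a single weight being changed.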
Six: Every Prediction Step in an LLM is Bayesian
Let me push this connection even further.
The process by which an LLM generates text—predicting the next token one by one—is itself a Bayesian process.
$$P(w_{t+1} \mid w_1, w_2, \dots, w_t)$$
- Prior: The linguistic rules, semantics, and world knowledge the model learned during pre-training.
- Likelihood: The contextual information provided by the previously generated tokens.
- Posterior: The probability distribution over the next token, given all the preceding context.
Each time a new token is generated, the context grows by one, adding one more piece of "evidence"—and the model's prediction of the subsequent content becomes more precise.
Text Generation = Step-by-step Bayesian Updating
[Start]
  The prior distribution is very "wide": the next word could be anything.
"Today"
  Posterior update → Highly likely to be about time/weather/an event.
"Today the weather"
  Posterior update → Almost certainly a weather description.
"Today the weather is"
  Posterior update → A high probability for "nice", a lower one for "bad", maybe "cold".
"Today the weather is nice" ✓
  The word with the highest posterior probability gets chosen.
Every single step is: Old belief (prior) + New evidence (the latest token) → Updated belief (posterior).
If you've read about "Probability in LLMs," you know the core of an LLM is predicting the probability distribution for the next word. Now you know: the mathematical essence of this probability distribution is the Bayesian posterior.
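The step-by-step conditioning can be imitated with plain counting. The sketch below is a deliberately crude stand-in: a real LLM computes the distribution with a neural network, not a lookup, and the five-sentence corpus is invented for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; a real LLM learns from trillions of tokens.
CORPUS = [
    "today the weather is nice",
    "today the weather is nice",
    "today the weather is cold",
    "today the market is closed",
    "tomorrow we travel",
]


def next_token_distribution(context):
    """Estimate P(next token | context) by counting sentences with that prefix."""
    prefix = context.split()
    counts = Counter()
    for sentence in CORPUS:
        tokens = sentence.split()
        if tokens[:len(prefix)] == prefix and len(tokens) > len(prefix):
            counts[tokens[len(prefix)]] += 1
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


for context in ["", "today", "today the", "today the weather is"]:
    print(repr(context), next_token_distribution(context))
```

Each additional token of context filters the set of compatible sentences, and the distribution over the next word is recomputed from what remains. Given "today the weather is", the counts favour "nice" over "cold" two to one.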
Seven: Bayes and Shannon – The Convergence of Two Hidden Lines
If you've read about "What Shannon Never Imagined" and "Information Theory: A Hidden Line from Telegraphs to GPT," you might have already sensed it—
Bayes and Shannon are talking about two sides of the same thing.
Shannon vs. Bayes: Two Sides of the Same Coin
| | Shannon (Information Theory) | Bayes (Probability Theory) |
|---|---|---|
| Core Question | How short can data be compressed? | How does evidence change belief? |
| Core Concept | Entropy H = −∑ p·log(p) | Posterior P(A\|B) = P(B\|A)·P(A)/P(B) |
| Training Objective | Minimize cross-entropy (compress data as well as possible) | Maximize posterior probability (find the most plausible parameters) |
| Explaining LLMs | An LLM is a compressor | An LLM is a Bayesian inference machine |
| Explaining Pre-training | Compresses the patterns of internet text | Extracts prior knowledge from data |
| Explaining Overfitting | Memorized the noise, causing compression efficiency to drop | The likelihood overwhelmed the prior, making the belief too extreme |
In fact, the mathematical derivation of the cross-entropy loss function can arrive at the same destination from two different paths:
- Shannon's Path: Minimize the KL divergence between the predicted distribution and the true distribution → Cross-entropy
- Bayes' Path: Maximize the log-likelihood of the data → The negative of cross-entropy
Minimizing Cross-entropy ≡ Maximizing Log-likelihood ≡ An approximation of Bayesian inference
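The first equivalence is easy to check numerically. In this sketch the predicted distributions and labels are made-up values, and the function names are illustrative; the point is only that the two paths compute the same number.

```python
import math

# Hypothetical model outputs: one probability distribution per example.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 2]  # index of the correct class for each example


def mean_cross_entropy(predictions, labels):
    """Shannon's path: average of -sum(t * log q) against one-hot targets t."""
    total = 0.0
    for probs, label in zip(predictions, labels):
        one_hot = [1.0 if k == label else 0.0 for k in range(len(probs))]
        total -= sum(t * math.log(q) for t, q in zip(one_hot, probs))
    return total / len(labels)


def log_likelihood(predictions, labels):
    """Bayes' path: log-probability the model assigns to the observed labels."""
    return sum(math.log(probs[label]) for probs, label in zip(predictions, labels))


ce = mean_cross_entropy(predictions, labels)
nll = -log_likelihood(predictions, labels) / len(labels)
print(abs(ce - nll) < 1e-12)  # True: the two paths give the same quantity
```

Because the targets are one-hot, every term except the one for the correct class vanishes. Cross-entropy therefore reduces to the average negative log-likelihood.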
In "The Cross-Entropy Loss Function," we derived -log(p) from Shannon's axioms. Now from the other side, you see the same formula — -log(p) is both the "degree of surprise" (Shannon's view) and the "force with which the data opposes the current model" (Bayes' view).
Shannon tells you "compression is understanding." Bayes tells you "updating is learning." An LLM does both simultaneously.
Eight: Our Brains Are Also Bayesian Machines
Bayes' theorem isn't just a theoretical tool for AI—more and more neuroscience research shows that the human brain also uses Bayesian inference to perceive the world.
Visual Illusions: Your Brain Doing Bayes
Have you ever seen those classic optical illusions? Two lines of the same length, with arrowheads pointing in different directions, and you perceive one as longer than the other (the Müller-Lyer illusion).
Why? Because your brain isn't "seeing"—it's inferring:
Light signals on the retina (likelihood) + Past visual experience (prior) → The image you "see" (posterior)
Based on past experience (the prior), your brain "expects" a line with outward-pointing arrowheads to be further away, and therefore longer. Even though the light signals tell you they're the same length, the strength of the prior still influences your perception.
An optical illusion is, in essence, your prior overpowering the likelihood under specific circumstances.
Language Comprehension: Also Bayesian
When you hear an ambiguous sentence—like someone saying something in a noisy bar, and you only catch 70% of it—how does your brain "fill in" the missing 30%?
Sound snippets heard (likelihood) + Linguistic knowledge and context (prior) → The sentence you understand (posterior)
This is why, if you only catch "The weather to...", your brain is already filling in "today" or "tomorrow".
An LLM's next-token prediction and your brain are doing the exact same thing.
Karl Friston (the proposer of the Free Energy Principle) goes even further. He posits that all brain functions (perception, action, learning, planning) can be described by a unified Bayesian framework: the brain is constantly minimizing "prediction error" (free energy), which amounts to approximate Bayesian inference. This idea, closely tied to the theory of Predictive Processing, is currently one of the most influential frameworks in cognitive science.
Nine: Bayes' "Impossible" Problem – The Computational Nightmare
If Bayesian inference is so great, why not just use it directly?
Because exact Bayesian inference in high-dimensional spaces is a computational hell.
Why Exact Bayesian Inference is Infeasible
The denominator of Bayes' Theorem is:
P(D) = ∫ P(D|θ) · P(θ) dθ
This means you have to integrate over all possible combinations of parameters.
- GPT-2 has 1.5 billion parameters.
- GPT-3 has 175 billion parameters.
- GPT-4 is estimated to have over 1 trillion parameters.
Calculate an integral in 175 billion dimensions? Even the crudest grid, with just 2 points per dimension, would need 2 to the power of 175 billion evaluations. That is unimaginably more than the roughly 10^80 atoms in the observable universe.
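To get a feel for the scale, one can count how many evaluations a brute-force grid would need. This is back-of-the-envelope arithmetic, not a real integration routine: it assumes a grid with only 2 points per dimension and uses the commonly quoted rough figure of 10^80 atoms in the observable universe.

```python
import math

ATOMS_IN_UNIVERSE_LOG10 = 80  # rough, commonly quoted estimate: ~10^80 atoms


def log10_grid_evaluations(n_params, points_per_dim=2):
    """log10 of the evaluations needed by a grid at this resolution."""
    return n_params * math.log10(points_per_dim)


for name, n_params in [("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    exponent = log10_grid_evaluations(n_params)
    print(f"{name}: about 10^({exponent:.3g}) evaluations "
          f"vs 10^{ATOMS_IN_UNIVERSE_LOG10} atoms")
```

The exponent itself runs into the tens of billions for GPT-3, which is why every practical method in deep learning sidesteps the integral instead of computing it.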
So, the entire history of deep learning is a history of "approximated Bayesian inference":
| Method | Bayesian Explanation | Approximation Method |
|---|---|---|
| SGD (Stochastic Gradient Descent) | Finding the Maximum A Posteriori (MAP) estimate | Only finds the peak of the posterior, ignoring the distribution's shape |
| Dropout | Model averaging | Randomly dropping neurons ≈ averaging over a large number of different models |
| L2 Regularization | Gaussian prior | Assumes parameters follow a normal distribution |
| Ensemble | Posterior sampling | Trains multiple models and takes a vote |
| Variational Inference (VI) | Approximating the posterior with a simpler distribution | Turns "solving an integral" into "solving an optimization problem" |
| MCMC | Sampling from the posterior | A random walk exploring the parameter space |
Almost every "trick" you've seen in deep learning—regularization, Dropout, learning rate scheduling, Ensemble—has a Bayesian explanation.
This is no coincidence. These tricks work precisely because they approximate, to varying degrees, the correct Bayesian inference procedure.
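The MCMC row is the easiest one to show in miniature. Below is a sketch of a Metropolis sampler for a one-parameter problem, the bias of a coin seen to land 8 heads and 2 tails under a flat prior. The data, step size, burn-in length and seed are all arbitrary illustrative choices. Note that the sampler never evaluates the normalizing integral P(D).

```python
import math
import random


def log_posterior(p, heads, tails):
    """Unnormalized log-posterior of the coin's bias under a flat prior."""
    if not 0 < p < 1:
        return -math.inf
    return heads * math.log(p) + tails * math.log(1 - p)


def metropolis(heads, tails, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept by posterior ratio."""
    rng = random.Random(seed)
    current = 0.5
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0, step)
        log_ratio = (log_posterior(proposal, heads, tails)
                     - log_posterior(current, heads, tails))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        samples.append(current)
    return samples


samples = metropolis(heads=8, tails=2)[1000:]  # drop an initial burn-in
estimate = sum(samples) / len(samples)
print(f"posterior mean of the bias: about {estimate:.2f}")
```

The exact posterior here is Beta(9, 3) with mean 0.75, so the sampled average can be checked against a known answer. In 175 billion dimensions no such check exists, and the random walk is far too slow, which is why the cheaper approximations in the table dominate in practice.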
Ten: RLHF – The Latest Incarnation of Bayesian Updating
If you've read "DeepSeek-R1: How a Model Learned to Think," you know that modern LLM training has three stages:
Pre-training → Supervised Fine-Tuning (SFT) → Reinforcement Learning from Human Feedback (RLHF)
Looking at it through a Bayesian lens:
Three-Stage Training = Three Bayesian Updates
First Update: Pre-training
Prior: Randomly initialized weights (knowing nothing)
Likelihood: Trillions of tokens of internet text
Posterior: A general language model ("can talk" but not necessarily useful)
Second Update: Supervised Fine-Tuning (SFT)
Prior: The pre-trained model
Likelihood: High-quality, human-labeled Q&A pairs
Posterior: A conversational model ("knows how to answer questions")
Third Update: RLHF
Prior: The SFT model
Likelihood: Human preference data ("this answer is better than that one")
Posterior: An aligned model ("not only can answer, but knows what a good answer is")
Every single stage tells the same story: Old knowledge (prior) + New evidence (likelihood) → An updated model (posterior).
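For the third update there is a concrete formula behind the analogy. RLHF is commonly formulated as maximizing expected reward minus a KL penalty that keeps the policy close to the SFT model. That formulation is an assumption here, since the objective is not spelled out above. Its optimal policy has a known closed form, π*(y) ∝ π_SFT(y) · exp(r(y)/β), which has the same "prior times evidence, then normalize" shape as Bayes' Theorem. The sketch below checks this on three hypothetical answers with made-up rewards.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form optimum: reference policy reweighted by exp(reward / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # normalization constant
    return [w / z for w in weights]


def objective(policy, ref_probs, rewards, beta):
    """Expected reward minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl


ref = [0.5, 0.3, 0.2]       # hypothetical SFT model: probabilities of 3 answers
rewards = [0.0, 1.0, 2.0]   # hypothetical reward-model scores
beta = 1.0                  # strength of the pull back towards the SFT model

opt = kl_regularized_optimum(ref, rewards, beta)
print([round(p, 3) for p in opt])
print(objective(opt, ref, rewards, beta) > objective(ref, ref, rewards, beta))
```

The optimum shifts probability towards the highly rewarded answer without abandoning the SFT model's preferences. A greedy policy that puts all its mass on the best answer scores lower on this objective because it strays too far from the prior.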
Bayes' Theorem is like an undercurrent, flowing from a pastor's manuscript in 1763, through 263 years of statistical debate, and finally into the core of AI systems that are used billions of times every day around the world in 2026.
Eleven: Three Things Bayes Never Imagined
Back to the title. When Bayes derived his formula, he never imagined—
First: His Formula Applies to All Learning
Bayes just wanted to solve a problem in the "doctrine of chances": given some observations of an event, what can you infer about its underlying probability? He didn't know that the same formula could describe:
- A baby learning a language.
- A scientist testing a hypothesis.
- A doctor diagnosing a disease.
- An AI understanding the world.
Bayes' Theorem isn't a probability formula. It's a learning formula.
Second: A Prior is Not a Bias, It's Wisdom
Throughout the two hundred years of controversy surrounding Bayes, the biggest criticism was: "A prior is subjective, and therefore unscientific."
But the development of AI has proven that a prior is the most precious thing.
A model without a prior (randomly initialized) can do nothing. Pre-training is all about building up a prior. A "biased" model (one that has expectations about the world) is far stronger than an "ignorant" one.
The key isn't whether you have a prior, but whether it's a reasonable one, and whether you're willing to update it in light of new evidence.
Isn't that a life lesson as well?
Third: His Formula Would Become AI's First Principle
In 2026, when you ask ChatGPT a question:
- Its pre-training knowledge is the prior.
- Your prompt is the new evidence.
- Its reply is the posterior.
Every conversation is a Bayesian update.
An English pastor who died in 1761 left behind an unfinished manuscript. 263 years after it was published, it reads like the first principle of the world's most powerful technology.
He didn't know it. But the math did.
Twelve: In One Sentence
The Ultimate Lesson of Bayes' Theorem
Learning is taking what you already know, embracing the evidence you've just seen, and updating your beliefs.
That's what Bayes' Theorem says.
That's what AI does.
And that's also what you do every day.
P(New belief | New evidence) = P(New evidence | Old belief) · P(Old belief) / P(New evidence)
References and Further Reading
Original Literature
• Bayes, T. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society, 53, 370-418. [The posthumous manuscript, organized and published by Richard Price]
• Laplace, P.-S. (1774). Mémoire sur la probabilité des causes par les événements. [Independently re-discovered and popularized Bayes' Theorem]
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. [The definitive work of the Bayesian school]
Bayes in AI
• Xie, S. M. et al. (2022). An Explanation of In-context Learning as Implicit Bayesian Inference. ICLR 2022. [Frames ICL as implicit Bayesian inference over a latent concept]
• Wilson, A. G. & Izmailov, P. (2020). Bayesian Deep Learning and a Probabilistic Perspective of Generalization. NeurIPS 2020. [A Bayesian explanation of SGD]
• Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. [The brain as a Bayesian machine]
Related Blog Posts
• What Shannon Never Imagined – When Information Theory Meets Finite Computing Power — Another hidden line of information theory
• Seeing Math: Information Theory – A Hidden Line from Telegraphs to GPT — Compression = Prediction = Understanding
• The Cross-Entropy Loss Function: A Complete Derivation from -log(p) — Loss function from a Bayesian perspective
• Probability in LLMs: From Rolling Dice to Generating Text — Probability fundamentals
• Seeing Math (Part XIII): Probability—Embracing Uncertainty — The intuition of probability
• DeepSeek-R1: How a Model Learned to Think — RLHF and alignment
• Euler's e—How One Number Appears in Compound Interest, Decay, and Neural Networks — The role of e in Softmax and loss functions
• Knowledge Distillation—When a Model Learns to Apprentice — Another way of knowledge transfer