The Ghost of Markov: From Predicting the Next Word to Predicting the Next Action

In 1913, Russian mathematician Andrey Markov opened Pushkin's "Eugene Onegin" and began counting vowels and consonants. He wasn't conducting literary research; he was dismantling an old assumption, that each trial is independent of the last. In text, the next symbol does not appear out of nowhere; it is pulled by the previous state. Over a century later, LLMs are trained to predict the next token, and Agents are trained to choose the next action. This goal seems mundane, yet it compresses language, knowledge, reasoning, and action into an extremely dense supervisory signal. Markov did not envision ChatGPT, but he left behind a minimal grammar for modern AI: state, transition, next step.

A Mathematician Opened a Book of Poems

1913, St. Petersburg.

A 57-year-old Russian mathematician opened Pushkin's "Eugene Onegin."

He didn't come to read poetry.

He picked up a pen and started counting.

This letter is a vowel, note it down. The next is a consonant, note it down. The next is a consonant, the next is a vowel. He counted twenty thousand letters and ended up with a string of symbols that seemed utterly unpoetic:

vowel, consonant, consonant, vowel, consonant, vowel, vowel, consonant...

If anyone had been watching, they probably would have thought he was mad.

One of the greatest works of Russian poetry had been reduced, in his hands, to a string of Vs and Cs.

But what he truly cared about wasn't Pushkin.

He cared about a deeper question:

Is there a relationship between the next symbol and the previous one?

Before that, probability theory was most familiar with coin flips, dice rolls, and drawing balls from an urn. These problems share a common assumption: each trial is independent of the others. One coin landing on heads does not change the probability that the next one does.

But language isn't like that.

If you just saw a consonant, the probability of the next letter being a vowel increases. If you just saw a vowel, the probability of the next letter also being a vowel probably decreases. Letters do not appear in isolation; they pull on each other.

This mathematician was named Andrey Markov.

He had no idea that, over a century later, humanity would build a machine that does something similar at every step:

Seeing the preceding sequence, predicting the next token.

Even less could he imagine that, one step further, this machine would start using tools, modifying code, browsing the web, and executing commands.

From predicting the next letter, to predicting the next word.

From predicting the next word, to predicting the next action.

Markov did not invent the large language model, nor did he invent Agents.

But his ghost haunts many corners of modern AI.

1. The Old World: Each Time is Unrelated

First, think of a coin.

You flip it ten times. The first nine times are all heads. What is the probability the tenth time is heads?

Still 50%.

This is an independent event.

What happened in the past does not affect the next time.

Coin flip:
P(10th flip = heads | first 9 flips are heads) = 50%
P(10th flip = heads) = 50%
The past did not change the future.
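For a fair coin with independent flips, the 50% follows directly from the definition of conditional probability. Here H_i is shorthand for "flip i lands heads," a notation introduced only for this worked step:

```latex
P(H_{10} \mid H_1, \dots, H_9)
  = \frac{P(H_1, \dots, H_{10})}{P(H_1, \dots, H_9)}
  = \frac{(1/2)^{10}}{(1/2)^{9}}
  = \frac{1}{2}
  = P(H_{10})
```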

Independence is beautiful.

Because it makes the world easy to calculate. Every trial is like starting over; you don't have to carry the past on your back. Many early theorems in probability theory were built upon this clean assumption.

But the real world is rarely this clean.

If it rains today, the probability of rain tomorrow increases.

If you just said "I love," the probability of the next word being "you" skyrockets.

If the stock market crashes today, the sentiment at tomorrow's opening won't be as if nothing happened.

If a program just failed to compile, the most logical next step isn't "continue writing new features," but rather "read the error log."

The real world is not a coin.

The real world has memory.

Markov's revolution began right here.

He didn't ask whether the future is entirely determined by the past. That's a question of fatalism.

He asked a more mathematical, more computable question:

Can the future be predicted solely by the current state?

This sentence sounds simple, but it changed the direction of probability theory.

2. Markov Chains: Compressing the Past into the Present

The core of a Markov chain is just one sentence:

The next step depends only on the current state, and not directly on older history.

Written as a formula, it is:

P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t)

In other words:

As long as you know "where you are now," you don't need to memorize the entire history.

This is not to say the past is unimportant.

Of course, the past is important.

It's just that the influence of the past has already been absorbed by the "current state."

Imagine you are navigating.

You start from home, pass three roads, bypass two traffic lights, and finally stand at an intersection. Now you must decide the next direction.

For the navigation app, the most important thing isn't the roundabout route you just took, but:

Which intersection are you at now? Which direction are you facing? Where does each road lead?

Your history has been compressed into your "current location."

This is the intuition of the Markov property.

Long history: Home → Community gate → Subway station → Mall → Intersection A
Compressed into a state: Current state = Intersection A
Next step:
P(turn left | Intersection A)
P(turn right | Intersection A)
P(go straight | Intersection A)

A Markov chain is composed of two things: states and transition probabilities. The state is where you are now (e.g., sunny, rainy; letter A; a webpage; the current state of a code repository). The transition probability is the chance of moving from one state to the next (e.g., sunny to rainy 20%; 'A' followed by 'B' 5%).

This is the most minimal sequence world:

State →(transition)→ State →(transition)→ State →(transition)→ State
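A two-state weather chain makes "states plus transition probabilities" concrete. What follows is a minimal sketch: only the "sunny to rainy 20%" figure comes from the text above, and the remaining probabilities, the names `TRANSITIONS`, `next_state`, and `simulate`, and the fixed seed are assumptions made for illustration.

```python
import random

# TRANSITIONS[current][next] = probability of moving from current to next.
# Only "sunny -> rainy = 0.2" comes from the text; the other numbers are made up.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng):
    """Sample the next state using nothing but the current state."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def simulate(start, steps, seed=0):
    """Walk the chain for `steps` transitions and return every state visited."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path

print(simulate("sunny", 10))
```

Note that `next_state` never receives the path so far. That omission is the Markov property in code form.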

If Shannon asked "How much information is in this string of symbols?", and Bayes asked "How does belief update after seeing evidence?", then Markov asked:

Standing in the present, what is the most likely next step?

This is the primal question of language models.

3. Language is a Markovian River

You read this half-sentence:

Today the weather is very

What might the next word be?

nice, hot, cold, terrible, good

You wouldn't guess "quantum," "tomato," or "file system."

Because the preceding words have already narrowed down the probability distribution.

Language is not a random bag of words. Language is a river with a current.

One word pushes the next; one sentence pushes the next.

Early language models worked exactly like this.

The simplest is the bigram:

P(next word | current word)

For example:

P("you" | "love") is very high.
P("tomato" | "love") is possible.
P("tensor" | "love") is very low.

A step further is the trigram:

P(next word | previous two words)

And even further is the n-gram:

P(next word | previous n-1 words)

This was the core tool of early natural language processing.

It is very Markovian: only look at the history within a limited window, and use this window to predict the next step.
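A word-level bigram model can be estimated by counting adjacent pairs. The sketch below is illustrative only: the tiny corpus and the function names `train_bigram` and `next_word_probs` are invented for this example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Normalize the counts for `prev` into P(next word | prev)."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# Toy corpus, made up for this example.
corpus = "i love you . i love tea . you love tea .".split()
model = train_bigram(corpus)
print(next_word_probs(model, "love"))
```

A trigram version would key the table on the previous two words instead of one; the counting logic stays the same.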

Its problems are also obvious. The window is too short; it cannot remember distant information.

Xiao Ming put the borrowed book, with its faded cover and an old train ticket tucked inside, on the table yesterday. This morning, he discovered it was gone.

What does "it" refer to? It refers to "the book." But "the book" is very far from "it." A short-window n-gram has likely already forgotten.

What about lengthening the window? The number of states explodes. With a vocabulary of 100,000 words, a bigram has 100,000 possible contexts (states), a trigram has 100,000^2 = 10^10, and a 10-gram has 100,000^9 = 10^45.
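The arithmetic behind the explosion is a one-liner. This sketch assumes the 100,000-word vocabulary from the text; the helper name `num_contexts` is hypothetical.

```python
VOCAB_SIZE = 100_000  # vocabulary size used in the text

def num_contexts(n, vocab_size=VOCAB_SIZE):
    """Number of distinct (n-1)-word contexts an n-gram model must distinguish."""
    return vocab_size ** (n - 1)

for n in (2, 3, 10):
    print(f"{n}-gram: {num_contexts(n):.0e} possible contexts")
```

Each extra word of context multiplies the table by another factor of 100,000, which is why no corpus could ever fill the table for a long window.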

This is the first wall Markov chains hit in language:

If the state is too thin, you forget. If the state is too thick, it explodes.

Later deep learning has essentially been trying to solve this problem continuously.

RNNs tried to compress history into a hidden state. LSTMs added gating mechanisms to this hidden state to prevent forgetting too quickly. The Transformer simply spread a very long history out flat and used Attention to allow every position to look back.

They are all doing the same thing:

Compressing the past into a sufficiently useful present.

This sentence is where Markov's ghost first truly manifested itself.

4. Why Not Predict the Next Sentence?

Here arises a very natural question.

Since humans don't think one token at a time when speaking, why train AI to predict the next token?

Why not predict the next sentence?

Or even further: Why not directly predict a complete answer, a complete plan, a complete conclusion?

If I weren't looking back from today, but designing an AI from scratch, I probably wouldn't think of "predicting the next token" first either.

I would first think: The machine needs to store knowledge, record its current state, have reasoning rules, be able to calculate, deduce, and predict, and know what it doesn't know.

These ideas are all correct.

But they immediately encounter a harder question:

How do you represent these things?

How do you segment a knowledge base? How do you define state variables? Who writes the rules? When does reasoning start, and when does it stop?

The world is too vast. The more you try to design intelligence from a God's-eye view, the more likely you are to get stuck at the step of "first modeling the world clearly."

The cleverness of next-token prediction lies precisely in the fact that it does not, from the very start, require us to explicitly design the entire intelligence.

It shrinks the problem down to an extremely small action:

Given the preceding text, what is the next token?

This action is so small it's almost boring.

But it has three engineering advantages.

First, its supervisory signal is extremely dense. A piece of text with one thousand tokens is not just one training sample, but approximately one thousand small problems.

Seeing the 1st token, predict the 2nd.
Seeing the first 2 tokens, predict the 3rd.
...
Seeing the first 999 tokens, predict the 1000th.
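The slicing of one text into many small problems can be sketched as follows. The function name `next_token_examples` and the six-token sample are illustrative; real training pipelines batch this differently, but the idea is the same.

```python
def next_token_examples(tokens):
    """Yield (prefix, target) pairs: every position after the first is one training example."""
    for t in range(1, len(tokens)):
        yield tokens[:t], tokens[t]

tokens = ["the", "state", "determines", "the", "next", "step"]
for prefix, target in next_token_examples(tokens):
    print(prefix, "->", target)
```

A sequence of n tokens yields n - 1 such pairs, so a thousand-token text supplies 999 prediction problems.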

If you changed this to "predict the next sentence," the supervisory signal would suddenly become sparse. One sentence can have many equally sensible ways of being written. Its boundaries are also unstable: where does one sentence end? How long should the next sentence be? Is it an answer, an explanation, a counter-question, or silence?

Second, tokens are composable. A sentence is not an atom. It is a trajectory.

I → think → that → the → key → to → this → problem → is → state

Each step constrains the next. Each step also exposes whether the model truly understood the previous state. Predicting the next sentence isn't wrong; it just hides all the intermediate steps of the entire trajectory.

Third, small-step predictions can be corrected repeatedly. During training, the model gets feedback at every step. When the real token appears, if the model assigned it a low probability, it is penalized by cross-entropy. This means it doesn't just realize it was wrong at the end of the article. It is questioned at every position:

Do you really know what's happening right now?
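The penalty at each position is the negative log of the probability the model gave to the token that actually appeared. In this small sketch the probability table is a hypothetical model output for the prefix "Today the weather is very", not the output of any real model.

```python
import math

def cross_entropy(predicted_probs, true_token):
    """Per-token loss: negative log of the probability assigned to the true token."""
    return -math.log(predicted_probs[true_token])

# Hypothetical next-token distribution; the numbers are made up.
probs = {"nice": 0.5, "hot": 0.3, "cold": 0.19, "tomato": 0.01}

print(cross_entropy(probs, "nice"))    # small penalty: the model expected this token
print(cross_entropy(probs, "tomato"))  # large penalty: the model was surprised
```

The loss for a whole text is the average of these per-position penalties, which is why every position contributes a learning signal.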

So the goal of predicting the next token is not small in ambition; only its step size is small, and that is exactly what makes the supervisory signal so dense. It isn't simplifying intelligence into a toy problem; it's breaking a problem too big to solve into countless learnable small steps.

The same goes for playing chess. If you train a model to predict the next move in a chess record, ostensibly it's just guessing a placement. But to guess well enough, it must understand the board, threats, initiative, tempo, position, and long-term gains. The next move is not the entirety of intelligence, but it is the smallest cross-section intelligence exposes on the board. The next token is the same. It is not the entirety of language, but it is the smallest cross-section linguistic intelligence exposes in text.

5. How Does a Boring Goal Force Out Intelligence?

The truly profound part is here:

The next token itself has no magic.

The magic lies in the fact that humans have written the traces of a vast amount of intelligent activity into text.

Novels are compressions of human emotions and relationships. Papers are compressions of human concepts and evidence. Code is a compression of human operations and rules. Textbooks are compressions of knowledge structures. Chat logs are compressions of human intent, politeness, misunderstandings, and negotiations.

The internet is not a clean database of truths, but it is a vast ruin of behaviors.

When a model is tasked with predicting the next token in these texts, it is effectively forced to ask:

What kind of world would generate a sentence like this?
What kind of knowledge would make this word more likely to appear?
What kind of reasoning would lead the next step here?
What kind of tone would make this reply seem natural?
What kind of code state would cause this line to be followed by that one?

For example:

The capital of France is ___

To predict "Paris," it needs knowledge.

If A is greater than B, and B is greater than C, then the relationship between A and C is ___

To predict "greater than," it needs reasoning.

def factorial(n):
    if n == 0:
        return 1
    return ___

To predict n * factorial(n - 1), it needs code patterns and recursive structures.

User said: This paragraph is too harsh, help me rephrase it more gently. Assistant replied: ___

To provide a good follow-up, it needs to understand intent, tone, and social context.

This is why a seemingly dull objective forces out abilities that look very complex.

It's not because "predicting the next word" naturally equates to "understanding the world." It's because within real corpora, the next word is often the outcome of many hidden variables acting together. Grammar, facts, causality, roles, goals, style, context, and task constraints all push that word into place. To minimize its loss, the model must learn to compress these hidden variables into its state. It may not understand in the same way a human does, and it may not be reliable. But from the standpoint of the optimization goal, it is indeed forced to learn many structures it "would not predict well" if it didn't learn.

This also explains why we shouldn't underestimate next-token prediction. It's not just a slogan. It's a method that converts the problem of intelligence design into a representation learning problem:

Don't hand-craft the knowledge structure first. First, give the model an extremely dense sequence prediction task. Let it discover for itself what should be compressed into the current state.

This is the most counter-intuitive part of modern AI. We thought intelligence should start from "knowledge bases, rules, and inference engines." Instead, it starts from a problem so small it couldn't be smaller:

What is the next symbol?

Then, with sufficiently large data, sufficiently large models, and sufficiently long training, this small problem begins to swallow the bigger problems in return.

6. This Thread Was Buried in the Literature Long Ago

This line of thinking wasn't only retroactively explained today. Many works have left clues at different points.

When Markov studied "Eugene Onegin" in 1913, he wasn't doing modern NLP, but he had already put the idea that "text is not a pile of independent symbols" on the mathematical table. The next letter is influenced by the previous state. Language can be viewed as a stochastic process with dependencies.
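Markov's counting exercise can be sketched in a few lines. This version uses English vowels as a stand-in, since Markov worked with Russian letters, and the sample sentence and function names `vc_sequence` and `transition_probs` are assumptions made for illustration.

```python
from collections import Counter

VOWELS = set("aeiou")  # English stand-in; Markov worked with Russian letters

def vc_sequence(text):
    """Map each alphabetic character to 'V' (vowel) or 'C' (consonant)."""
    return ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]

def transition_probs(seq):
    """Estimate P(next | current) from adjacent pairs in the sequence."""
    pair_counts = Counter(zip(seq, seq[1:]))
    from_counts = Counter(seq[:-1])
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}

sample = "the next symbol does not appear out of nowhere"
probs = transition_probs(vc_sequence(sample))
for (a, b), p in sorted(probs.items()):
    print(f"P({b} | {a}) = {p:.2f}")
```

If letters were independent, P(V | V) and P(V | C) would be roughly equal. On real text they are not, and that gap is the dependence Markov was measuring.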

When Shannon wrote "A Mathematical Theory of Communication" in 1948, he used a similar idea. He had readers look at zero-order, first-order, second-order, and higher-order approximations of English. The higher the order, the more the generated text resembled English. This is significant; it shows that "resembling language" can gradually emerge from local conditional probabilities.
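An experiment in the spirit of Shannon's approximations can be reproduced with a character-level model. In this sketch `order` means the number of preceding characters conditioned on, which is not the same numbering as Shannon's orders; the corpus, function names, and seed are all illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def train(text, order):
    """Count which character follows each context of `order` characters."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        table[text[i:i + order]][text[i + order]] += 1
    return table

def generate(table, order, length, seed=0):
    """Sample one character at a time from the counted conditional frequencies."""
    rng = random.Random(seed)
    out = rng.choice(sorted(table))  # start from a context seen in training
    while len(out) < length:
        followers = table.get(out[len(out) - order:])
        if not followers:  # context never seen with a follower: stop early
            break
        chars = sorted(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out

# Tiny made-up corpus; a real experiment would use far more text.
corpus = "the next symbol does not appear out of nowhere; it is pulled by the previous state. " * 3
for order in (0, 1, 2):
    print(order, repr(generate(train(corpus, order), order, 40)))
```

With a corpus this small, higher orders mostly replay fragments of the training text, but the trend Shannon pointed to is already visible: the longer the context, the more word-like the output.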

Later, the n-gram language model directly engineered this idea: using the previous n-1 words to predict the next word. This became one of the earliest standard forms of language models.

When Bengio et al. proposed the neural probabilistic language model in 2003, they noted that traditional n-grams face the curse of dimensionality. Their solution wasn't to abandon the problem of "predicting word sequence probabilities," but to use a neural network to learn distributed representations of words, allowing generalization between similar contexts.

Subsequently, GPT-2 stated the matter even more plainly: large models are trained with a simple objective: given the preceding words, predict the next word. Then, on massive amounts of web text, this objective naturally encompassed "natural demonstrations" of tasks like Q&A, translation, summarization, and reading comprehension. GPT-3 further demonstrated that scaling up the model and data significantly enhances few-shot capabilities.

This isn't saying the authors of these papers foresaw everything about today. More accurately, they proved the same thing layer by layer:

Sequence prediction is not a sideshow; it is a main highway towards linguistic intelligence.

When it comes to Agents, another thread of literature connects. In reinforcement learning, MDPs/POMDPs are not concerned with the next word, but rather: in the current state, what action will take the system toward a better future?

Thus, the language model thread and the reinforcement learning thread meet at the Agent. One is responsible for learning the shadow of the world from text; the other is responsible for selecting the next action within that world.

This is Markov's true influence on AI. He did not leave behind just a formula for the "Markov chain." He left behind a way of looking at the world:

Don't first ask what the entire history is. First, ask if the current state is good enough, and what the next step should be.

7. Is an LLM a Markov Chain?

This question is easy to answer incorrectly.

If you say, "An LLM is a Markov chain," that's inaccurate, because a typical first-order Markov chain conditions only on the immediately preceding token: P(x_t | x_{t-1}). An LLM, however, conditions on the entire context: P(x_t | x_1, x_2, ..., x_{t-1}). It doesn't just look at the previous word, so it's not a "first-order word-level Markov chain" in the traditional sense.

But if you say, "LLMs have nothing to do with Markov," that's also inaccurate. The generative approach of an LLM is essentially still a step-by-step decomposition of conditional probabilities: P(x_1, x_2, ..., x_T) = \prod_{t=1}^{T} P(x_t | x_{<t}). In other words, the probability of the entire text equals the probability of the first word, multiplied by the probability of the second word given the first, multiplied by the probability of the third word given the first two, and so on.
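The factorization can be checked numerically with a toy conditional model. The lookup table `COND` below is a hypothetical stand-in for an LLM, which would compute each conditional distribution with a neural network instead of storing it.

```python
import math

# Hypothetical conditional model: prefix (as a tuple) -> P(next token | prefix).
COND = {
    (): {"I": 1.0},
    ("I",): {"love": 0.5, "think": 0.5},
    ("I", "love"): {"you": 0.8, "tea": 0.2},
}

def sequence_log_prob(tokens):
    """log P(x_1..x_T) = sum over t of log P(x_t | x_<t)."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(COND[tuple(tokens[:t])][token])
    return total

# Product of the conditionals: 1.0 * 0.5 * 0.8
print(math.exp(sequence_log_prob(["I", "love", "you"])))
```

Summing log-probabilities instead of multiplying raw probabilities is the usual practice, because the product of thousands of small numbers underflows quickly.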

At every step, an LLM is asking:

Based on everything that has been generated so far, what should the next token be?

This is highly consistent with the Markov spirit. The key difference is that the "current state" has become much more complex. For a first-order Markov chain, the current state is the previous word. For an n-gram, it's the previous n-1 words. For a Transformer, the current state is the hidden representation of the entire context compressed through Attention. More mechanically, the current state consists of tokens, position, residual streams, and the KV cache. This is the core point.

The LLM has not escaped the "state → next step" framework. It has just made the "state" massive, continuous, and learnable. The state of a first-order Markov chain is like a signpost. The state of an LLM is like a self-folding map. The former just tells you "which point you are standing at." The latter compresses the path you've traveled so far, the markings on the road, the task goal, the tone, and the implicit rules, all into a high-dimensional space. Then it asks:

Where to go next?

That is why I say Markov's ghost still resides within the LLM. It's no longer a crude transition matrix; it has evolved into a giant machine that learns state representations.

8. From the Next Word, to the Next Action

If the story ended with LLMs, Markov would already be very important. But what's truly interesting is the Agent, because an Agent doesn't just predict the next word; it must predict the next action.

A typical Agent loop looks like this: observe the environment, update the context, think about the next step, invoke a tool/write a file/execute a command, get a new result, and continue. Isn't this a Markov chain? Almost. More precisely, it resembles a Markov Decision Process (MDP).

An ordinary Markov chain only has state transitions: state → state. An MDP adds two more things: actions and rewards.

State + Action → New State + Reward

State: the Agent's current context, file contents, tool returns, and task goal.
Action: calling a tool, searching, editing a file, running a test, or replying to the user.
Transition: how the environment changes after the action.
Reward: whether the task was completed, the test passed, or the user was satisfied.
Policy: the Agent's rule for choosing the next action.
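The pieces of an MDP fit into a very small data structure. The toy below is a hypothetical, deterministic MDP for "fix a failing test": the state names, actions, and reward values are invented for illustration, and a real MDP would generally have stochastic transitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    next_state: str
    reward: float

# TRANSITIONS[state][action] -> Step(next_state, reward). All values are made up.
TRANSITIONS = {
    "test_failing": {
        "read_log": Step("cause_known", 0.0),
        "write_new_feature": Step("test_failing", -1.0),
    },
    "cause_known": {"edit_code": Step("patched", 0.0)},
    "patched": {"run_tests": Step("test_passing", 1.0)},
}

# A policy maps each state to the action to take there.
POLICY = {
    "test_failing": "read_log",
    "cause_known": "edit_code",
    "patched": "run_tests",
}

def rollout(state, policy, max_steps=10):
    """Follow the policy until reaching a state with no outgoing actions."""
    total_reward, trajectory = 0.0, [state]
    for _ in range(max_steps):
        if state not in TRANSITIONS:
            break
        step = TRANSITIONS[state][policy[state]]
        state, total_reward = step.next_state, total_reward + step.reward
        trajectory.append(state)
    return trajectory, total_reward

print(rollout("test_failing", POLICY))
```

Swapping `read_log` for `write_new_feature` in the policy keeps the agent stuck in `test_failing` and accumulates negative reward, which is the formal version of "read the error log first."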

Strictly speaking, a real Agent often cannot see the complete state of the world. It only sees partial observations: terminal output, webpage snippets, file contents, tool returns, and new user instructions. The true world is larger than what it perceives. So, a more accurate engineering model is a Partially Observable Markov Decision Process (POMDP). But the intuition remains the same: observation results lead to an updated belief state, which leads to the selection of the next action.
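The "observation updates belief" step is an application of Bayes' rule. The sketch below covers only that observation step; a full POMDP belief update would also account for how actions change the hidden state. The hidden states, observations, and likelihood numbers are all hypothetical.

```python
def update_belief(belief, likelihood, observation):
    """Bayes' rule: posterior(s) is proportional to prior(s) * P(observation | s)."""
    unnormalized = {s: p * likelihood[s][observation] for s, p in belief.items()}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical hidden states of a codebase, with a uniform prior.
belief = {"bug_in_parser": 0.5, "bug_in_config": 0.5}

# Hypothetical observation model: P(what the log shows | true hidden state).
likelihood = {
    "bug_in_parser": {"syntax_error_in_log": 0.9, "missing_key_in_log": 0.1},
    "bug_in_config": {"syntax_error_in_log": 0.2, "missing_key_in_log": 0.8},
}

belief = update_belief(belief, likelihood, "syntax_error_in_log")
print(belief)
```

The agent never sees which state is true. It only shifts probability mass toward the states that make the observation less surprising.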

The Agent's context is its belief state about the "current world." So the core question for an Agent is not "What is the next token?" but rather "In the current state, what is the next step most likely to advance the goal?"

This is the shift from a language model to an action model. The LLM predicts text sequences; the Agent predicts action sequences.

LLM: Text state → next token → new text state
Agent: World state → next action → new world state

This is also why reinforcement learning naturally appears. Because once you have "actions," you must face the distinction between "good actions" and "bad actions." A sentence that flows smoothly doesn't guarantee the tool call is correct. A plan that looks brilliant doesn't guarantee it won't delete the wrong files upon execution. A web browsing action that seems reasonable doesn't mean it actually found the key evidence. An Agent isn't writing sentences on paper; it's changing the environment. Once you change the environment, the world will either slap you back or give you a reward. This is the world of the MDP.

9. Why Claude Code Resembles a Markovian System

Take a coding agent like Claude Code as an example; the Markovian flavor is even more pronounced.

When it receives a task like "Fix this test failure," it doesn't directly conjure an answer out of thin air. It advances the state round by round.

State 1: Knows only the user's problem.
Action 1: Reads the test output.
State 2: Knows which test failed.
Action 2: Searches for related code.
State 3: Knows the failing code path.
Action 3: Opens the file.
State 4: Knows the implementation details.
Action 4: Modifies the code.
State 5: Code has been changed.
Action 5: Runs the test.
State 6: Test passes or fails.
Action 6: Continue fixing or summarize.
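The round-by-round pattern can be sketched as a loop over state and action. This is a hypothetical illustration, not a description of how Claude Code is implemented: `choose_action` is a hand-written stand-in for the model's decision, and `apply_action` is a stub environment with invented behavior.

```python
def choose_action(state):
    """Hypothetical hand-written policy; in a real agent an LLM makes this choice."""
    if state["tests_pass"]:
        return "summarize"
    if "log" not in state:
        return "read_test_output"
    if not state["patched"]:
        return "edit_code"
    return "run_tests"

def apply_action(state, action):
    """Stub environment: return the new state that results from the action."""
    state = dict(state)  # copy, so each round works on its own snapshot
    if action == "read_test_output":
        state["log"] = "AssertionError in test_parse"  # made-up tool output
    elif action == "edit_code":
        state["patched"] = True
    elif action == "run_tests":
        state["tests_pass"] = state["patched"]
    return state

def run_agent(goal, max_steps=10):
    """Alternate between choosing an action from the current state and applying it."""
    state = {"goal": goal, "patched": False, "tests_pass": False}
    actions = []
    for _ in range(max_steps):
        action = choose_action(state)
        actions.append(action)
        if action == "summarize":
            break
        state = apply_action(state, action)
    return actions

print(run_agent("Fix this test failure"))
```

The policy reads only the current `state` dictionary, never the list of past actions. Whatever history matters has to be written into the state.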

Each step's action depends on the current state. And the current state is more than just the chat history. It includes the user's goal, the codebase structure, files already read, command outputs, test results, tool permissions, unresolved issues, and modifications already made. This is a high-dimensional state.

What Claude Code does well isn't just the phrase "can write code." More precisely, it can repeatedly choose the next action within a constantly changing state space. This is different from traditional IDE autocomplete. Autocomplete only asks: "What might the next line of code be?" A coding agent asks: "To accomplish this goal, where should I look next, what should I change, what should I run, and what should I verify?"

This is the leap from token prediction to action prediction. The shadow of the Markov chain remains; it's just that the "state" has transformed from a single letter to an entire codebase and task context. The "transition" has gone from one symbol following another to reading files, changing code, and running tests. The "probability" has gone from a small transition table to the judgment of a large model in a high-dimensional space.

10. The Traps of the Markov Property

Up to this point, Markov might seem like a master key. But it has its traps. The trap lies in that phrase: "The next step depends only on the current state."

The question is: Is your current state sufficient?

If the state is too thin, you forget crucial history. For example, a customer service Agent that uses only the user's last sentence as its state. If the user says, "Then let's go with the plan we just discussed," and the Agent has forgotten what "the plan we just discussed" was, it is finished.

If the state is too thick, you can't compute it. Shoving the entire internet, the whole codebase, all historical conversations, and all tool outputs into the state is theoretically optimal, of course. But context window limits, attention costs, retrieval quality, and noise interference will collectively drag you down.

So, the truly difficult part of modern AI systems is state design. Which history must be preserved? Which can be compressed? Which should be retrieved? Which must be forgotten? Which should be written into long-term memory? Which only belongs in the current context?

This is why concepts like RAG, Memory, Context Engineering, and Agent State have become so important. Popular personal agents like OpenClaw, and the model-native harness emphasized in the OpenAI Agents SDK, are also answering the same question. On the surface, they are adding tools, browsers, terminals, file systems, long-term memory, and permission boundaries. Viewed more fundamentally, they are constructing a world that the model can see, operate in, and be constrained by at every step.

They are asking: Before the next action happens, what should the system present to the model? They are all circling the same question:

How to construct a sufficiently good "now," so that the model can make the correct next step?

This sentence is more fundamental than "how to write a prompt." The prompt is only one part of the state; tool returns are a part; the file system is a part; the user goal is a part; historical decisions are also a part of the state. The core of Agent engineering isn't writing the while-loop; the while-loop is easy. The hard part is what the "current state" in the model's hands actually looks like at the start of each loop iteration.
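One very simple way to build the "now" is to always keep the goal and then fill the remaining budget with the most recent events. The sketch below is only one hypothetical policy among many: the function name `build_state`, the character budget standing in for a token budget, and the sample events are all assumptions.

```python
def build_state(goal, events, budget_chars=120):
    """Assemble what the model sees: the goal first, then the newest events that fit."""
    header = f"GOAL: {goal}"
    remaining = budget_chars - len(header)  # newlines are ignored for simplicity
    kept = []
    # Walk from newest to oldest so recent context wins when space runs out.
    for event in reversed(events):
        if len(event) > remaining:
            break
        kept.append(event)
        remaining -= len(event)
    return "\n".join([header] + kept[::-1])

# Made-up event log in chronological order.
events = [
    "user: fix the failing test",
    "tool: read tests/test_parse.py (400 lines)",
    "tool: pytest -> 1 failed",
    "agent: edited parser.py",
]
print(build_state("Fix this test failure", events, budget_chars=70))
```

Pure recency is the "too thin" failure mode in miniature: with a tight budget the original user message falls out of view. That is why real systems add summarization, retrieval, or pinned memory on top of it.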

11. Three Underlying Threads Finally Connect

Now, let's pull together the threads we've written about earlier, compressed into slogans (paraphrases, not quotations). Shannon's lesson: understanding is compression. Bayes's lesson: learning is updating beliefs. Markov's lesson: an intelligent agent must always stand in the present and predict the next step.

Combine these three sentences, and you have the skeleton of LLMs and Agents.

Shannon: Compress history into structure.
Bayes: Update the distribution upon seeing new evidence.
Markov: Choose the next step based on the current state.

When an LLM generates text, these three things happen simultaneously: the context is compressed into a hidden representation (Shannon), the next-token distribution is updated (Bayes), and the model moves from the current state to the next step (Markov). When an Agent acts, these three things also happen simultaneously: environmental information is compressed into context (Shannon), tool results update the task belief (Bayes), and the next action is selected from the current state (Markov).

Thus, Markov is not an isolated topic for a "probability theory explainer." He is the third pillar of our entire AI understanding framework. Shannon gave us the "information" lens; Bayes gave us the "learning" lens; Markov gave us the "process" lens. Without Markov, it would be hard to understand why "predicting the next word" can sprout linguistic ability, and hard to see why the essence of an Agent is not a set of tools, but an ever-unfolding action trajectory.

12. What He Could Not Have Imagined

Let us return to Pushkin. When Markov was counting vowels and consonants, he was concerned with a very technical problem in probability theory: Can the assumption of independence be relaxed? He wanted to prove that even when variables have dependencies, certain limit theorems still hold. This sounds quite narrow, so narrow it almost doesn't seem like a world-changing problem. But many great ideas are like this when they first appear. Shannon was initially solving the problem of how to transmit signals reliably over noisy channels such as telephone lines. Bayes was initially working on an inverse-probability problem, in an essay that was published only after his death. Markov was initially counting vowels and consonants in Pushkin's verses. None of them were "inventing AI," but they all laid a foundation for it.

The foundation Markov left can be condensed into six words:

The state determines the next step.

Of course, this sentence must be understood carefully. It does not mean that fate is already written, nor that there is no freedom in the future. It means that if you want a machine to act in time, you must give it a state so that it can deduce the next step from that state. This is the inescapable core of everything from text completion to code agents, from chatbots to autonomous driving, from game AI to robotic control. Intelligence is not static. Intelligence always unfolds in time. And as long as intelligence unfolds in time, Markov's ghost is there, standing behind every single "next step."

Main References and Further Reading

A. A. Markov, Extension of the limit theorems of probability theory to a sum of variables connected in a chain, 1906
A. A. Markov, An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains, 1913
Claude Shannon, A Mathematical Theory of Communication, 1948
Richard Bellman, Dynamic Programming, 1957
Daniel Jurafsky and James H. Martin, Speech and Language Processing
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin, A Neural Probabilistic Language Model, 2003
Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, 2015
OpenAI, Better Language Models and Their Implications, 2019
OpenAI, Language Models are Few-Shot Learners, 2020
Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018