OpenAI Post-Training Lead: AI Isn't Suddenly Stronger, It Just Crossed a Threshold

Matt Turck, a New York-based VC, sat down with Yann Dubois, OpenAI's post-training lead, for an in-depth interview.

From the inside story of GPT-5.5's release, to why reinforcement learning suddenly works, to the industry's biggest unsolved puzzles, this conversation is packed with insights.

Who They Are

Yann Dubois is the co-lead of the Post-Training Frontiers team at OpenAI. His team is responsible for taking a large model that knows everything but isn't very helpful, and turning it into a product that actually delivers.

GPT-5.5, o3, GPT-5 Thinking — all of OpenAI's recent core reasoning models have gone through his team's hands.

Dubois is Swiss. He studied bioengineering at EPFL as an undergraduate, later earned a master's in machine learning from Cambridge, and then went to Stanford for a computer science PhD on a Knight-Hennessy scholarship. Before his PhD, he also worked on NLP at Grab in Singapore, building language processing pipelines for low-resource languages such as Thai, Khmer, and Burmese, serving 40 million users.

During his time at Stanford, he did two highly influential things. The first was Stanford Alpaca, which fine-tuned an open-source model to near GPT-3.5 performance for under $600 and ignited the entire open-source post-training community. The second was AlpacaEval, which remains one of the industry's most widely used automatic evaluation tools for instruction-following models.

When GPT-5 launched last year, he went on stage for a live demo: he had GPT-5 build a French learning app for his Francophone family, complete with flashcards, a quiz, and a Snake mini-game, writing 240 lines of code and running it in two minutes. (According to him, it failed during the final rehearsal, so he was quite nervous on stage.)

Matt Turck is a partner at FirstMark Capital, an early-stage venture capital firm based in New York. Since 2012, he has published the annual MAD (Machine Learning, AI & Data) Landscape, an essential yearly map of the AI industry; the 2024 edition crammed in 2,011 company logos. Turck is French and previously co-founded the enterprise AI search engine TripleHop, which was later acquired by Oracle.

Crossing the Reliability Threshold

Yann started with a core judgment: AI progress has actually been continuous, but people perceive it as a step function.

Why is that? He gave three reasons.

First, and most crucially: reliability has finally crossed the tipping point.

"You need to hit this level of reliability for AI tools to be truly useful. I think we crossed that line around December last year, at least at OpenAI. We can now trust these models for a lot of the work we do."

He used an analogy: if you think of an agent model as a system with a certain probability of glitching every two minutes, the longer it runs, the higher the chance the final answer is wrong. Their entire focus has been on steadily reducing that 'glitch every two minutes' probability.

Once that probability gets low enough, the user's perception shifts dramatically, even if the underlying progress is gradual.
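The compounding effect in this analogy can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that glitches are independent and that a single glitch ruins the final answer; the function name and the probabilities are hypothetical, not figures from the interview.

```python
def task_success_probability(glitch_prob: float, minutes: float,
                             step_minutes: float = 2.0) -> float:
    """Chance of finishing with no glitch, given a per-step glitch probability.

    Assumes independent glitches and that any single glitch spoils the result
    (a simplification made for this example).
    """
    steps = minutes / step_minutes
    return (1.0 - glitch_prob) ** steps


if __name__ == "__main__":
    # Hypothetical per-step glitch probabilities, evaluated on a one-hour task.
    for p in (0.05, 0.02, 0.01, 0.005):
        print(f"glitch_prob={p:.3f} -> success over 60 min: "
              f"{task_success_probability(p, 60):.2f}")
```

Under these assumptions, lowering the per-step glitch rate from 5% to 0.5% moves a one-hour task from roughly a one-in-five chance of success to better than four in five, which is how gradual progress can feel like a sudden jump.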

Second reason: models are starting to accelerate themselves.

OpenAI internally uses its own models extensively to write code, build tools, and conduct research. As models get stronger, the speed of internal R&D picks up, creating a positive feedback loop.

Third reason: reinforcement learning has moved from competitions to the real world.

Last year's o1 and o3 focused mainly on math and coding competitions, because it's easy to judge right and wrong in those scenarios. This year, they discovered that the tools and methods developed for "verifiable rewards" can also be applied to real-world situations.

Moving from competitions to practical use is why people are feeling this AI progress right now.

Launching GPT-5.5 Felt Like a Rollercoaster

Every promising model goes through an emotional rollercoaster inside OpenAI: at first everyone is excited, then people gradually start to question it, pointing out that it fails on this task or has issues in that area, which leads to a "skepticism phase."

"This fluctuation happens with every model. GPT-5.5 was no exception, but its amplitude was probably the largest. People were first extremely excited, then much less so, and finally we launched, and the external feedback was great."

Figure: Emotional Rollercoaster of Launching GPT-5.5

Asked about what he's most proud of, he mentioned two things.

One is efficiency: GPT-5.5 is about twice as fast on most tasks.

The other is whole-company alignment. This model's success required every team, from pre-training to inference optimization to post-training, to push in the same direction.

Vertical and Horizontal Teams

This led to a question: how are OpenAI's teams actually organized?

Yann explained they have two types of teams.

Vertical teams focus on specific application areas—for example, some specialize in agent coding, some in computer use, some in knowledge work. Each team drives improvements in its own vertical.

Horizontal teams, like Yann's own team, do three things:

They decide what goes into the final training run and what doesn't; they integrate all vertical improvements for the large training runs; and they work on universal improvements across all applications, such as instruction following, function calling, and thinking time allocation.

Figure: Collaboration of Vertical and Horizontal Teams

The benefit is that vertical and horizontal improvements can progress orthogonally. Maybe only half the vertical teams made improvements in this version; the next version will be the other half's turn.

Thinking Efficiency

What exactly is the difference between GPT-5.5 Thinking and GPT-5.5 Pro?

Yann's answer: essentially, it's just the amount of compute used at test time. The longer the model thinks, the higher the probability of a correct answer. But this curve is logarithmic—doubling the compute might only yield a tiny improvement.
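What a logarithmic curve implies can be seen in a toy model in which each doubling of test-time compute adds a fixed increment of accuracy. The function and both constants below are made up for illustration; the interview gives no actual numbers.

```python
import math


def accuracy_at(compute: float, base_accuracy: float = 0.60,
                gain_per_doubling: float = 0.03) -> float:
    """Toy log-linear curve: a fixed accuracy gain per doubling of compute.

    `compute` is in relative units (1.0 = baseline). All constants are
    hypothetical.
    """
    if compute < 1.0:
        raise ValueError("compute is measured relative to a baseline of 1.0")
    return min(1.0, base_accuracy + gain_per_doubling * math.log2(compute))


if __name__ == "__main__":
    for c in (1, 2, 4, 8, 16):
        print(f"{c:>2}x compute -> accuracy {accuracy_at(c):.2f}")
```

In this toy curve, going from 1x to 16x compute buys only 12 points of accuracy, the kind of trade-off that makes a long wait unattractive for an impatient user.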

He himself rarely uses Pro.

"Personally, I don't use Pro much because I'm impatient and don't like waiting that long. Accuracy does increase, but the improvement just isn't enough for me yet."

But there's one group who loves Pro: mathematicians.

They can throw a problem at Pro and let it run in the background for an hour or two, without needing rapid iteration.

So what's behind the efficiency gains?

Yann used an analogy: an expert and an intern doing the same task. The intern might spend a day or two, trying ten different approaches because they don't know which path is correct. An expert, with experience, knows which direction to take and won't waste time on dead ends.

The efficiency improvement in models is essentially about making them "experts" who know which reasoning pathway is more likely to be correct.

Larger models are inherently more efficient because they have already "thought through" part of the problem via their weights, without needing extra tokens at inference time. While a bigger model means a higher cost per token, large models are easier to parallelize on GPUs, so overall efficiency is actually better.

The Pre-Training Wall Isn't There

A big narrative in the AI industry last year was that "pre-training has hit a wall."

Yann said he thought the same two years ago, but looking back, that wall never appeared.

"Look at Anthropic's Mythos. Judging by model cost, it is clearly a much larger model. And it achieved great performance simply by scaling up the model size. I think at least some in the industry were surprised by that."

What about the data wall? Isn't there a shortage of data?

He says companies seem to have found their own ways to work around the problem of insufficient internet data. As for whether it's via multimodal data or synthetic data... he can't say much, but offered a candid observation:

"Look at Anthropic's models—they aren't particularly strong on multimodal, yet they are still very smart. So multimodal data is, at least, less necessary than I used to think."

He believes the moment multimodal data truly shines might have to wait until embodied AI matures. A robot interacting with the physical world could help a model gain common sense that is very hard to learn from text alone, like... the feeling of gravity.

From Library to Expert

Yann used an easy-to-grasp analogy to explain the "Pre-training → Mid-training → Post-training" pipeline.

Pre-training is like walking into a library. Theoretically, all information is there, but you have to find it yourself. And the library contains everything—ads, forum posts, Wikipedia—all learned indiscriminately.

Mid-training is like selecting the high-quality books from the library and reading them multiple times. Content with higher information density, such as Wikipedia or GitHub code, is given more weight during training.
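One common way to give some sources more weight is to sample training documents from a weighted mixture. The sketch below assumes that approach; the source names and weights are hypothetical and do not describe OpenAI's actual recipe.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical mixture: denser sources are sampled more often.
MIXTURE_WEIGHTS: Dict[str, float] = {
    "wikipedia": 5.0,
    "github_code": 4.0,
    "forum_posts": 1.0,
    "ads": 0.1,
}


def sample_sources(n: int, weights: Dict[str, float] = MIXTURE_WEIGHTS,
                   seed: int = 0) -> List[str]:
    """Draw n source names in proportion to their mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[s] for s in names], k=n)


if __name__ == "__main__":
    # Show how often each source is drawn in a batch of 10,000 documents.
    print(Counter(sample_sources(10_000)).most_common())
```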

Post-training transforms a "book-smart person" who has read everything into an "expert" you can ask directly. You no longer need to look things up; just ask, and the expert understands your question and gives a useful answer.

The two core stages of post-training are:

SFT (Supervised Fine-Tuning): Human annotators provide standard answers, and the model imitates them. The problem is that the model's capability gets capped at the annotator's level; it can never surpass the "teacher."

Reinforcement Learning (RL): Instead of providing a standard answer, you provide a judging rule. The model tries various responses on its own; correct ones get rewarded, wrong ones get penalized. This way, it can surpass the level of human annotators.
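The difference between a standard answer and a judging rule can be illustrated with a toy task that has several correct answers. In the sketch below, the task (find a nontrivial factor pair of an integer) and the +1/-1 reward values are assumptions chosen for illustration.

```python
from typing import Tuple


def factor_reward(n: int, answer: Tuple[int, int]) -> float:
    """Judging rule: reward any nontrivial factor pair of n, penalize the rest."""
    a, b = answer
    is_nontrivial = 1 < a < n and 1 < b < n
    return 1.0 if is_nontrivial and a * b == n else -1.0


if __name__ == "__main__":
    # Stand-ins for answers sampled from a model (hypothetical values).
    for candidate in [(7, 13), (13, 7), (1, 91), (9, 10)]:
        print(candidate, factor_reward(91, candidate))
```

No reference answer is stored anywhere: the rule accepts any response that passes the check, so the model is not limited to reproducing what an annotator wrote.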

The open-source community's typical approach is: first, do SFT to get the model to a decent baseline, then use RL to break through the ceiling. Diving straight into RL is too inefficient because the model needs to "stumble upon" the correct answer to be rewarded.

Why Reinforcement Learning Started Working

Wasn't reinforcement learning notoriously finicky before?

Yann admitted that two years ago, most researchers (including himself) thought RL was too unstable to bother with. When he saw ChatGPT using RLHF, his first reaction was: I can do just as well without RL. Stanford Alpaca came from this thinking, aiming to replicate ChatGPT's effects using only SFT.

Yann LeCun also famously said that reinforcement learning is just the cherry on the cake.

But things have changed.

"It seems that once the model crosses a certain size threshold, meaning it already has a good enough prior knowledge of the world, reinforcement learning starts to work. This isn't just an LLM phenomenon. Robotics seems to be entering the same phase—they are also finding that using models that already understand the world for RL is much more effective."

In the open-source community, methods are also converging. Previously there were PPO, DPO, and various other "XPO" variants; now almost everyone uses GRPO. The reason is simple: GRPO is a minimal method. Sample many answers, judge which ones are correct, and reinforce the correct ones.
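The core of that recipe can be sketched in a few lines. The function below computes group-relative advantages, the normalization step used in GRPO, from the rewards of several sampled answers to the same prompt. It is a simplified illustration: it omits the policy-gradient update, clipping, and any KL regularization, and the function name is hypothetical.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each sampled answer relative to the other answers in its group.

    Answers above the group mean get a positive advantage (reinforced);
    answers below it get a negative one (suppressed).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt; only the first was judged correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

Note that when every answer in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the model needs a decent starting point before RL pays off.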

"In machine learning, we see this pattern repeatedly: the simplest method that can scale with compute is always the one that wins in the end."

But RL is not without its challenges.

At the infrastructure level, the computational cost of sampling a huge number of answers is quite substantial.

At the machine learning level, the most painful issue in agent tasks is "attribution." An agent runs a long reasoning process and ultimately gets a right or wrong result. But which specific step led to success or failure? The information is too sparse for precise attribution.
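The sparsity problem can be seen in a toy trajectory. With an outcome-only reward, every step receives the same credit regardless of whether it helped or hurt. The step names and the helper below are hypothetical.

```python
from typing import Dict, List


def outcome_only_credit(steps: List[str], final_reward: float) -> Dict[str, float]:
    """Assign the single end-of-episode reward to every step equally."""
    return {step: final_reward for step in steps}


if __name__ == "__main__":
    trajectory = ["read_issue", "edit_wrong_file", "revert_edit",
                  "apply_fix", "run_tests"]
    credit = outcome_only_credit(trajectory, final_reward=1.0)
    # The detour ("edit_wrong_file") is reinforced exactly as much as the fix.
    for step, value in credit.items():
        print(f"{step:16s} {value:+.1f}")
```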

Craft or Science?

Some say AI systems aren't so much "built" as they are "grown." Yann's perspective:

"The typical pattern is: it starts as a craft. People try many things and gradually build intuition about what works and what doesn't. Then, over time, it transitions into science."

"The scientific method is rarely the one that works first. It's very rare for someone to come up with a rigorous theoretical derivation that says 'this is optimal' and it just works from the start. People have this kind of alchemical intuition, they figure it out first, and then they go back and understand why it works."

Craft first, science later. Both are indispensable, just at different stages of the pipeline.

The Truth About Generalization

GPT-5.5 performs well in agent coding, computer use, and knowledge work. Is that because of specialized training in each domain?

Yann believes generalization happens mainly at the capability level, not the domain level.

If a model excels at math competitions, it usually doesn't do poorly in coding competitions either, because the underlying capabilities required are the same. Conversely, if a model has a flaw in one area (like hallucination), it will have that flaw across all domains.

But there's one type of generalization that remains a challenge: from precisely defined problems to the fuzzy real world.

"Math and coding competition problems are defined very precisely—five or fifteen lines contain all the information you need to solve it. But in the real world, if I'm a consultant or a finance professional, I first need to go online, search, and extract various pieces of information just to understand the problem itself, only then can I begin reasoning."

This is also why hallucination exists in every domain: the habit of fabricating when it doesn't know is a horizontal capability defect, not a domain-specific problem.

How RL Tackles Hallucination

On the topic of hallucination, Yann cited a classic analysis by John Schulman.

SFT can actually create hallucinations. Why?

Suppose a model doesn't know about the existence of a certain paper, but in the SFT training data, an annotator cited that paper as the source for an answer. The model is trained to mimic this response, and as a result, it learns to cite something it doesn't actually know exists.

Reinforcement learning naturally avoids this pitfall.

The reason is that RL starts from the model's own samples. The model is unlikely to spontaneously generate something it doesn't know about and then happen to be right, so it is almost never rewarded for "fabrication." Conversely, if it generates something it doesn't know and gets it wrong, it gets penalized, and that behavior is suppressed.
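The argument can be expressed as a simple expected-value calculation. The sketch assumes a reward of +1 for a correct answer and -1 for a wrong one; these values, and the function name, are illustrative assumptions rather than anything stated in the interview.

```python
def expected_reward_of_guessing(p_correct: float, reward_correct: float = 1.0,
                                penalty_wrong: float = -1.0) -> float:
    """Expected reward when the model asserts something it is unsure about."""
    return p_correct * reward_correct + (1.0 - p_correct) * penalty_wrong


if __name__ == "__main__":
    # A fabricated citation is right only by luck (hypothetical 1% chance).
    print("fabricating:", expected_reward_of_guessing(0.01))
    # A fact the model actually knows (hypothetical 90% chance of being right).
    print("known fact: ", expected_reward_of_guessing(0.90))
```

With these numbers, fabricating has a strongly negative expected reward while stating a known fact has a positive one, so repeated RL updates push probability mass away from fabrication.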

Figure: Different Effects of SFT vs RL on Hallucination

SFT teaches the model to 'confidently cite non-existent things,' while RL teaches the model 'don't say what you don't know.'

Explicit vs. Implicit

But RL can also cause some "negative generalization."

Yann gave a concrete example: explicit instruction following vs. implicit instruction following.

If you ask a model to edit a file but make a typo in the filename, a model highly trained on explicit instruction following will dutifully try to modify the misspelled file. A human colleague, however, would probably notice the typo and correct it automatically.
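What the human colleague does here can be approximated with fuzzy matching. The sketch below uses Python's standard `difflib.get_close_matches` to resolve a misspelled filename against the files that actually exist; the helper name, the cutoff value, and the example filenames are hypothetical, and this is only an illustration of implicit intent, not how any OpenAI model works.

```python
import difflib
from typing import List, Optional


def resolve_filename(requested: str, existing: List[str],
                     cutoff: float = 0.8) -> Optional[str]:
    """Return the requested file, or the closest existing name if it is a near miss."""
    if requested in existing:
        return requested
    matches = difflib.get_close_matches(requested, existing, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    files = ["config.yaml", "main.py"]
    print(resolve_filename("conifg.yaml", files))  # typo in the request
    print(resolve_filename("readme.md", files))    # no plausible match
```

A literal instruction follower would skip the lookup and act on the misspelled name; the design question is how aggressively to second-guess the user, which the `cutoff` parameter stands in for here.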

"Sometimes we hear feedback that OpenAI models are great when you explicitly tell them what you want, but if you're not explicit enough, they aren't as good."

Figure: Explicit vs Implicit Instruction Following

This illustrates a potential conflict between horizontal capabilities: the better you get at explicit instruction following, the more you might regress in understanding implicit intentions.

Can RL Cover the Entire Industry?

So, can reinforcement learning truly be extended to all fields like law, medicine, and finance?

Yann believes it can, but there are two real-world bottlenecks.

One is the people bottleneck. Most people building AI models are programmers themselves, so they naturally understand what's needed for coding tasks. But to make a model excel in law, you need people who truly understand law to participate in evaluation and data collection, and such people are scarce.

The second is the difficulty of reward design. Some fields are naturally easy for RL. Cybersecurity is one: when you find a vulnerability, it is either real or not, and that is extremely cheap to verify. But in law or medicine, the standard for "correct" is much more ambiguous.

"The model doesn't have any inherent capability limitations that would prevent it from eventually doing well in law or medicine. The real constraints are: we don't understand these domains well enough yet, and some domains are indeed easier for reinforcement learning than others."

The Evaluation Conundrum

The stronger models get, the harder they are to evaluate.

"Now I might just say 'help me build a website that does X.' Before, I would say 'are there any bugs in this code?' The latter is easy to judge, because you can list all the bugs and compare automatically. But the first one has many correct answers; there are many ways to build a website well."

Another tricky issue: models have surpassed most humans in certain areas, meaning the pool of people qualified to evaluate them is shrinking.

There's also a cultural factor:

"Most people want to do model training; they feel that's the high-impact work. But finding problems and quantifying improvements is equally, if not more, important. There's always this cultural gap, though."

When he himself joined OpenAI, his first choice was to work on data and evaluation, because he knew nobody else was doing it, so the impact would be greatest.

Model-as-Judge is one of the directions he considers most important. Better models can become better teachers and judges for other models, creating a capability flywheel.

But this also creates an awkward side effect: every time you build a good evaluation set, it is simultaneously a high-quality training set. Once a model trains on similar data, it can score high on that evaluation, rendering the evaluation useless.

Figure: Evaluation's Shelf Life Is Getting Shorter

As a result, the shelf life of any given evaluation keeps getting shorter.

Three Years On, Still Unsolved

Yann said the direction he is most excited about is Continual Learning, but he also admits this problem hasn't really been solved yet.

A friend once proposed a mental framework that he found quite illuminating:

Imagine a coordinate axis, with the X-axis representing time and the Y-axis representing usefulness to the user.

At t=0, an AI model might be more useful than most new hires—a fairly high starting point. But after that, the curve is basically flat, because the model doesn't truly learn internal company knowledge or become more efficient over time.

A new human employee's starting point is lower, but their learning curve is much steeper.

What truly matters is the area under the curve, which represents cumulative value. By that metric, humans still come out ahead in many scenarios.
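The framework can be sketched numerically. The two usefulness curves below are hypothetical (a flat 0.7 for the model, and a new hire who starts at 0.3 and improves by 0.05 per month up to 1.0); only the shape of the comparison comes from the interview.

```python
from typing import Callable


def cumulative_value(usefulness: Callable[[int], float], months: int) -> float:
    """Approximate the area under a usefulness curve with monthly samples."""
    return sum(usefulness(t) for t in range(months))


def ai_usefulness(month: int) -> float:
    """Flat curve: a high starting point, no learning over time (hypothetical)."""
    return 0.7


def human_usefulness(month: int) -> float:
    """Lower start, steady improvement, capped at 1.0 (hypothetical)."""
    return min(1.0, 0.3 + 0.05 * month)


if __name__ == "__main__":
    for horizon in (3, 12, 24):
        ai = cumulative_value(ai_usefulness, horizon)
        human = cumulative_value(human_usefulness, horizon)
        print(f"{horizon:>2} months: AI={ai:.2f}  human={human:.2f}")
```

With these made-up numbers the model is ahead over a short horizon and the human overtakes it over two years, which is the gap continual learning is meant to close.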

"When ChatGPT first came out three years ago, a friend and I were thinking about starting a company around continual learning and personalization. We thought, ah, OpenAI will definitely figure this out within six months. They have all the data, all the users, the model would learn from users incredibly fast. Three years later, I feel like we are still not there yet."

He candidly admitted he doesn't fully understand why it's so hard. For continual learning for a single user, he believes it should be solvable if enough resources were truly thrown at it.

But to this day, it remains an unsolved mystery.

The Shelf Life of Harnesses

A hot topic in the AI industry recently: will the model eventually "eat" the agent frameworks (harnesses)?

Yann's attitude is pragmatic: harnesses are useful in the short term, but don't count on them being durable long-term.

"If you're a company focused on a specific vertical, and you want to push reliability from 80% to 85%, a harness can help you do that. But you need to know that this harness will need to be re-adjusted in the future."

"If you're trying to build a general harness that can stay stable long-term, I think that fundamentally won't work."

Then he said something surprising:

"If we froze the current models and seriously worked on harnesses, I think people would feel AGI in almost every field."

His point is: current model capabilities are already sufficient; what's missing is packaging and the last mile of engineering. But because models are constantly improving, the optimal harness is also a moving target, so nobody knows what its final form will look like.

The Last Mile

At the end of the conversation, Matt asked the question founders care about most: as models get stronger, is there still room for startups?

Yann answered with an immediate yes.

"Many people see the bottleneck as 'intelligence' itself, the raw capability of the model. But I don't think so. Most of the time, the real bottleneck is the last mile."

"Ensuring the model has the right permissions, the right data connectors, the right domain knowledge. We are intensely focused on pushing general capabilities, and the value mining in specific verticals should be done by other companies."

He encouraged founders to continue digging deep into verticals. In his view, until OpenAI stops making horizontal progress (which he doesn't see happening in the near term), the space for startups in verticals will always exist.

Figure: The Last Mile from Generalist to Specialist

Models are generalists, but users need specialists. The distance from generalist to specialist is where startups find room to survive.
