Meng Chen, reporting from Aofeisi | QbitAI
GPT-5.5 has just arrived.
OpenAI officially positions it as "a new form of intelligence designed for real-world work and agents."
This time, Altman didn't step out himself to say he was "so stunned by the first experience that he fell into a chair, feeling like he witnessed an atomic explosion." Instead, he brought in a group of "mouthpieces" (early testers).
Among them was an NVIDIA engineer who, after briefly losing access to GPT-5.5 following early testing, made this statement:
"Losing GPT-5.5 feels like having a limb amputated."
Jokes aside, the collaboration between OpenAI and NVIDIA this time is unprecedented.
First, GPT-5.5 and the NVIDIA GB200 and GB300 NVL72 systems were co-designed. From training to deployment, the model and the hardware have been built for each other from the start.
Second, Codex has been rolled out across all of NVIDIA. Altman even posted his email exchange with Jensen Huang.
Let's look at the data regarding the results of this collaboration.
Compared to the previous version, GPT-5.4, the new model pulls ahead in three areas: coding, knowledge work, and scientific research.
Regarding the comprehensive Artificial Analysis Intelligence Index results, there are two ways to interpret them:
GPT-5.5 achieves the same score as Claude Opus 4.7 and other models but consumes fewer tokens.
Or, consuming the same number of tokens, GPT-5.5 completes more tasks.
But the most surprising part isn't the benchmark scores.
In every past model upgrade, "stronger" and "slower" were almost always sold as a package deal.
This is the price of Scaling Law: larger models, more parameters, and longer thinking times. Users pay for intelligence, but they also pay for latency.
GPT-5.5 breaks this iron law.
In real production environments, its per-token latency is comparable to GPT-5.4, yet it requires fewer tokens to complete the same tasks.
Higher efficiency, more powerful features.
(But the price has doubled.)
As of press time, the latest version of Codex is already powered by GPT-5.5.
Supercharging Programming
Programming is the field where GPT-5.5 has improved the most drastically.
With the previous generation of models, you still had to break tasks down carefully, watch the model step by step, and be ready to correct course at any moment.
GPT-5.5 is different. You throw the requirements at it, and it decomposes, executes, and checks itself. You just look at the result.
OpenAI demonstrated a 3D action game generated by GPT-5.5 under Codex, running directly in the browser.
The game includes a combat system, enemy encounters, and HUD feedback implemented in TypeScript and Three.js, with environment textures generated by GPT.
On Terminal-Bench 2.0, a hardcore test measuring complex command-line workflows, GPT-5.5 scored 82.7%.
The previous version, GPT-5.4, scored 75.1%, while the current strongest competitor, Claude Opus 4.7, scored 69.4%.
Put another way: on tasks of this difficulty, GPT-5.4 got stuck on roughly a quarter of them (24.9%) and Claude Opus 4.7 on nearly a third (30.6%). GPT-5.5 pushes that share below one-fifth (17.3%).
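To make the arithmetic explicit, the failure rates implied by the scores above can be computed directly. This is only a small illustrative sketch in Python: the scores are the ones quoted in this article, and the names `pass_rates` and `failure_rate` are made up for the example.

```python
# Terminal-Bench 2.0 pass rates as quoted in this article (percent).
pass_rates = {
    "GPT-5.5": 82.7,
    "GPT-5.4": 75.1,
    "Claude Opus 4.7": 69.4,
}


def failure_rate(pass_rate: float) -> float:
    """Share of tasks not completed, in percent, rounded to one decimal."""
    return round(100.0 - pass_rate, 1)


# Hypothetical helper table: model name -> share of tasks it failed.
failure_rates = {name: failure_rate(p) for name, p in pass_rates.items()}

for name, rate in failure_rates.items():
    print(f"{name}: fails {rate}% of tasks")
```

Read this way, the gap between GPT-5.5 and GPT-5.4 is 7.6 percentage points of tasks that no longer stall.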
Next, let's hear from various "mouthpieces":
Early tester Dan Shipper conducted an experiment. He is a startup CEO and an active AI product developer.
After his app launched, a bug appeared. He hired a top-tier engineer to refactor it. The engineer spent considerable effort and eventually provided a solution.
Then Shipper turned back the clock: he fed that buggy code to the model to see if it could independently make the same decision as that engineer.
GPT-5.4 couldn't do it. GPT-5.5 did.
Shipper said this was the first time he felt true "conceptual clarity" from a programming model.
It's not just replying; it understands the problem and figures out how to solve it on its own.
More and more senior engineers are reporting the same thing: GPT-5.5 is significantly stronger than GPT-5.4 and Claude Opus 4.7 in reasoning and autonomy.
It can identify issues in advance and predict testing and review requirements without explicit prompting.
Programming is just the beginning. This same leap in capability is spreading towards knowledge work and scientific research.
Beyond Programming
What GPT-5.5 does in Codex goes far beyond writing code. It generates documents, organizes spreadsheets, and builds slide decks.
OpenAI repeatedly emphasizes that it understands what you want better than the previous generation did.
More critically, it uses tools on its own and checks its own output for correctness. You give it a vague idea, and it completes the rest for you.
Here is an interesting statistic: over 85% of OpenAI's own employees use Codex for work every week. (What about the other 15%?)
Let's look at the evaluation results first.
On the knowledge work benchmark GDPval, GPT-5.5 scored 84.9%, which is 4.6 percentage points higher than Claude Opus 4.7.
FrontierMath Tier 4 is one of the hardest math benchmarks available today, with questions drawn from unpublished papers and from open problems posed by top researchers.
GPT-5.5 Pro achieved 39.6% on this test. Claude Opus 4.7 scored 22.9%, which puts GPT-5.5 Pro at roughly 1.7 times its score.
What's truly interesting is how scientists are using it.
Bartosz Naskręcki is an Assistant Professor of Mathematics at Adam Mickiewicz University in Poland. He wrote one sentence to Codex, and 11 minutes later, an algebraic geometry visualization application was running.
This application can draw the intersection curve of two quadric surfaces, mark it in red, and use the Riemann-Roch theorem to transform that curve into Weierstrass standard form. Later, he extended it with more stable singularity visualization features.
One sentence, 11 minutes. In the past, just setting up the project framework would take half a day.
Derya Unutmaz is a Professor of Immunology at the Jackson Laboratory for Genomic Medicine. He used GPT-5.5 Pro to analyze a gene expression dataset: 62 samples, nearly 28,000 genes. The result was a complete research report.
He said this would have originally taken his team several months.
OpenAI has a very accurate summary for GPT-5.5's positioning in research: it is no longer like a one-time answer engine, but more like a "research partner."
Early testers aren't just using it to look up information. They use it for multiple rounds of paper revision, picking out flaws in arguments point by point, and proposing new analysis plans. It remembers your entire research context; every round of dialogue builds on the previous one.
GPT-5.5 has done something major in the field of mathematics.
The subject is Ramsey numbers, one of the core problems in combinatorics.
In layman's terms, it studies: How large does a network need to be to guarantee that a certain order inevitably appears?
For example, among any six people, there must be three who all know each other or three who are all strangers to each other; this is the simplest case of Ramsey's theorem.
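The six-person claim is small enough to check by brute force. The sketch below is an illustration added here, not anything from OpenAI: it treats people as vertices, labels each pair 0 ("strangers") or 1 ("acquainted"), and tries every possible labeling. The function names are hypothetical.

```python
from itertools import combinations, product


def has_mono_triangle(n, coloring):
    """True if some three vertices have all three of their edges the same color.

    `coloring` maps each edge (i, j) with i < j to 0 or 1.
    """
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )


def every_coloring_has_mono_triangle(n):
    """Exhaustively check all 2-colorings of the complete graph on n vertices."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, colors)))
        for colors in product((0, 1), repeat=len(edges))
    )


print(every_coloring_has_mono_triangle(5))  # five people are not enough
print(every_coloring_has_mono_triangle(6))  # six people always are
```

Six people give 15 pairs and 32,768 labelings, all of which contain such a trio, while five people admit a labeling that avoids one. That is what the statement R(3,3) = 6 means. The open questions concern how these numbers grow for larger cases, where exhaustive search is hopeless.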
Ramsey numbers have been a tough nut to crack in the mathematical world for decades. In particular, the asymptotic behavior of off-diagonal Ramsey numbers has long remained unresolved.
GPT-5.5 found a new proof path. It didn't reproduce a known method but discovered a new route. The proof was subsequently verified in Lean, one of the most rigorous formal verification tools in mathematics.
An AI has made an original contribution in the core field of pure mathematics, verified by formal tools.
A year ago, this was unimaginable.
The Secret to Being Stronger Yet Faster
How was "stronger yet faster" achieved?
The answer isn't a single optimization at one stage. OpenAI tore down and rebuilt the entire inference system.
As mentioned earlier, GPT-5.5 and the NVIDIA GB200 and GB300 NVL72 systems were co-designed, resulting in a massive leap in intelligence levels under equivalent latency.
But there is another story.
Codex, powered by GPT-5.5, analyzed weeks of production traffic data and then wrote a heuristic algorithm for load balancing and partitioning.
Previously, requests were cut into a fixed number of chunks and distributed to accelerators. However, fixed chunking strategies are not always optimal under different traffic patterns. Sometimes chunks were too coarse, sometimes too fine, causing resource utilization to fluctuate wildly.
Codex looked at weeks of real traffic data and wrote its own adaptive partitioning algorithm, which dynamically adjusts the chunking strategy to match actual traffic patterns.
Token generation speed increased by over 20%.
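OpenAI has not published the heuristic itself, so the following is only a toy sketch of the general idea, contrasting a fixed chunk count with one chosen from the request size. The function names, the 512-token target, and the 16-chunk cap are all assumptions made for illustration.

```python
import math


def split_evenly(total: int, num_chunks: int) -> list[int]:
    """Split `total` tokens into `num_chunks` near-equal chunk sizes."""
    base, extra = divmod(total, num_chunks)
    return [base + (1 if i < extra else 0) for i in range(num_chunks)]


def fixed_chunks(request_tokens: int, num_chunks: int = 4) -> list[int]:
    """Fixed strategy: always the same number of chunks, whatever the size."""
    return split_evenly(request_tokens, num_chunks)


def adaptive_chunks(
    request_tokens: int,
    target_chunk: int = 512,  # assumed ideal chunk size, in tokens
    max_chunks: int = 16,     # assumed number of available accelerators
) -> list[int]:
    """Adaptive strategy: pick the chunk count so chunks stay near the target."""
    wanted = math.ceil(request_tokens / target_chunk)
    num_chunks = max(1, min(max_chunks, wanted))
    return split_evenly(request_tokens, num_chunks)


for tokens in (100, 8192):
    print(tokens, fixed_chunks(tokens), adaptive_chunks(tokens))
```

In this toy version, a 100-token request is needlessly cut into four 25-token pieces by the fixed strategy but stays whole under the adaptive one, while an 8,192-token request goes from four coarse 2,048-token chunks to sixteen 512-token chunks. A production heuristic would presumably derive its targets from observed traffic, not from constants.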
The model optimized the infrastructure running itself; AI is making itself run faster.
The result comes from two things stacked together: a rebuilt inference system and a model that takes part in its own optimization.
OpenAI says this is "a step towards a new way of getting work done with computers."
But when models have already started optimizing the infrastructure they run on, just how far has this step gone?
One More Thing
With GPT-5.5, OpenAI expects the release cadence for models to accelerate going forward.
"We see quite significant progress in the short term and extremely significant progress in the medium term.
I think progress in the past few years has been unexpectedly slow."
These words were spoken by Chief Scientist Jakub Pachocki during a conference call with reporters.
Reference Links:
[1] https://openai.com/index/introducing-gpt-5-5/
[2] https://x.com/firstadopter/status/2047378435555651856?s=20