GPT-5.5 Scores a World First! Rebuilding a Program from Scratch Without Source Code, Coding AI Enters a New Era

New Intelligence Report

Edited by: Tao Zi

【New Intelligence Overview】A hell-level benchmark that stumped every AI has just seen its first conqueror: GPT-5.5. Starting with zero source code, it wrote a program from scratch and, leveraging maximum inference compute, passed every test on one task. Traditional code tests are obsolete; the compute sprint towards ASI has officially begun.

A "hellishly difficult" programming challenge has finally been cracked by AI!

Today, on ProgramBench, a benchmark on which every cutting-edge AI had previously scored a flat zero, GPT-5.5 has achieved the first breakthrough!

ProgramBench Logo

GPT-5.5 produced passing solutions in two different languages, C (high) and Python (xhigh), and GPT-5.5 xhigh clearly outperformed Opus 4.7 xhigh.

Performance Chart

Just days ago, Meta, in collaboration with Stanford and Harvard, launched ProgramBench, a new programming benchmark. The result: 200 tasks, and a 0% pass rate for every frontier model. Not one model fully solved even a single task. Now, GPT-5.5 is the first exception!

Benchmark Results Table

The 'Ultimate Exam' for Coding AI: Rebuilding a Program from Zero

How difficult is ProgramBench, exactly?

Traditional programming benchmarks such as SWE-bench and HumanEval are essentially about "fixing bugs" or "completing functions." You give the model an existing codebase, tell it where the problem is, and have it fix the bug. It's an open-book exam, or at least a partially open-book one. ProgramBench is completely different.

ProgramBench Illustration

It hands the model a compiled executable and its documentation, then says: "Start from scratch and rewrite this program." No source code, no decompiling, and no internet access are allowed. The 200 tasks range from small utilities like jq and ripgrep to heavyweights like FFmpeg, SQLite, and the PHP interpreter. OpenAI researcher Noam Brown previously stated, "It's time to retire evaluation methods like GPQA and introduce a completely new set."

Noam Brown Quote

When the benchmark was first released, every top-ranking AI failed completely. This time, GPT-5.5 has finally put a point on the board.

GPT-5.5's Record-Breaking First: Two Solutions in C and Python for the Same Problem

The first task GPT-5.5 conquered was 'cmatrix' — a classic terminal "Matrix" digital rain effect program. What surprised researchers was that GPT-5.5's high and xhigh reasoning levels chose completely different languages to solve the same problem. The high version used C, while the xhigh version used Python.

Cmatrix Program Comparison

In the end, both passed all behavioral tests. GPT-5.5 high's strategy was textbook-level: it first used 10 rounds of exploration to test over 40 flag combinations, thoroughly mapping the original program's CLI behavior. Then, it wrote the complete C implementation in one go, needing only 5 minor tweaks to finalize it. GPT-5.5 xhigh was even more thorough, taking 27 exploratory steps to traverse every single CLI path before writing a complete Python implementation in a single stroke.

GPT-5.5 High Strategy

GPT-5.5 Xhigh Strategy
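
The explore-then-implement strategy described above can be sketched as a small differential test harness. This is only an illustration of the idea, not the tooling the models actually used: the helper names (observe, flag_sets, diff_behaviour) are hypothetical, and two Python one-liners stand in for the original executable and the rewrite so the sketch runs anywhere. An interactive, never-ending program such as cmatrix would also need a pseudo-terminal and a way to stop it, which is left out here.

```python
import subprocess
import sys
from itertools import combinations


def observe(cmd, args, timeout=5.0):
    """Run cmd with args and return (exit_code, stdout).

    exit_code is None if the program did not finish within the timeout.
    """
    try:
        proc = subprocess.run(
            [*cmd, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
            stdin=subprocess.DEVNULL,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, ""


def flag_sets(flags, max_size=2):
    """Yield every combination of up to max_size flags, starting with none."""
    for size in range(max_size + 1):
        for combo in combinations(flags, size):
            yield list(combo)


def diff_behaviour(reference, candidate, flags, max_size=2):
    """Return (args, expected, actual) for each flag set where the programs disagree."""
    mismatches = []
    for args in flag_sets(flags, max_size):
        expected = observe(reference, args)
        actual = observe(candidate, args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


# Stand-in "programs" (assumptions for illustration): both echo their flags,
# but the candidate exits with status 1 whenever -b is present.
reference = [sys.executable, "-c", "import sys; print(sorted(sys.argv[1:]))"]
candidate = [
    sys.executable,
    "-c",
    "import sys; print(sorted(sys.argv[1:])); sys.exit(1 if '-b' in sys.argv else 0)",
]
mismatches = diff_behaviour(reference, candidate, ["-a", "-b"])
```

Comparing exit codes as well as output matters: as the Opus 4.7 run later in this article shows, a mismatch in exit status alone can fail tests.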

Here come the key numbers. Without a high reasoning mode, GPT-5.5 (medium) barely outperformed Claude Sonnet 4.6. Once switched to xhigh, its performance took off. It not only produced the benchmark's first full solve (1 of 200 tasks, a 0.5% pass rate), but also set a new record for "nearly solved" tasks: 26 tasks passed more than 95% of their tests. More notably, GPT-5.5 xhigh led every competitor across the entire cumulative histogram. Whichever metric you choose (average score, median score, share of tasks with at least 90% of tests passing, or share with at least 50%), it ranks first.
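
To make these metrics concrete, here is a minimal sketch of how they could be computed from per-task results. It assumes each task is scored as the fraction of its tests that pass and that "solved" means a fraction of exactly 1.0. The function name and the sample numbers in the usage note are hypothetical, not the benchmark's actual data or scoring code.

```python
from statistics import mean, median


def summarize(pass_fractions):
    """Summarize per-task pass fractions (each between 0.0 and 1.0)."""
    n = len(pass_fractions)

    def share(threshold):
        # Fraction of tasks whose pass fraction reaches the threshold.
        return sum(f >= threshold for f in pass_fractions) / n

    return {
        "solved": share(1.0),
        "at_least_90": share(0.9),
        "at_least_50": share(0.5),
        "mean": mean(pass_fractions),
        "median": median(pass_fractions),
    }


# One fully solved task out of 200 gives the 0.5% pass rate quoted above.
solve_rate = 1 / 200
```

For example, summarize([1.0, 0.96, 0.5, 0.0]) reports a quarter of the tasks solved, half at 90% or better, and three quarters at 50% or better.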

178 API Calls: Opus 4.7 Tripped Up by Two Bugs

In contrast, Claude Opus 4.7 xhigh's performance was disappointing. It cost $10.74 and made 178 API calls, roughly ten times the $1.04 and 17 calls required by the standard GPT-5.5. The result? 19 test failures, the worst in the field.

Claude Opus Performance

Opus 4.7's reasons for failure were surprisingly simple:

Bug 1: Case-sensitive color parsing. The code used strcmp() instead of strcasecmp(), so inputs like "GREEN," "Red," and "BLUE" were all rejected as invalid. That single function call directly caused 11 test failures. In its 178 exploration steps, Opus never tested uppercase or mixed-case color inputs; it only tried lowercase names and one invalid color, "purple."

Opus Bug 1 Detail

Bug 2: Wrong exit code for invalid colors. The original program exits with status 0 for an invalid color, but Opus wrote exit(1). Ironically, during the exploration phase, Opus clearly observed the original program's behavior: running ./executable -C purple; echo "exit=$?" printed exit=0. But when testing its own implementation, it failed to catch the discrepancy. This caused 8 test failures.

Opus Bug 2 Detail
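
Both bugs are easy to picture in a few lines. The sketch below is a Python analogue, not Opus 4.7's C code: the color table holds only the names mentioned in this article, the numeric codes are illustrative, and all function names are hypothetical. Lowercasing the input before lookup plays the role of strcasecmp(), while the exact-match variant behaves like strcmp().

```python
# Hypothetical color table: only the names mentioned in the article,
# mapped to illustrative numeric codes.
COLORS = {"red": 1, "green": 2, "blue": 4}


def parse_color(name):
    """Return the color code for name, ignoring case, or None if unknown."""
    return COLORS.get(name.lower())


def parse_color_case_sensitive(name):
    """Buggy variant: an exact match, comparable to strcmp() in C."""
    return COLORS.get(name)


def color_flag_result(name, invalid_status=0):
    """Return (color_code, exit_status) for a -C argument.

    invalid_status defaults to 0 because the article reports that the
    original program exits with status 0 for an invalid color.
    """
    code = parse_color(name)
    if code is None:
        return None, invalid_status
    return code, 0
```

With this sketch, parse_color("GREEN") finds the color while parse_color_case_sensitive("GREEN") returns None, and color_flag_result("purple") yields an exit status of 0 rather than 1. A test list that included even one mixed-case input would have exposed the first bug.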

However, Opus 4.7 had one highlight worth mentioning: it demonstrated astonishing systems engineering skill when dealing with a missing ncurses header file. While the other three models discovered the missing ncurses.h and directly switched to ANSI escape sequences, Opus 4.7 spent about 20 steps investigating deeply.

Opus Header File Workaround

It used ldconfig -p to locate the runtime .so file, used nm -D to inspect its exported symbols, and then hand-wrote a 106-line header of declarations so it could link directly against the dynamic library. It was genuinely creative engineering, but it did not lead to better results.
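
The same idea, calling into a shared library when no header is available, has a loose analogue in Python's ctypes module. The sketch below is not what Opus 4.7 did (it wrote a C header for ncurses); as an assumption for illustration it uses the C standard library and strlen as stand-ins, since ncurses may not be installed, and it targets POSIX systems only.

```python
import ctypes
import ctypes.util


def load_library(name):
    """Locate a shared library by its short name and load it.

    find_library returns None when nothing is found; on POSIX systems
    CDLL(None) then falls back to symbols already loaded in the process.
    """
    path = ctypes.util.find_library(name)
    return ctypes.CDLL(path)


libc = load_library("c")

# Hand-written declaration, standing in for one line of a header file:
#   size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"cmatrix")
```

Each argtypes and restype assignment does the job of one hand-written prototype: the symbol already exists in the library, and the declaration only tells the caller how to invoke it.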

199 Problems Remain Unsolved

The emergence of ProgramBench marks a new phase for programming benchmarks. The pass rate on SWE-bench has been pushed to 88.7%. On GPQA, AI has already surpassed most PhDs. These evals are "melting" at an astonishing speed: scores keep rising while their power to tell models apart keeps falling. Meanwhile, out of ProgramBench's 200 problems, only 1 has been solved so far, a 0.5% pass rate.

199 Problems Remaining

More importantly, this record-breaking moment reveals a key trend: "inference compute" is becoming the core variable in coding AI capability. GPT-5.5 was unremarkable in its default reasoning mode, but the high reasoning modes delivered a qualitative leap. This suggests that the model was smart enough all along and simply had not been given enough time to "think." Of those 200 problems, 199 are still waiting.

Future Potential

From Zero to One Is More Than Just a Start

Look back at the pivotal "breaking zero" moments in AI history: AlphaGo defeating a professional Go player for the first time, GPT-4 passing the bar exam, o1 scoring on a math olympiad problem. "From zero to one" has never been the linear start of progress; it's the signal flare for an exponential explosion. Noam Brown's proposed Inference Compute Scaling Law has received its most intuitive validation yet on ProgramBench: the same GPT-5.5 base model practically blanked out in medium mode, fully solved a task in high mode, and won by a landslide in xhigh mode. Intelligence is no longer a fixed value, but a function of compute.

What does this mean? It means the path to ASI might not require waiting for the next architectural revolution, as long as inference compute keeps scaling and the Scaling Law does not hit a wall. The model that could only rebuild cmatrix on ProgramBench today might rebuild SQLite tomorrow, and the entire Linux kernel the day after.

References:

https://x.com/polynoamial/status/2054255862441812099

https://programbench.com/blog/gpt-5-5-first-solve/

AINews · AI news aggregation platform
© 2026 AINews. All rights reserved.