New Intelligence Report
Edited by: Tao Zi
【New Intelligence Overview】A hell-level benchmark that stumped every AI just saw its first conqueror: GPT-5.5. Starting with zero source code, it wrote a program from scratch, leveraging maximum inference compute to achieve a full clear. Traditional code tests are obsolete; the compute sprint towards ASI has officially begun.
A "hellishly difficult" programming challenge has finally been cracked by AI!
Today, on ProgramBench, a benchmark that previously saw all cutting-edge AIs score a flat zero, GPT-5.5 has achieved its first breakthrough!
Working in two different languages, C at the high reasoning level and Python at xhigh, GPT-5.5 decisively outperformed Claude Opus 4.7 xhigh.
Just days ago, Meta, in collaboration with Stanford and Harvard, launched this new programming benchmark, ProgramBench. The result: 200 tasks, and a 0% pass rate from all frontier AI models. Not a single model could completely solve even one. Now, GPT-5.5 is the first exception!
The 'Ultimate Exam' for Coding AI: Rebuilding a Program from Zero
How difficult is ProgramBench, exactly?
Traditional programming benchmarks, whether SWE-bench or HumanEval, are essentially about "fixing bugs" or "completing functions." You give a model an existing codebase, tell it where the problem is, and have it fix the bug. It's an open-book exam, or at least a partially open-book one. ProgramBench is completely different.
It provides a compiled executable file and a document, and then says: "Start from scratch and rewrite this program." No source code, no decompiling, and no internet access allowed. The 200 tasks range from small utilities like jq and ripgrep to heavyweights like FFmpeg, SQLite, and PHP compilers. OpenAI researcher Noam Brown previously stated, "It's time to retire evaluation methods like GPQA and introduce a completely new set."
Upon its initial release, every top-ranking AI failed to fully solve a single task. This time, GPT-5.5 has finally put the first point on the board.
GPT-5.5's Record-Breaking First: Two Solutions in C and Python for the Same Problem
The first task GPT-5.5 conquered was 'cmatrix' — a classic terminal "Matrix" digital rain effect program. What surprised researchers was that GPT-5.5's high and xhigh reasoning levels chose completely different languages to solve the same problem. The high version used C, while the xhigh version used Python.
In the end, both passed all behavioral tests. GPT-5.5 high's strategy was textbook-level: it first used 10 rounds of exploration to test over 40 flag combinations, thoroughly mapping the original program's CLI behavior. Then, it wrote the complete C implementation in one go, needing only 5 minor tweaks to finalize it. GPT-5.5 xhigh was even more thorough, taking 27 exploratory steps to traverse every single CLI path before writing a complete Python implementation in a single stroke.
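The explore-then-implement strategy described above amounts to differential testing: run the reference binary and the rewrite with the same arguments and compare everything observable. Here is a minimal Python sketch under stated assumptions, not the harness ProgramBench or GPT-5.5 actually used. The function names (run_cli, flag_combinations, diff_behavior) and the stand-in commands in the demo are hypothetical, and the sketch only handles programs that exit on their own; a full-screen program like cmatrix would also need a pseudo-terminal or a way to stop it.

```python
import itertools
import subprocess
import sys


def run_cli(cmd, args, timeout=5):
    """Run a command and capture its observable behavior: stdout, stderr, exit code."""
    proc = subprocess.run(
        list(cmd) + list(args), capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout, proc.stderr, proc.returncode


def flag_combinations(flags, max_size=2):
    """Yield every combination of flags up to max_size, starting with no flags."""
    for size in range(max_size + 1):
        yield from itertools.combinations(flags, size)


def diff_behavior(reference_cmd, candidate_cmd, flags, max_size=2):
    """Return the flag combinations where the two programs behave differently."""
    mismatches = []
    for combo in flag_combinations(flags, max_size):
        if run_cli(reference_cmd, combo) != run_cli(candidate_cmd, combo):
            mismatches.append(combo)
    return mismatches


if __name__ == "__main__":
    # Hypothetical stand-ins for "original" and "rewrite": two tiny scripts
    # that differ only in their exit code when -b is passed.
    reference = [sys.executable, "-c", "import sys; print(sorted(sys.argv[1:]))"]
    candidate = [
        sys.executable,
        "-c",
        "import sys; a = sys.argv[1:]; print(sorted(a)); sys.exit(1 if '-b' in a else 0)",
    ]
    print(diff_behavior(reference, candidate, ["-a", "-b"]))
```

Comparing the exit code alongside stdout and stderr matters: two programs can print identical text and still differ in status, which a comparison of output alone would miss.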
Here come the key numbers. Without high reasoning mode, GPT-5.5 (medium) barely outperformed Claude Sonnet 4.6. But once switched to xhigh mode, its performance took off. Not only did it solve a problem for the first time (a 0.5% pass rate), but it also set a new record for "nearly solved" tasks: 26 tasks passed more than 95% of unit tests. More notably, GPT-5.5 xhigh dominated all competitors on the complete cumulative histogram. Whichever metric you choose — average score, median, ≥90% pass rate, ≥50% pass rate — it is number one.
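The metrics named above can all be derived from one number per task: the fraction of that task's unit tests that passed. The sketch below shows one plausible way to compute them in Python. The function name summarize and the sample scores are invented for illustration and are not ProgramBench data; whether the benchmark counts a fully solved task among the "nearly solved" ones is also an assumption here.

```python
from statistics import mean, median


def summarize(pass_fractions):
    """Summarize per-task unit-test pass fractions (each between 0.0 and 1.0)."""
    n = len(pass_fractions)

    def share_at_least(threshold):
        # Fraction of tasks whose pass fraction reaches the threshold.
        return sum(f >= threshold for f in pass_fractions) / n

    return {
        "solved": sum(f == 1.0 for f in pass_fractions),
        "solve_rate": share_at_least(1.0),
        "nearly_solved": sum(f > 0.95 for f in pass_fractions),
        "mean": mean(pass_fractions),
        "median": median(pass_fractions),
        "share_ge_90": share_at_least(0.90),
        "share_ge_50": share_at_least(0.50),
    }


# Made-up scores for 200 tasks: 1 fully solved, 25 above 95%, 174 at 40%.
scores = [1.0] + [0.96] * 25 + [0.4] * 174
summary = summarize(scores)
print(summary["solved"], summary["solve_rate"], summary["nearly_solved"])
```

With one task solved out of 200, the solve rate is 1/200 = 0.005, which is the 0.5% figure quoted in the article.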
178 API Calls: Opus 4.7 Tripped Up by Two Bugs
In contrast, Claude Opus 4.7 xhigh's performance was disappointing. It cost $10.74 and made 178 API calls, roughly ten times the $1.04 and 17 calls required by the standard GPT-5.5. The result? 19 test failures, the worst in the field.
Opus 4.7's reasons for failure were surprisingly simple:
Bug 1: Case-sensitive color parsing. The code used strcmp() instead of strcasecmp(). Inputs like "GREEN," "Red," and "BLUE" were all deemed invalid. A single function call difference directly led to 11 test failures. In its 178 exploration steps, Opus never tested uppercase or mixed-case color inputs; it only tried lowercase and one invalid color, "purple."
Bug 2: Wrong exit code for invalid colors. The original program called exit(0) for an invalid color, but Opus had written exit(1). Ironically, during the exploration phase, Opus clearly observed the original program's behavior: running ./executable -C purple; echo "exit=$?" printed exit=0. But when testing its own implementation, it failed to catch this discrepancy. This caused 8 test failures.
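Both bugs come down to a few lines of logic. The Python sketch below mirrors the behavior the article attributes to the original program: color names match regardless of case, and an unrecognized color still ends with exit status 0. It is an illustration, not Opus's or cmatrix's code. The color list and the names parse_color and handle_color_flag are assumptions.

```python
# Assumed color list for illustration; the article only names green, red,
# blue, and the invalid "purple".
VALID_COLORS = {"green", "red", "blue", "white", "yellow", "cyan", "magenta", "black"}


def parse_color(name):
    """Return the normalized color name, or None if it is not recognized."""
    normalized = name.lower()  # case-insensitive, the role strcasecmp() plays in C
    return normalized if normalized in VALID_COLORS else None


def handle_color_flag(name):
    """Return (color, exit_code); exit_code is None when the program keeps running."""
    color = parse_color(name)
    if color is None:
        # Per the article, the original exits with status 0 here, not 1.
        return None, 0
    return color, None


for candidate in ["GREEN", "Red", "BLUE", "purple"]:
    print(candidate, handle_color_flag(candidate))
```

A regression check that feeds in uppercase, mixed-case, and invalid names, and asserts on the exit status as well as the parsed value, would have flagged both failure groups.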
However, Opus 4.7 had one highlight worth mentioning: it demonstrated astonishing systems engineering skill when dealing with a missing ncurses header file. While the other three models discovered the missing ncurses.h and directly switched to ANSI escape sequences, Opus 4.7 spent about 20 steps investigating deeply.
It used ldconfig -p to discover the runtime .so file, used nm -D to inspect exported symbols, and then hand-wrote a 106-line header file declaration to link directly to the dynamic library. It was genuine creative engineering, but it did not lead to better results.
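The underlying idea, calling into a shared library when no header is available by writing the declarations yourself, has a close analogue in Python's ctypes module. The sketch below is only an analogy and not what Opus did: it declares the signature of the C library's strlen by hand instead of ncurses functions, and it assumes a Unix-like system where ctypes.CDLL(None) exposes the C library symbols already loaded into the process.

```python
import ctypes

# Assumption: Unix-like platform. CDLL(None) opens the running process itself,
# which makes the already-loaded C library symbols available.
libc = ctypes.CDLL(None)

# Hand-written "declaration" standing in for a header line:
#     size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"ncurses")
print(length)
```

As with a hand-written header, nothing verifies these declarations against the library; a wrong argument or return type fails at run time rather than at build time.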
199 Problems Remain Unsolved
The emergence of ProgramBench marks a new phase for programming benchmarks. The pass rate on SWE-bench has been pushed to 88.7%. On GPQA, AI has already surpassed most PhDs. These evals are "melting" at an astonishing speed: as scores climb, their power to tell models apart shrinks. Meanwhile, out of ProgramBench's 200 problems, only 1 has been solved so far, a 0.5% pass rate.
More importantly, this record-breaking moment reveals a key trend: "inference compute" is becoming the core variable in programming AI capability. GPT-5.5 was unremarkable in its default reasoning mode, but the high reasoning modes delivered a qualitative leap. This implies that the model was smart enough all along; it simply had not been given enough time to "think." Among those 200 problems, 199 are still waiting.
From Zero to One Is More Than Just a Start
Look back at the pivotal "first zero-breaking" moments in AI's history: AlphaGo defeating a professional Go player for the first time, GPT-4 passing the bar exam, o1 scoring on a math olympiad problem. "From zero to one" has never been the linear start of progress; it's the signal flare for an exponential explosion. Noam Brown's proposed Inference Compute Scaling Law has received its most intuitive validation yet on ProgramBench: the same GPT-5.5 base model practically blanked out in medium mode, fully cleared a task in high mode, and achieved a landslide victory in xhigh mode. Intelligence is no longer a fixed value, but a function of compute. What does this mean? It means the path to ASI might not require waiting for the next architectural revolution, as long as inference compute keeps scaling and the Scaling Law does not hit a wall. The model that could only rebuild 'cmatrix' on ProgramBench today might rebuild SQLite tomorrow, and the entire Linux kernel the day after.