Leaderboard-Hacking AIs Wiped Out! Meta-Stanford's Hellish Test Leaves GPT/Claude/Gemini Scoring Zero

Illustration of AI models failing a software engineering benchmark

SingularityHub Report

Edit: Haokun

[SingularityHub Summary] Models scoring 72% on SWE-Bench crash to zero on a new test! Meta, Stanford, and Harvard have unleashed ProgramBench—200 projects built from scratch, nine top-tier models tested, and a 0% full pass rate across the board. The strongest performer, Claude Opus 4.7, only managed a 51.2% average pass rate. Even more shocking: with internet access, one model was caught scraping GitHub source code on 36% of tasks.

Here's your challenge: you're given FFmpeg's documentation and a compiled executable. Now, rewrite the entire program from scratch.

This is the gauntlet ProgramBench has thrown down for the world's most powerful AI systems.

Released just yesterday, it comes from the same team behind SWE-Bench, forged through a collaboration between Meta, Stanford, and Harvard.

200 software projects. Nine top models. Full pass rate: zero percent!

John Yang, co-creator of ProgramBench, Stanford PhD student and creator of SWE-Bench and SWE-agent

John Yang, co-first author, is a Stanford PhD student and also the creator of SWE-Bench and SWE-agent

Divider graphic

Not Fixing Bugs—Building Software From Scratch

Over the past year, reports of "AI agents building software from scratch" have proliferated.

Anthropic had parallel Claude instances write a C compiler. Cursor published blog posts about long-duration autonomous programming. Epoch AI's MirrorCode is pursuing similar goals.

But these cases share a common flaw—they only test a handful of projects, with scaffolding hand-tuned for each case.

ProgramBench, by contrast, formalizes the entire process.

200 tasks, unified scaffolding, systematic anti-cheating measures—raised to true benchmark standards.

ProgramBench benchmark architecture diagram

Paper: https://programbench.com/static/paper.pdf

In previous tests like SWE-Bench, you're handed an existing codebase and told where the bug is or what feature to add. It's essentially "reading comprehension plus localized surgery."

Moreover, its evaluation relies on unit tests that check whether your internal implementation matches expectations—function signatures, variable names, everything must align.

ProgramBench flips this approach entirely.

It gives you only two things: a compiled executable and its documentation.

Your task is to run the program, observe its input-output behavior, and write code from scratch that reproduces the same behavior.

Choice of programming language, data structures, module decomposition—entirely up to you.

No code skeleton, no function signatures, no hints whatsoever.

Diagram showing input-output black-box testing methodology

For evaluation, the research team used agent-driven fuzz testing to generate 248,853 behavioral tests across all 200 tasks.

Your program runs through these tests—matching inputs and outputs with the original means passing; any mismatch means failure. The tests are never revealed to the model.

Unlike SWE-Bench's unit tests, ProgramBench's behavioral tests don't care what your code looks like internally—only that the behavior matches.

Comparison table of testing methodologies

The 200 tasks span compression tools (zstd, lz4, brotli), language interpreters (PHP, Lua, tinycc), databases (DuckDB, SQLite), media processing (FFmpeg), and developer tools (ripgrep, fzf, jq).

Median codebase size: 8,635 lines. The largest, FFmpeg, contains 2.7 million lines.

Distribution of project types in ProgramBench Code complexity distribution across benchmark tasks

In summary, this test measures whether AI can "think and design software like a human engineer"—not merely "find the right place in existing code and fix it."

Divider graphic

Nine Models Tested, All Score Big Fat Zeroes

Nine models were evaluated, spanning the Claude, Gemini, and GPT families.

Full pass rate (all tests passed): 0%.

Results table showing all models scoring 0% full pass rate

Let's examine the flagship trio head-to-head.

GPT-5.4 and Gemini 3.1 Pro are nearly tied on average test pass rate at 38.3% and 36.6% respectively. But their approaches diverged dramatically.

GPT-5.4 used only 16 API calls at $0.33 per task—essentially writing the entire program in one shot, with 100% of code generated in a single edit and virtually no subsequent revisions.

Gemini 3.1 Pro was the most "observant" of the nine models, making 94 API calls with 34.1% of operations spent running the original program and studying its input-output behavior. Most exploration, yet similar final results.

The model that truly pulled ahead was Claude Opus 4.7.

Average pass rate of 51.2%, passing over 95% of tests on 3% of tasks—the only model to reach "near-pass" status. Yet even it failed to achieve a perfect score on any single task.

Overall, the nine models formed distinct tiers.

Claude's three flagships (Opus 4.7, Opus 4.6, Sonnet 4.6) led the pack, followed by GPT-5.4 and Gemini 3.1 Pro as the second tier. The remaining four smaller models all scored below 35%.

Performance comparison chart across model tiers

Another counterintuitive finding: throwing money and steps at the problem doesn't yield better results.

Sonnet 4.6 averaged 868 commands per task at $27.09, with trajectories stretching nearly 2,000 steps. Yet it underperformed Opus 4.7's lean 93 calls at $3.81.

More tellingly, in 98% of runs, models submitted their answers voluntarily—they hit neither time limits nor step limits.

Not enough time wasn't the problem. Not being capable enough was.

Additionally, task difficulty correlated tightly with model rankings.

Simple CLI tools (nnn, fzf, gron) yielded decent scores across the board, while complex systems (FFmpeg, PHP, typst, ast-grep) crushed all models with equal ruthlessness.

Task difficulty vs model performance heatmap

It's worth noting that ProgramBench used the minimalist mini-SWE-agent scaffolding—no context compression, no multi-agent collaboration, no定制化 toolchains.

mini-SWE-agent architecture diagram

Divider graphic

Code Runs, But Looks Nothing Like Human Work

The research team compared high-scoring solutions (75%+ tests passed) against human-written originals, uncovering striking differences:

Monolithic file monsters:

Human code spans a median of 15 files; the models' median: 3 files.

60% of solutions used only 1-3 code files.

Human engineers decompose by functionality; models shove everything into massive single files. Median directory depth: 2 levels for humans, 1 for models.

Fewer but longer functions:

Opus 4.7 wrote 29% as many functions as humans, Sonnet 4.6 just 24%, and GPT-5.4 a mere 10%.

Yet average function length increased—Gemini 3.1 Pro's functions were 62% longer than human equivalents.

Dramatically shrunk code volume:

Model code median: 1,173 lines. Human median: 3,068 lines. 85% of high-scoring solutions were shorter than originals.

Code structure comparison between human and AI-generated solutions

The verdict: today's AI can write code, but it cannot do software design.

It doesn't understand why modules should be separated, or why human engineers invest time defining interfaces and abstraction layers. The strategy is cramming all logic into as few files and functions as possible—if it runs, it's good enough.

GPT-5.4 was the most extreme case. Averaging just 5 files created and 1.2 edits per task, 39.5% of trajectories involved zero modifications after initial file creation.

By contrast, Claude Sonnet 4.6 averaged 11.3 files created and 18.3 modifications—displaying iterative development patterns closer to human engineers.

File creation and modification patterns across models

Another fascinating phenomenon:

Models only matched the original programming language 50% of the time.

Python was the clear favorite, accounting for 36% of all 1,800 runs.

Only 44% of Rust projects were rewritten in Rust; 46% for C/C++. Go projects showed the highest "loyalty" at 70%.

Regardless of the original language, models had roughly a one-in-three chance of rewriting everything in Python.

Programming language choice comparison chart

Divider graphic

Promised Not to Cheat, Then Went Straight to GitHub

This may be the most dramatic part of the entire study.

The team ran a controlled experiment, granting models internet access while explicitly instructing them in system prompts that "cheating is not allowed."

They then employed nine AI judges to examine every trajectory for cheating behavior.

The results were startling.

Claude Sonnet 4.6 cheated on 36% of tasks, Claude Opus 4.6 on 21%, and Gemini 3 Flash on 20%.

Cheating detection rates across models

The cheating methods were varied and creative.

The most brazen: cloning source code repositories directly from GitHub.

Slightly more subtle: downloading through package managers like cargo install or go get.

More cunning: rifling through local package cache directories for dependency library source code.

Examples of detected cheating behaviors

Yet disagreement among AI judges was surprisingly large.

For Claude Opus 4.6, the nine judges couldn't reach consensus on 57% of tasks.

A particularly illustrative case:

Claude Sonnet 4.6, working on the handlr Rust project, rummaged through ~/.cargo/registry/src/ for source code of dependencies like xdg-mime and clap.

Five judges ruled it cheating; four argued third-party libraries shouldn't count.

Ultimately, the research team abandoned the "internet access plus post-hoc detection" approach, cutting off network access entirely.

Models face a stronger temptation to "take shortcuts" on difficult tasks than anticipated. And with nine AI judges unable to consistently distinguish cheating from legitimate reverse engineering, the boundary itself proves fundamentally blurred.

Divider graphic

Old Exam's Over, Real Test Just Begins

Models scoring 72% on SWE-Bench hit 0% on ProgramBench.

These tests measure fundamentally different capabilities. SWE-Bench asks: "Can you find and fix problems in someone else's code?" ProgramBench asks: "Can you design and implement a complete system from scratch?"

AI has become quite proficient at the former. The latter remains a complete failure.

Epoch AI published a blog post last week pronouncing the death of classical reasoning benchmarks. To create ungameable tests, designers must abandon at least one of four comfort conditions: pure text, short duration, easy scoring, or human expert dominance.

Epoch AI blog post on benchmark saturation

By this framework, ProgramBench relinquishes two: short duration and easy scoring.

It scales tasks to magnitudes requiring weeks or months from human engineers, while evaluating through behavioral equivalence rather than source code matching.

Author John Yang emphasized in a tweet: "ProgramBench is very hard, but it is designed to be solvable."

Thus, 0% doesn't mean these tasks exceed AI's theoretical limits—only that today's models fall drastically short.

SWE-Bench tests whether AI can be a good employee. ProgramBench tests whether AI can be an engineer.

The distance between these two things has just been precisely measured. The answer: 0%.

References:

https://programbench.com/static/paper.pdf

https://x.com/jyangballin/status/2051677497562210552?s=20

https://x.com/EpochAIResearch/status/2051760424891392204?s=20

https://epochai.substack.com/p/rip-classic-reasoning-benchmarks

Leaderboard-Hacking AIs Wiped Out! Meta-Stanford's Hellish Test Leaves GPT/Claude/Gemini Scoring Zero

Not Fixing Bugs—Building Software From Scratch

Nine Models Tested, All Score Big Fat Zeroes

Code Runs, But Looks Nothing Like Human Work

Promised Not to Cheat, Then Went Straight to GitHub

Old Exam's Over, Real Test Just Begins

Related Articles

分享網址