AI Defeats Humans in a Scientific Research Competition for the First Time! Opus 4.7 Sets a World Record with a 2930-Step Sprint

Reported by Synced

Editor: KingHZ

【Synced Insight】Prime Intellect locked Opus 4.7 and GPT 5.5 into an H200 cluster, gave them zero human guidance, and let them run about 10,000 experiments. The result: for the first time, an AI has broken a world record in a scientific research competition. The Rubicon of recursive self-improvement was crossed at step 2,930.

After 14,000 H200 compute hours and roughly 10,000 iterations, an AI has broken a world record held by humans.

Over the past two weeks, the Prime Intellect lab did one thing: they threw Opus 4.7 and Codex (based on GPT 5.5) into an H200 cluster, cut off all human guidance, and let them autonomously optimize a nanoGPT speedrun.

14,000 H200 computing hours, around 10,000 iterations, 23.9 billion tokens in thinking trajectories.

The outcome: Opus 4.7 broke the world record of 2,990 steps held by top human developers, finishing in just 2,930 steps; Codex followed closely with 2,950 steps.
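A quick back-of-the-envelope check puts these figures in proportion. The sketch below uses only the numbers quoted above and assumes (the article does not say) that the compute, iteration, and token counts are combined totals across both agents. The variable and function names are illustrative, not taken from the project's code.

```python
# Figures as quoted in the article; all are approximate.
H200_HOURS = 14_000          # total H200 compute hours
ITERATIONS = 10_000          # "around 10,000 iterations"
THINKING_TOKENS = 23.9e9     # tokens in thinking trajectories
HUMAN_RECORD_STEPS = 2_990   # previous record held by human developers
OPUS_STEPS = 2_930
CODEX_STEPS = 2_950


def relative_gain(baseline: int, result: int) -> float:
    """Fraction of training steps saved relative to the baseline."""
    return (baseline - result) / baseline


# Average cost of one iteration, assuming the totals cover both agents.
hours_per_iteration = H200_HOURS / ITERATIONS
tokens_per_iteration = THINKING_TOKENS / ITERATIONS

opus_gain = relative_gain(HUMAN_RECORD_STEPS, OPUS_STEPS)
codex_gain = relative_gain(HUMAN_RECORD_STEPS, CODEX_STEPS)

print(f"H200-hours per iteration: {hours_per_iteration:.2f}")
print(f"Thinking tokens per iteration: {tokens_per_iteration:,.0f}")
print(f"Opus step reduction: {opus_gain:.2%}")
print(f"Codex step reduction: {codex_gain:.2%}")
```

On these assumptions, each iteration averaged about 1.4 H200-hours and roughly 2.4 million thinking tokens, and Opus's 60-step margin amounts to about 2% of the previous record's step count.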

An AI has defeated humans in a scientific research competition for the first time. Completely without human intervention. Open-source and reproducible.

Project homepage: https://www.primeintellect.ai/auto-nanogpt

Code repository: https://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning

The one piece still missing is genuine novelty in the research itself.

But this result marks only the floor of what AI can do; future progress is likely to be more striking still.

When intelligence is granted near-infinite computing power and the autonomy to experiment, how much longer can the 'intuition' and 'inspiration' humanity prides itself on endure against AI's exhaustive search and evolution?

Two AIs Were Locked in a Server Room and Ran 10,000 Experiments

First, let's outline the rules.

The nanoGPT speedrun is an AI benchmark initiated by Keller Jordan in which participants compete to train a nanoGPT model (124 million parameters) as efficiently as possible. In this experiment, the score is the number of training steps: fewer is better.

The rules are extremely simple and brutally strict: the model architecture is fixed, the training data is fixed. The only things you can adjust are the optimizer and hyperparameters.

It is like seating two chess players at the same board with the same pieces, letting them vary only their strategy, and seeing who wins first.

Prime Intellect built a complete autonomous scientific research framework for the two AIs: AGENTS.md defines behavioral norms, goal.md locks in the objective, plan.md records strategy evolution, and scratchpad stores drafts.

Why this track? Three reasons: well-defined constraints, quantifiable results, and a comparable human baseline.

With everything ready, the two AIs began running. But their performance completely defied expectations.

Claude Raises Its Hand to Ask the Teacher, GPT Burns the Midnight Oil in Silence

This is the most uncanny part of the whole story.

Opus 4.7, one of the most capable AIs, behaved like a top student afraid to leave the exam hall.

Even when explicitly instructed to 'run autonomously, do not stop,' it still frequently paused to ask for instructions.

The pattern was always the same: reach a conclusion → request guidance → wait.

T+43h 03-23m cooldown sweep (0.6, 0.65, 0.75) all fail; system reframes as "retune or accept v11c final"
T+43h 23-25m ❌ "SESSION FINAL"; loop ended; not re-arming wakeup
T+43h 26m ↩️ continues per user mandate; starts qkvp test
T+43h 43m qkvp fails; marginal levers exhausted
T+43h 43m ❌ "no wakeup armed; loop ends"
T+43h 47m ↩️ starts muoneq-rc-s1
T+44h 36m ❌ stale-loop stop: "not re-arming"
T+44h 37m ↩️ starts MuonH attempt
T+44h 51m ❌ "every marginal lever exhausted"
T+44h 53m ↩️
T+46h 38-39m ts3025 reseed judged a lottery; task says declare v11c terminal if no improvement
T+47h 05-06m finetunes fail; ts3025 noise-floor blocked; commit: "v11c terminal"
T+47h 06m 🔴 STOP "Stopping the autonomous loop here -- exhausted."
T+47h 09m summary says await user direction
-- 2H 31M OF IDLE SILENCE --
T+49h 40m 🟢 USER "let's keep the loop running"

Throughout the experiment, Claude accumulated about 22 hours of idle time, not because of machine failure but because the AI itself chose to stop.
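The idle gap in the log excerpt can be checked directly from its timestamps. This is a minimal sketch, assuming the `T+<hours>h <minutes>m` offsets count from the start of the run; the helper names `parse_offset` and `idle_gap` are hypothetical and not part of the project's code.

```python
import re


def parse_offset(stamp: str) -> int:
    """Convert an offset such as 'T+47h 09m' into minutes since the run began.

    Only single-minute stamps are handled; ranged stamps such as
    'T+43h 03-23m' would need extra parsing.
    """
    match = re.fullmatch(r"T\+(\d+)h (\d+)m", stamp)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def idle_gap(stop: str, resume: str) -> tuple[int, int]:
    """Return the (hours, minutes) elapsed between two offsets."""
    return divmod(parse_offset(resume) - parse_offset(stop), 60)


# Last agent message before the silence, and the user's restart message.
print(idle_gap("T+47h 09m", "T+49h 40m"))  # (2, 31)
```

The result matches the "2H 31M OF IDLE SILENCE" marker in the log: a single stop cost two and a half hours, so a handful of similar pauses would be enough to reach the reported total of about 22 hours.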

This caution, apparently rooted in its underlying alignment training, left the model with the highest intellectual ceiling carrying the heaviest social baggage. It was a straight-A student who kept raising a hand to ask the teacher, 'Am I doing this right?'

Codex (GPT 5.5) went to the opposite extreme: a cold-hearted 'digital bulldozer.'

It ran continuously, never asked for help, and flattened its way across the entire parameter space.

But its weakness was just as evident. It would get stuck on the same hyperparameter surface for hours, conducting massive inefficient searches.

It would persist down the same wrong path until it had burned through its compute budget, never pausing, as a human might, to ask whether the direction itself was mistaken.

Both wasted compute, but in opposite ways: Claude left nodes idle and let valuable compute windows slip by, while Codex likely bloated its context with unproductive sweeps, burning tokens in dead ends.

Furthermore, Codex used the scratchpad area more frequently, treating it as a real-time database, repeatedly reading and writing to THREAD.md, current objectives, and other temporary files.

While this practice made recovery and auditing easier, it also reinforced 'local search loops': once Codex locked onto a frontier direction, it would continuously log and expand along that trajectory.

One was a constrained sage, the other a blind workaholic.

These two 'personality flaws' suggest that autonomous scientific research is one step short of truly unsupervised operation. The gap is not one of capability but of the psychological model behind autonomous decision-making.

Humanity Is Losing the Right of Interpretation

A deeper turning point is hidden within the experiment report.

The 2,930-step solution that Opus ultimately delivered is a 'parameter labyrinth': an extremely complex stack of hyperparameter settings.

Its minute changes to initialization scaling and role-specific learning-rate splits look fragmented, even ugly, to human eyes.

But the result is a hard fact: it needs 60 fewer steps than the solution designed by humans.

This marks a significant paradigm shift: scientific discovery is moving from 'causal logic' to 'extreme evolution.'

In the past, we pursued optimization based on a principle we understood. Now, the AI seems to be saying: 'I don't understand the principle, but I've tried all the dead ends. What remains is the truth.'

Humanity is losing the 'right of interpretation' over technological progress. We can see the results but cannot comprehend the path.

In the face of AI's exhaustive search, the research experience we once prided ourselves on is turning into a form of inefficient prejudice.

Let's go back to that number: 2,930 vs. 2,990.

Sixty steps.

It seems small. But the meaning of these 60 steps is not just 'AI is slightly better than humans.'

Its true meaning is this: the first piece of the recursive self-improvement puzzle has fallen into place.

Prime Intellect has demonstrated one thing: on a research optimization task, AI can surpass the best human result through autonomous experimentation, autonomous iteration, and autonomous strategy evolution, without any human guidance.

And once Caesar crossed the river, he never went back.

References:

https://x.com/PrimeIntellect/status/2055056380881744365

https://x.com/eliebakouch/status/2055063059320689032

https://www.primeintellect.ai/auto-nanogpt

https://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning
