PhD Spends 80 Hours on Code, Codex Nails It in 2 Hours — The Research Singularity Is Here

New Wisdom Report

Editor: Aeneas, Ding Hui

【New Wisdom Insight】An agentic AI engineer recently reported that a research task estimated to take a PhD student roughly 80 hours was completed by Codex in under 2 hours, a roughly 40x efficiency gain. By the old standards, AGI has already arrived; the industry just keeps moving the goalposts.

The 'singularity' in scientific research is arriving much sooner than anyone anticipated.

A recent experiment with Codex's 'Goal Mode' has shocked academia: on one task, Codex amplified AI research efficiency by roughly 40 times.

Agentic AI engineer Dan McAteer recently disclosed an experiment on X, running a mechanistic interpretability research task using OpenAI Codex's Goal Mode.

GPT-5.5 itself estimated that a PhD student would need about 80 hours to complete this task. In practice, the AI finished the entire job in just 1 hour and 56 minutes.

On paper, that is an efficiency boost of roughly 40 times!
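The headline ratio can be checked with a few lines of Python. This is only a sanity check on the two figures quoted above (an 80-hour estimate and a run of 1 hour 56 minutes); the variable names are illustrative, and the 80-hour figure is the model's own estimate rather than a measured human baseline.

```python
from datetime import timedelta

# Figures quoted in the article. The 80-hour number is GPT-5.5's own
# estimate, not a measured human baseline.
estimated_human_time = timedelta(hours=80)
measured_agent_time = timedelta(hours=1, minutes=56)

# Dividing one timedelta by another yields a plain float ratio.
speedup = estimated_human_time / measured_agent_time

print(f"Speedup: {speedup:.1f}x")  # prints "Speedup: 41.4x"
```

So the "40x" in the headline is a slight rounding down of about 41.4x, and it is only as reliable as the 80-hour estimate behind it.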

He used a built-in Codex skill called /goal.

McAteer believes that the combination of /goal, GPT-5.5 high, and fast mode is the most efficient AI agent configuration available today. Essentially, you let the model set its own goals; the key is that the prompts it writes are likely far better than yours.

This is no longer a simple 'efficiency upgrade'; it is a leap to a higher level of capability. When the research cycle shrinks from weeks to hours, and when AI begins writing its own experimental goals autonomously (/goal), we must confront a brutal reality: the steep part of the 'intelligence explosion' curve is here, and AI's self-iteration speed is slipping beyond human control!

What Exactly is Codex /goal Mode?

Let's first look at how this experiment was conducted. It was run by Dan McAteer, an agentic AI engineer and former Amp Code engineer who regularly shares practical AI agent engineering insights on X.

His experimental setup was simple:

Tool: OpenAI Codex /goal command
Model: GPT-5.5 high
Mode: fast mode
Task: A research task in the field of Mechanistic Interpretability

He described this configuration as the most efficient AI agent configuration currently available.

Why Codex /goal Matters

What's truly significant here is the Codex /goal mode itself.

According to OpenAI Codex engineer Philip Corey, /goal is their implementation of a Ralph loop: a goal persists across multiple dialogue turns, and the agent does not stop until the goal is achieved. Simply put, in a standard Codex call you give one instruction, it takes one step, and it replies. With Codex /goal, you state a goal, and it breaks the goal down into sub-tasks, executes them, reviews its own work, and continues independently until completion or failure. This is an engineering shift from conversational AI to goal-driven AI.
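To make the contrast with a one-shot call concrete, here is a minimal sketch of a goal-persisting loop in Python. It is not Codex's implementation: the names `GoalRun`, `run_goal`, `toy_step`, and `toy_review` are hypothetical, and the "model call" and "self-review" are toy stand-ins. The only point it illustrates is that the goal and history live outside any single turn, and the loop ends on success or when a turn budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GoalRun:
    """State carried across turns, so the goal outlives any single reply."""
    goal: str
    history: List[str] = field(default_factory=list)
    done: bool = False


def run_goal(
    goal: str,
    step: Callable[[GoalRun], str],
    is_achieved: Callable[[GoalRun], bool],
    max_turns: int = 20,
) -> GoalRun:
    """Hypothetical loop: keep stepping until the goal is met or turns run out."""
    run = GoalRun(goal=goal)
    for _ in range(max_turns):
        run.history.append(step(run))  # execute the next sub-task
        if is_achieved(run):           # review progress against the goal
            run.done = True
            break
    return run


# Toy stand-ins for a model call and a review step (both hypothetical).
def toy_step(run: GoalRun) -> str:
    return f"sub-task {len(run.history) + 1} for: {run.goal}"


def toy_review(run: GoalRun) -> bool:
    return len(run.history) >= 3  # pretend three sub-tasks complete the goal


result = run_goal("reproduce the ablation", toy_step, toy_review)
print(result.done, len(result.history))  # prints "True 3"
```

The `max_turns` budget is the "failure" exit: if the review step never passes, the loop returns with `done` still `False` instead of running forever.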

For research tasks like mechanistic interpretability, /goal mode is a natural fit. The research process itself is a cycle of proposing a hypothesis, designing an experiment, running it, observing results, revising the hypothesis, and experimenting again, which is exactly the kind of loop a self-cycling agent can run. What McAteer's experiment really demonstrates is the usability of Codex /goal mode on cyclical research tasks: it is not about replacing the researcher, but about replacing the repetitive operational parts of a researcher's work.

If this capability can be stabilized, it gives direct leverage on AI research itself. Researchers inside AI labs could one day use AI agents to handle repetitive tasks such as training data preparation, experiment setup, ablation studies, visualization generation, and preliminary result analysis. This is precisely what Anthropic and OpenAI have been talking about repeatedly of late: AI accelerating AI research itself.

PhD: 80 Hours vs. AI: 2 Hours

In a traditional research context, a PhD student's daily life involves reviewing literature, building models, debugging code, validating results, and writing reports. The process is lengthy because the human brain has physical limits when processing complex logic and massive amounts of data. The Codex experiment has shattered this perception.

Under the agent configuration of '/goal + GPT-5.5 High + Fast Mode', AI is no longer a tool that 'follows commands' but an independent researcher that 'produces strategies'. It can understand complex Natural Language Autoencoder (NLA) experimental requirements, decompose tasks autonomously, and finish in under 2 hours work that would take a top human researcher two weeks. This signals the collapse of the barriers to entry in human research: specialized analytical skills that once required years of painstaking study are being modularized by algorithms.

Moreover, the autonomous AI researcher has arrived ahead of schedule. OpenAI previously set a goal of achieving autonomous AI scientific research by the end of 2026. Judging by current experimental progress, 2026 might not be the starting line but the finish line, where humanity fully hands over the research baton.

Recursive Self-Improvement is Emerging

If Codex's 40x speed experiment is a striking case, what is more unsettling is the growing cluster of evidence for 'recursive self-improvement'. On May 7th, as reported by Axios, Anthropic co-founder Jack Clark publicly put a number on it: a greater than 60% probability that AI achieves full recursive self-improvement by the end of 2028.

A research team from Sakana AI and UBC created the Darwin Gödel Machine in 2025, a programming agent that can rewrite its own source code to improve its capabilities.

Paper link: https://arxiv.org/abs/2505.22954

On SWE-bench, it improved its own score from 20.0% to 50.0%, with no human intervention throughout the process.

The same team's AI Scientist project was published in Nature this March. It can independently generate research ideas, write code to run experiments, compose complete papers, and perform peer reviews. An entire research pipeline, from start to finish, completed autonomously by AI.

Let's look at some hard data. GPQA Diamond is a science Q&A benchmark written by PhD-level experts. In November 2023, GPT-4 scored 39%, while human domain experts average about 65%. By April 2026, frontier models had collectively crossed that line: Gemini 3.1 Pro scored 94.3% and Claude Opus 4.7 scored 94.2%. All frontier models have far surpassed human PhD experts.

The trajectory on SWE-bench illustrates the acceleration even more clearly.

At the end of 2023, Claude 2's pass rate was 2%. Now the figure is 93.9%: a climb of more than 90 percentage points in two and a half years. Plot this curve, and anyone who has studied high school math will recognize its shape.

Clearly, the process of Recursive Self-Improvement (RSI) has already begun. Once AI starts using this 40x efficiency to rewrite its own underlying code and optimize its own architecture, the growth of intelligence will no longer be linear but vertical.

AGI Has Been Delivered; The Entire Industry Is Gaslighting You

In fact, as early as this February, four scholars from different fields co-authored what could be called the 'most disturbing paper of the year': 'The Case for AGI: Today's LLMs Meet the Bar.'

The four authors represent four pillars of the contemporary study of intelligence: philosophy, machine learning, linguistics, and cognitive science. They reached a chillingly unanimous conclusion: by the definitions in use before 2022, AGI has already been achieved. The reason no one admits it yet is that the entire AI industry is collectively 'gaslighting' the public. The paper also points out that humans exhibit a strong 'psychological defense mechanism' when facing the rise of AI.

Before 2022, any system that could pass the Turing Test and handle cross-domain tasks counted as AGI. After ChatGPT emerged, the line became: 'That isn't enough; it needs perfect reasoning, embodiment (a physical form), and self-awareness.' Every time a model clears a checkpoint, humans improvise new, nebulous metrics as thresholds, constantly moving the goalposts.

The problem is that if AGI already exists, the current industry logic becomes profoundly absurd. OpenAI is still raising $40 billion on the promise to 'build AGI'; Anthropic packages every new model release as a futures contract on 'approaching AGI'. The paper's sharp conclusion: the giants are disguising something they have 'already sold you' as a miracle 'about to be developed', in exchange for an endless stream of funding and power.

The Eve of the Intelligence Explosion

We are now at an extremely strange juncture. In the lab, AI is already conducting mechanistic interpretability research at 40x speed, and even helping to write its own code. In the market, computing power remains the hard currency; NVIDIA's Blackwell chips are being frantically snatched up, each one hastening the arrival of the singularity. Yet in the public mind, people are still comforting themselves with outdated terms like 'stochastic parrot' and 'probability prediction'.

If 40x research efficiency becomes the norm, AI might need only a few months to double the total body of knowledge that human civilization has accumulated over thousands of years. When AI can independently complete PhD-level tasks, our existing education systems, professional title evaluations, and even the meaning of the word 'expert' will face an existential crisis. Just as Copernicus removed the Earth from the center of the universe, AI is now removing humans from their sanctuary as the 'sole intelligent life'.

The intelligence explosion is a war without gunsmoke. We must either learn to coexist with this new species of intelligence, or watch helplessly as it leaves us in the dust at 40x speed.

References:

https://x.com/daniel_mac8/status/2054192370049241203
