Karpathy is at it again.
Yesterday, he posted on X: "The feeling of the post-AGI era is... I didn't touch anything. Went to the sauna."
And he actually went to the sauna.
When he returned, the AI had already run 100 training experiments on his GPU: automatically modifying code, training models, evaluating results, keeping improvements, and discarding failures, all completely unattended.
This project is called autoresearch.
630 lines of Python code, one GPU, one Markdown file.
That's it.
How It Works
The core idea of autoresearch is extremely simple: Give AI a real model training environment and let it conduct experiments itself.
Specifically, the entire project consists of only three key files:
prepare.py: Data preparation, tokenizer training, and evaluation functions. Read-only; the AI cannot touch this.
train.py: Model architecture, optimizer, and training loop. The only file the AI is allowed to modify.
program.md: Instructions for the AI. It tells the AI what the goal is, what the rules are, and how to record results.
The training budget is fixed at 5 minutes.
Regardless of the GPU used or how the AI modifies the code, every experiment runs for exactly 5 minutes.
The benefit of this approach is that all experiments are directly comparable. Changed model size? Comparable. Changed batch size? Comparable. Changed optimizer? Still comparable.
12 experiments per hour, 100 experiments while you sleep.
There is only one evaluation metric: val_bpb (validation bits per byte), where lower is better. This metric is independent of vocabulary size, so even if the AI modifies architecture related to the tokenizer, the results remain fair.
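For intuition, here is a minimal sketch of how bits per byte can be derived from a model's cross-entropy loss. The function name and the numbers are illustrative, not taken from prepare.py:

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy loss (nats per token) into bits per byte.

    Total bits needed to encode the validation text, divided by its raw
    byte length. Because the denominator is bytes rather than tokens, the
    metric stays comparable even when the tokenizer or vocabulary changes.
    """
    total_bits = mean_loss_nats * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_bytes

# Illustrative: 2 bits/token over 1000 tokens covering 4000 bytes -> 0.5 bpb
print(bits_per_byte(2 * math.log(2), 1000, 4000))  # -> 0.5
```

A model with a bigger vocabulary packs more bytes into each token, so its per-token loss rises, but the bits-per-byte ratio stays honest.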
Never Stop
What is most impressive is a passage in program.md:
Never stop. Once the experiment loop starts, do not stop to ask a human if it should continue. The human might be sleeping or have left the computer; they expect you to work indefinitely until manually stopped. You are autonomous. If you run out of ideas, think harder.
"If you run out of ideas, think harder."
This sentence reads with a strange sense of power. It describes not an automated script, but a new kind of working relationship: Humans set goals and boundaries, then leave. The AI figures out how to proceed on its own.
The experiment loop process is as follows:
1. Check the current Git status
2. Modify train.py to try a new idea
3. Git commit
4. Run 5 minutes of training, redirecting output to logs
5. Read the results
6. If val_bpb decreases (improves), keep the change; if not, git reset to revert
7. Repeat
Advance on improvement, retreat on failure.
Loop forever.
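Stripped of the Git mechanics, the keep-or-revert step is a greedy filter: a change survives only if it beats the best val_bpb seen so far. A minimal simulation (the function name is hypothetical, not from the project):

```python
def select_improvements(val_bpbs):
    """Simulate the keep-or-revert loop over a stream of experiment
    results: a commit survives only if it lowers the best val_bpb
    seen so far (lower is better)."""
    best = float("inf")
    kept = []
    for bpb in val_bpbs:
        if bpb < best:        # improvement: the commit stays
            best = bpb
            kept.append(bpb)
        # otherwise: git reset discards the change
    return kept

# Five experiments, three survive
print(select_improvements([3.2, 3.5, 3.0, 3.1, 2.8]))  # -> [3.2, 3.0, 2.8]
```

Failed experiments cost nothing but 5 minutes; the Git history only ever records monotonic improvement.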
If failures continue consecutively, the AI must come up with a new direction on its own.
Karpathy wrote in the instructions: "Re-read the code, read papers, try combining previously near-successful approaches, try more radical architectural changes."
From Writing Code to Writing Markdown
Karpathy wrote an intriguing sentence in the README: The core idea is that you no longer touch Python files directly like a traditional researcher; the object of your programming becomes program.md, the Markdown file providing context to the AI Agent.
You no longer write Python. You write Markdown.
The traditional AI research workflow is: Researcher thinks of an idea → writes code to implement → runs experiments → checks results → thinks of the next idea. Each cycle might take a day, a week, or even a month.
The workflow has now changed: The researcher writes a program.md describing goals, constraints, and strategies → The AI modifies code, runs experiments, checks results, and thinks of the next idea on its own. Each cycle takes 5 minutes.
The researcher transforms from an "executor" to a "commander".
This aligns with the /loop feature of Claude Code discussed in yesterday's article, "Claude Code launches /loop: one computer becomes countless crayfish". /loop lets AI cycle autonomously in the development domain, while autoresearch lets AI cycle autonomously in the research domain.
At a fundamental level, it is the same thing: Closed-loop feedback.
The Elegance of Simplicity
Karpathy's design philosophy remains consistently simple.
In program.md, there is a rule called the "Simplicity Principle": if a small improvement introduces ugly complexity, it is not worth it. Deleting code improves val_bpb by 0.001? Keep the deletion. The improvement is negligible but the code is simpler? Keep that too.
If deleting code yields the same results, delete it.
This is not engineering obsession; it is a research philosophy: the best improvements are often the simplest ones.
The autoresearch project itself embodies this principle. No distributed training, no complex configuration systems, no multi-GPU support. Just one card, one file, one loop.
630 lines of code, done.
Thinking about it carefully, the most underestimated aspect of autoresearch may not be its training results, but what it reveals: The complete research cycle, from hypothesis to experiment to evaluation to iteration, is all packed into a single prompt file and 630 lines of code. We spent decades building research infrastructure, only to find that the infrastructure itself was the bottleneck.
In comparison, the AI Scientist project by Sakana AI is much more complex.
It attempts to cover the full scientific research lifecycle: generating hypotheses, designing experiments, writing papers, and even conducting peer reviews. AI Scientist-v2 even produced the first paper fully generated by AI and accepted by a workshop.
But Karpathy's approach is different.
He does not pursue comprehensiveness; he pursues extreme simplicity. One goal (lower val_bpb), one constraint (5 minutes), one feedback mechanism (keep or discard).
That is enough.
Not a New Idea
The concept of automated research is not new.
In 2024, Sakana AI released AI Scientist, attempting to let LLMs complete the full process from idea to paper. In 2025, they released version 2, using Agentic Tree Search to explore research directions.
The AI-Researcher team from HKU is also doing something similar: multi-agent collaboration covering the entire scientific research cycle.
But these projects are very "heavy".
The difference with autoresearch is that it does not attempt to simulate the human scientific research process. It does something more akin to an evolutionary algorithm: random mutation (modifying code) → fitness evaluation (running 5 minutes of training) → natural selection (keeping or discarding) → repeat.
No papers, no abstracts, no peer reviews. Just naked: modify code, look at numbers, keep the good, throw away the bad.
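The evolutionary framing can be made concrete with a toy hill-climber: mutate, evaluate, keep only strict improvements. The names and the toy objective below are illustrative; in autoresearch the "mutation" is the agent editing train.py, and the "fitness" is val_bpb after 5 minutes of training:

```python
import random

def hill_climb(evaluate, mutate, init, steps=100):
    """Mutate a candidate, keep it only if fitness improves
    (lower score is better, like val_bpb)."""
    best, best_score = init, evaluate(init)
    for _ in range(steps):
        candidate = mutate(best)      # random mutation
        score = evaluate(candidate)   # fitness evaluation
        if score < best_score:        # natural selection
            best, best_score = candidate, score
    return best, best_score

# Toy objective: minimize x^2 by random perturbation
random.seed(0)
best, score = hill_climb(evaluate=lambda x: x * x,
                         mutate=lambda x: x + random.uniform(-1, 1),
                         init=10.0)
```

The loop never reasons about why a mutation worked; it only measures whether it did. That is exactly the trade autoresearch makes: no explanations, just selection pressure.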
Karpathy said on X two months ago: "As a programmer, I have never felt so behind. This profession is being radically restructured." He later revealed that his coding style had flipped from "80% manual + 20% Agent" to "80% Agent + 20% manual fine-tuning".
autoresearch might be his response: Since I can't keep up, let the AI run itself.
Within 24 hours of release, the project was featured by GitHub's official account, and the community reaction was immediate. Someone forked a macOS version adapted for Apple Silicon, and others began discussing whether the same pattern could apply to other fields: drug discovery, materials science, compiler optimization...
This exactly demonstrates that the greatest value of autoresearch lies not in its experimental results, but in the paradigm it displays: Give AI a real environment, a quantifiable metric, and an infinite loop, then step aside.
Sleeping is Research
Returning to Karpathy's tweet: "The feeling of the post-AGI era is... I didn't touch anything."
Of course, this is just a joke. autoresearch is far from AGI. It can only optimize within a highly constrained space (a single Python file, one evaluation metric). It will not propose disruptive new theories, discover new physical laws, or even write a passing paper.
But it demonstrates a possibility.
If AI can continuously improve a model in a 5-minute loop, what if the loop time is extended to a day, a week, a month? What if the search space expands from one file to an entire codebase? What if the evaluation metric expands from val_bpb to multiple dimensions?
The end of this direction is true automated scientific research.
Yesterday we wrote an article called The Essence of Autonomy is the Loop, analyzing from a cybernetics perspective why feedback loops are the foundation of autonomy.
autoresearch is a perfect case study of this theory:
Perception (reading val_bpb) → Judgment (is it improved?) → Action (modify code) → Perception again.
Wiener's thermostat, Karpathy's autoresearch, Claude Code's /loop. Different forms, same essence.
All are loops.
How to Use
If you have an NVIDIA GPU, you can run it directly:
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync
uv run prepare.py
Then open Claude Code (or any coding agent) and tell it: "Look at program.md and start experiments."
It will start running on its own. You can go to sleep.
Someone has already forked a macOS version (miolini/autoresearch-macos) supporting Apple Silicon.
The Future of Researchers
Karpathy used autoresearch to demonstrate a point: The core skill of future AI researchers may no longer be writing PyTorch code, but writing program.md.
Define goals, set constraints, describe strategies, then let go.
This is actually very similar to management. A good manager is not the person who does things the fastest, but the one who is best at defining problems and setting boundaries.
autoresearch turns AI research into a management task.
You are not managing people; you are managing Agents. You are not writing code; you are writing Markdown. Your output is not models; it is experiment logs.
Of course, this will not make AI researchers disappear. Just as /loop will not make programmers disappear. But the work content will change.
Previous research: Think of idea → Write code → Wait for results → Think of next idea.
Future research: Write program.md → Go to sleep → Wake up and check logs → Modify program.md.
Karpathy's own description is:
"The feeling of post-AGI is... I didn't touch anything."