GPT-5.2 Runs Non-stop for 7 Days, Building a Chrome-Level Browser with 3 Million Lines of Code


New Smart Yuan Report

Editors: Dinghui, Allen

[New Smart Yuan Overview] How long can a large model keep writing code? An hour? A day? Or, like most AI coding tools, does it end the conversation once a task is completed? Michael Truell, CEO of Cursor, decided to conduct an extreme stress test!

Michael Truell let GPT-5.2 inside Cursor run continuously for a full week.

Not an hour, not a day, but 168 hours of continuous coding, day and night, without sleep or rest.

The result?

3 million lines of code. Thousands of files.

AI completely built a brand new browser from scratch.


And it is a Chrome-like browser.

HTML parsing, CSS layout, text rendering, and a self-developed JavaScript virtual machine—all written by the AI itself.

Michael Truell casually tweeted: It basically runs! Simple web pages can be rendered quickly and correctly.


How Long Can a Model Run?

Traditional AI coding tools, like GitHub Copilot and early IDE assistants, operate in a Q&A mode.

Limited conversation length, limited context, limited task complexity.

Later, so-called Agentic programming emerged—tools like Claude Code, Cursor Agent, and Windsurf allow AI to autonomously execute multi-step tasks, read files, run commands, and fix errors.

This is a significant improvement, but in most cases, tasks are still measured in minutes, or perhaps a few hours.

AI completes a function, humans review it, and then move on to the next task.

But no one had tried letting a model run continuously for a week.

Until GPT-5.2.

The Cursor team let GPT-5.2 run for a full week: not in intermittent sessions, but working around the clock.


During this week, it:

Wrote over 3 million lines of code

Created thousands of files

Processed trillions of tokens

Built a complete browser rendering engine from scratch

How long can a model actually run?


The answer is: Theoretically, infinitely.

As long as the infrastructure is stable and the task is clear enough, AI can continue to work—sleepless, tireless, 24/7, all year round.

Like the "cyber labor" of the Australian sheep shearer.

But in reality, the "stamina" of different models varies hugely.

The context window is the first threshold.

Early GPT-3.5 had only a 4K-token context, so it started forgetting as soon as a conversation ran long.

Claude 3 introduced a 200K context, GPT-4 Turbo followed with 128K, and Gemini 1.5 Pro even claims to support 1 million tokens.

But context length is just a theoretical value—the real test is whether the model can maintain consistency, focus, and execution ability in long tasks.


In Cursor's official blog post, the team found key differences in the experiment:

GPT-5.2 can work autonomously for long periods, following instructions precisely, maintaining focus without deviation;

Claude Opus 4.5 tends to end early, take shortcuts, and frequently return control to the user;

GPT-5.1-Codex, although trained specifically for coding, has weaker planning than GPT-5.2 and is prone to breaking off mid-task.

To put it more simply: Opus is like an impatient intern who works for a bit, then asks, "Is this okay? I'm handing it in now."


GPT-5.2, by contrast, is like a seasoned senior engineer who, once the task is clearly laid out, buries their head and works to the end.

This is why Cursor officially claims: GPT-5.2 is the frontier model for handling long-running tasks.

Not just a browser.

Cursor also revealed other experimental projects currently running: Java LSP, a Windows 7 emulator, and an Excel clone.

The numbers are staggering: the AI wrote 550,000, 1.2 million, and 1.6 million lines of code for these, respectively. (Amusingly, the Excel clone ended up with more code than the Windows emulator.)


Multi-Agent System Collaboration

One model wrote 3 million lines of code in a week, and note: it wrote continuously, without human intervention!

Obviously, no single model did this alone. So how was it done?

The Cursor team revealed their secret weapon: Multi-Agent System.


Initially, they tried letting all agents collaborate as equals, synchronizing state through shared files. The result:

Agents held locks too long, or simply forgot to release them. Twenty agents' combined speed dropped to the effective throughput of just two or three.
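The failure mode is easy to reproduce. Below is a minimal Rust sketch (hypothetical names, not Cursor's actual code) of agents serializing on one shared lock: the final count comes out right, but because the lock is held across each whole unit of work, twenty threads deliver roughly the throughput of one.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy reproduction of lock contention: every "agent" grabs one shared
/// lock for its entire unit of work, so all other agents sit idle.
fn run_agents(agents: usize, units_each: usize) -> usize {
    let codebase = Arc::new(Mutex::new(0usize)); // stand-in for shared state
    let handles: Vec<_> = (0..agents)
        .map(|_| {
            let codebase = Arc::clone(&codebase);
            thread::spawn(move || {
                for _ in 0..units_each {
                    // Lock held for the whole work unit: correct,
                    // but execution is effectively serialized.
                    let mut state = codebase.lock().unwrap();
                    *state += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *codebase.lock().unwrap();
    total
}
```

The result is always correct, which is exactly why the problem is easy to miss: nothing crashes, throughput just quietly collapses.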


This is very much like common problems in human teams: too many meetings, high communication costs, and unclear boundaries of responsibility.

The final effective solution was a hierarchical architecture:

Planners: Continuously explore the codebase, create tasks, and make high-level decisions

Workers: Focus on completing specific tasks, unconcerned with the global picture, moving on to the next after submitting

Reviewers: Judge if each iteration is qualified and decide whether to proceed to the next stage

This is almost the organizational structure of a human software company: Product Managers/Architects plan, Programmers execute, and QA reviews.

But the difference is—this involves hundreds or thousands of Agents working simultaneously.

The Cursor team implemented hundreds of Agents working collaboratively on the same codebase for weeks with almost no code conflicts.

This means AI has already learned the collaborative instincts that human teams take years to develop.
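The three roles can be sketched in a few lines of Rust. This is a toy model of the architecture with hypothetical types and names, not Cursor's implementation:

```rust
// Toy model of the planner/worker/reviewer hierarchy (hypothetical names).

#[derive(Debug, Clone)]
struct Task {
    id: u32,
    description: String,
}

#[derive(Debug)]
struct Patch {
    task_id: u32,
    ok: bool,
}

/// Planner: turns a high-level goal into concrete tasks.
fn plan(goal: &str, n: u32) -> Vec<Task> {
    (0..n)
        .map(|i| Task {
            id: i,
            description: format!("{goal}: part {i}"),
        })
        .collect()
}

/// Worker: completes one task in isolation, ignoring the global picture.
fn work(task: &Task) -> Patch {
    Patch {
        task_id: task.id,
        ok: !task.description.is_empty(),
    }
}

/// Reviewer: gates the iteration; only a fully passing batch advances.
fn review(patches: &[Patch]) -> bool {
    patches.iter().all(|p| p.ok)
}

/// One full iteration: plan -> work -> review.
fn run_iteration(goal: &str, n: u32) -> bool {
    let tasks = plan(goal, n);
    let patches: Vec<Patch> = tasks.iter().map(work).collect();
    review(&patches)
}
```

The key design property is that workers never touch shared state directly; coordination lives entirely in the planner and reviewer layers, which is what keeps hundreds of agents from contending for locks.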


The Browser's "Moat" Is Much Deeper Than You Think

If your reaction is "Isn't it just software that displays web pages?", every engineer who has worked on a browser engine would probably smile wryly.

In the hierarchy of computer science, hand-writing a browser engine is second in difficulty only to writing an operating system.

To put 3 million lines of code in perspective, consider Google's Chromium (the open-source project behind Chrome).

As one of the pinnacles of human software engineering, Chromium's codebase long ago passed 35 million lines.

It is not just software; essentially, it is an "operating system disguised as an application".

What exactly is GPT-5.2 challenging?

First is the "Chaos Theory" of CSS.

Web page typography is never just a simple matter of stacking blocks.

The CSS standard is full of various historical quirks, cascade rules, and complex inheritance logic.

A former Firefox engineer once used a metaphor: implementing a perfect CSS engine is like simulating a universe whose physical laws change at will. Modifying one property on a parent element can instantly collapse the layout of thousands of children.
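A tiny taste of why the cascade is tricky: even the grossly simplified specificity rule below (IDs beat classes, classes beat element types, later rules win ties) already requires a total ordering over every matching rule. This Rust sketch ignores most of real CSS (combinators, `!important`, origins) and is purely illustrative:

```rust
/// Grossly simplified CSS specificity: (ids, classes, types),
/// compared lexicographically. Real CSS has far more cases.
fn specificity(selector: &str) -> (u32, u32, u32) {
    let (mut ids, mut classes, mut types) = (0, 0, 0);
    for part in selector.split_whitespace() {
        if part.starts_with('#') {
            ids += 1;
        } else if part.starts_with('.') {
            classes += 1;
        } else {
            types += 1;
        }
    }
    (ids, classes, types)
}

/// The cascade's core rule among matching selectors: highest
/// specificity wins; on a tie, the later rule in source order wins.
fn winning_rule<'a>(rules: &[&'a str]) -> Option<&'a str> {
    rules
        .iter()
        .copied()
        .enumerate()
        .max_by_key(|&(i, s)| (specificity(s), i))
        .map(|(_, s)| s)
}
```

Even this toy already hints at the combinatorics: every element on the page must run this contest for every property, and any DOM mutation can change which rules match.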

Second is "Virtual Machines within Virtual Machines".

This time, the AI not only wrote the interface but also a JS virtual machine.

JavaScript code running on modern web pages needs memory management, garbage collection (GC), and security sandboxes.

Handle any of this slightly poorly, and a web page can eat all your memory, or let an attacker break out of the browser and take over your computer.
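For a sense of what "memory management" means here, below is a toy mark-and-sweep collector in Rust over an index-based arena. It is a deliberately minimal sketch, nothing like a production JS GC (which would be generational and incremental): everything reachable from the roots is marked, everything else is swept.

```rust
/// Toy mark-and-sweep collector over an index-based arena (illustrative only).
struct Heap {
    marked: Vec<bool>,
    children: Vec<Vec<usize>>, // outgoing references per object
    live: Vec<bool>,
}

impl Heap {
    fn new(n: usize) -> Self {
        Heap {
            marked: vec![false; n],
            children: vec![Vec::new(); n],
            live: vec![true; n],
        }
    }

    fn add_ref(&mut self, from: usize, to: usize) {
        self.children[from].push(to);
    }

    /// Mark phase: everything transitively reachable survives.
    fn mark(&mut self, obj: usize) {
        if self.marked[obj] {
            return;
        }
        self.marked[obj] = true;
        let kids = self.children[obj].clone(); // avoid aliasing self while recursing
        for c in kids {
            self.mark(c);
        }
    }

    /// Sweep phase: free every live-but-unmarked object; returns count freed.
    fn collect(&mut self, roots: &[usize]) -> usize {
        for m in self.marked.iter_mut() {
            *m = false;
        }
        for &r in roots {
            self.mark(r);
        }
        let mut freed = 0;
        for i in 0..self.live.len() {
            if self.live[i] && !self.marked[i] {
                self.live[i] = false;
                freed += 1;
            }
        }
        freed
    }
}
```

The hard part in a real VM is everything this sketch omits: when to trigger collection, how to do it without pausing the page, and how to keep a hostile script from exploiting the allocator.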

Most critically, it chose Rust.

The Rust language is known for "uncompromising safety"; its compiler is like an extremely neurotic examiner.

When human engineers write business logic, they often spend half their time "arguing" with the compiler over the borrow checker and lifetimes.

The AI not only has to understand the business logic; it has to satisfy this "examiner" across millions of lines of code.
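The rule being enforced is simple to state and brutal at scale: any number of shared borrows, or exactly one mutable borrow, never both at once. A small illustrative function (not from the project):

```rust
/// Illustration of the aliasing rule the borrow checker enforces:
/// any number of shared `&` borrows XOR one `&mut`, never both.
fn keep_longest(lines: &mut Vec<String>) -> usize {
    // Shared borrow: reading through `iter()` is always allowed.
    let max = lines.iter().map(|l| l.len()).max().unwrap_or(0);

    // While that shared borrow was alive, calling `lines.push(...)`
    // (a mutable borrow) would have been a compile error. This is the
    // "arguing with the examiner" that eats engineers' time.

    // The shared borrow has ended, so mutation is legal again.
    lines.retain(|l| l.len() == max);
    max
}
```

The payoff is that once such code compiles, whole classes of memory bugs (use-after-free, data races) are ruled out, which is presumably why the browser project could grow to millions of lines without collapsing.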

Cracking all of these hard problems within seven days, and making the pieces work together, is no longer just "writing fast"; it means machines are beginning to exercise top-level architectural control.


When AI Can "Endure Loneliness"

But the real bombshell in this news is not the browser itself; it is the word "uninterrupted".

This is the watershed of AI evolution.

Before this, the AI coding tools we were familiar with (like early Copilot) worked like this: you write a function header, it completes five lines of code; you send a command, it generates a script.

Their memory is fragmented, and their attention is short-lived.

Once the task got slightly more complex, such as "refactor this module", they would fix one part while breaking another, leaving humans to clean up the mess.

But this time is different. This is a victory for "Long-term Tasks".

These 3 million lines of code are distributed across thousands of files.

When the AI writes the 3 millionth line, it must still "remember" the architectural rules set in the first line of code;

When the rendering engine and the JS virtual machine conflict, it must be able to trace back through tens of thousands of lines to find the source of the bug.

In those 168 hours, GPT-5.2 certainly wrote bugs.

But it didn't stop to report errors and wait for humans to feed it answers; instead, it read the error logs itself, debugged itself, refactored itself, and then continued moving forward.

This autonomous loop of "Write-Run-Fix" was once the moat we human engineers were most proud of.

Now, this moat has been filled.
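That write-run-fix loop is conceptually simple. Here is a hedged Rust sketch of the control flow; the interface is hypothetical and says nothing about how Cursor actually wires it up:

```rust
/// Sketch of an autonomous write-run-fix loop (hypothetical interface).
fn fix_until_green<F, R>(mut build: F, mut repair: R, max_iters: u32) -> bool
where
    F: FnMut() -> Result<(), String>, // run the build / test suite
    R: FnMut(&str),                   // feed the error log into a fix step
{
    for _ in 0..max_iters {
        match build() {
            Ok(()) => return true,            // green: done
            Err(log) => repair(log.as_str()), // read errors, patch, retry
        }
    }
    false // gave up after the iteration budget
}

/// Demo: a "build" with two injected bugs; each repair removes one.
fn demo() -> bool {
    use std::cell::Cell;
    let bugs = Cell::new(2u32);
    fix_until_green(
        || {
            if bugs.get() == 0 {
                Ok(())
            } else {
                Err(format!("{} errors remain", bugs.get()))
            }
        },
        |_log: &str| bugs.set(bugs.get() - 1),
        10,
    )
}
```

The loop itself is trivial; what was hard until now is the `repair` step, i.e. a model that can read its own error logs and converge instead of thrashing.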

We are witnessing a qualitative change of AI from "chat companion" to "digital laborer".

Before, we commanded AI to do "tasks", like "write a Snake game";

Now, we command AI to do "projects", like "build a browser".


The Spiral of Silence

Although the maturity of this AI browser is still a long way from Chrome, it proves the feasibility of the path.

When computing power can be converted into extremely complex engineering implementation capabilities, the marginal cost of software development will approach zero.

The most shocking part of this experiment is actually not the webpage rendered on the screen, but that progress bar that ran silently in the background for a whole seven days.

It doesn't sleep, doesn't rest; unhurried and unflustered, it builds the cornerstone of the digital world at a speed of thousands of characters per second.

Perhaps we need to re-examine the definition of "creation".

Only when tools start solving problems alone in the middle of the night do we realize that it is no longer just a tool, but our companion.


From the Australian Uncle's "Cyber Labor" to AI Long-term Tasks

The Australian sheep shearer who drove Silicon Valley crazy with 5 lines of code essentially did just one thing: made sure AI wouldn't stop until the goal was reached.


As for what commands were written in Prompt.md, that is not the point.

It is the same with this extreme stress test by Cursor's CEO: the goal is to build a Chrome, a Windows, an Excel, and until the goal is complete, the AI must keep running.

Which brings us back to the initial question:

How long can an AI work on its own?


The physical answer is infinite. As long as you have enough computing power, stable infrastructure, and clear task definitions, AI can run indefinitely.

But more importantly, this changes the economics of software development.

The main cost of traditional software development is manpower and time.

A 10-person team developing a complex project may take six months to several years, with monthly labor costs running from hundreds of thousands to millions.

Now, AI can complete work that originally took months within a week.

The cost may just be some token fees. Emad Mostaque (former CEO of Stability AI) guessed that the Cursor browser project may have consumed about 3 billion tokens.

He went further: how many tokens would it take to rewrite a Windows-scale operating system, and what would it cost?


Tokens keep getting cheaper, like water and electricity before them; eventually, token-based computing power will become extremely cheap too.

Thus the economics of software are being completely upended. Per-license software pricing, for one, will probably disappear.

In 2026, software development is undergoing a genetic-level mutation.

In the past, code was a product typed out by humans line by line.

In the future, code may just be the automatic unfolding of human intent: you describe what you want, and AI can turn it into reality.

How long can a model run?

As long as you need, it can keep running.

References:

https://x.com/mntruell/status/2011562190286045552

https://x.com/leerob/status/2011565729838166269

https://cursor.com/cn/blog/scaling-agents

