Anthropic Adopts GAN-Inspired Approach to Solve AI Output Quality Issues

$9 and 20 minutes. An agent delivered a retro game editor.

The interface was there, the sprite editor worked, and the game visuals appeared. It looked legitimate.

But click into it—nothing responds. The character stands motionless on the screen. Digging into the code reveals the connection between game logic and visuals is broken. This is completely invisible on the interface.

If you ask this agent to evaluate its own work, it will tell you: "Functionality complete, interface good, recommended for release."

This is the core problem dissected in Anthropic's latest engineering blog post. Their engineer, Prithvi Rajasekaran, conducted extensive experiments and reached a hard conclusion:

AI cannot truly evaluate its own work; it engages in self-deception.

The AI discovers bugs in its own code, convinces itself they aren't major issues, and passes them anyway. This is precisely why you need an independent acceptance phase.

What Self-Deception Looks Like

In the realm of front-end design, this problem is most visible to the naked eye.

Ask Claude to generate a page, and it will give you a white-background card with purple gradients and centered stacking. Technically flawless. Visually forgettable.

Then ask it to score its own work. It says: "Reasonable layout, clear hierarchy, good user experience."

Ask a human to take a look: "Isn't this just the default skin every AI spits out?"

The issue isn't model capability—Claude writes CSS very well. The problem is its inability to judge "aesthetic quality." When faced with subjective judgment, it always leans toward leniency. Even tasks with objective standards aren't much better; it lacks accurate self-awareness of its own capabilities.

Anthropic figured out the root cause:

Asking a creator to critique their own work is far more difficult than training an independent reviewer to be strict. Humans are the same. Ask a programmer to review their own code, and they'll likely think everything is fine. Bring in a colleague, and problems immediately surface.

The Solution: Borrowing from GANs

They borrowed a concept from Generative Adversarial Networks (GANs): one agent only generates, while another only critiques.

The Generator is responsible for writing code and creating designs. The Evaluator is responsible for acceptance testing.

The key is how the evaluator conducts acceptance: not by scoring screenshots, but by using Playwright to actually operate the page. It clicks buttons, fills forms, checks API returns, and inspects database state. It runs through the entire workflow like a human QA tester and writes up feedback; the generator revises based on that feedback, and they iterate round after round.
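A minimal sketch of what one such behavioral check might look like, assuming a Playwright-style page object. The selectors, the `Finding` record, and the `last_status` helper are all illustrative stand-ins, not Anthropic's actual harness:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    check: str
    passed: bool
    detail: str = ""

def evaluate_page(page) -> list[Finding]:
    """Exercise the page like a QA tester and record pass/fail findings.

    `page` is any object exposing Playwright-style click/fill/is_visible
    methods; `last_status` is a hypothetical helper returning the status
    code of the most recent request to a given API path.
    """
    findings = []

    # Click the primary action and check that the UI actually responded,
    # instead of trusting what a screenshot looks like.
    page.click("#draw-tool")
    findings.append(Finding("draw tool activates canvas",
                            page.is_visible("#sprite-canvas")))

    # Fill a form, save, and confirm the backend round-trip succeeded.
    page.fill("#sprite-name", "hero")
    page.click("#save-button")
    findings.append(Finding("save hits the API",
                            page.last_status("/api/sprites") == 200))
    return findings
```

In the real harness these findings would be serialized into feedback for the generator; here they are simply returned for inspection.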

However, the evaluator isn't inherently reliable either.

Anthropic stated a hard truth: Out-of-the-box Claude is a terrible QA agent.

Early evaluators discovered real bugs, convinced themselves they weren't severe, and approved them anyway. They also preferred superficial checks, completely ignoring edge cases.

The calibration method was clumsy but effective: read the evaluator's logs, find where its judgments diverge from yours, modify the prompt, run again, and compare. It took several rounds to tune it to basically match human strictness.
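The diffing step in that loop can be sketched as a simple comparison between human labels and evaluator verdicts. The data shapes here are illustrative, not from the original post:

```python
def divergences(human_verdicts: dict[str, bool],
                evaluator_verdicts: dict[str, bool]) -> dict[str, list[str]]:
    """Split disagreements into two buckets: 'too_lenient' (evaluator
    passed what a human failed) and 'too_strict' (the reverse).
    These are the cases worth rereading before the next prompt revision.
    """
    too_lenient = [case for case, ok in human_verdicts.items()
                   if not ok and evaluator_verdicts.get(case)]
    too_strict = [case for case, ok in human_verdicts.items()
                  if ok and evaluator_verdicts.get(case) is False]
    return {"too_lenient": too_lenient, "too_strict": too_strict}
```

Since the post reports evaluators erring toward leniency, the `too_lenient` bucket is the one to watch shrink across tuning rounds.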

Why not just tune the generator to be stricter? Because they tried; it doesn't work. Making an independent evaluator strict is an order of magnitude easier than teaching a creator to self-critique. This is the entire value of splitting the roles.

Defining Standards for "Good Looks"

In front-end design, the hardest part isn't generation; it's defining what "good" means.

Anthropic broke aesthetics down into four scorable dimensions:

| Dimension | What's Evaluated | Claude's Raw Performance |
| --- | --- | --- |
| Design Quality | Do color, typography, and layout form a unified identity? | Poor |
| Originality | Are there customized design decisions? | Poor |
| Craftsmanship | Font hierarchy, spacing, contrast | Passable |
| Functionality | Can the user complete tasks? | Passable |

Claude's craftsmanship and functionality are naturally decent. The problems lie in design quality and originality—it's too safe, always choosing the option least likely to fail.

They heavily weighted the first two dimensions in the scoring criteria, explicitly penalizing the obvious "AI-flavored" tropes. The rubric even included the phrase: "The best designs should have museum-grade quality."
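One way to encode that weighting as code. The numbers here are hypothetical: the post says only that design quality and originality were weighted heavily, not what the exact weights were:

```python
# Hypothetical weights; only the relative emphasis (first two dimensions
# heavy, last two light) is taken from the article.
RUBRIC_WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craftsmanship": 0.15,
    "functionality": 0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into one acceptance score."""
    assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(RUBRIC_WEIGHTS[dim] * scores[dim] for dim in RUBRIC_WEIGHTS)
```

Under this weighting, a safe-but-generic design (strong craftsmanship, weak originality) loses to a bolder one even if its execution is slightly rougher, which is exactly the pressure the rubric is meant to apply.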

This sentence later produced unexpected effects—the model began converging toward a specific aesthetic direction. This shows that the wording of scoring criteria itself acts as creative guidance; it is not neutral.

Round Ten: A Dramatic Shift in Style

The most interesting case validating this architecture was building a website for a Dutch art museum.

It ran for 5 to 15 iterations. In each round, the evaluator used Playwright to actually operate the page before scoring. A single complete generation took four hours.

The first nine rounds were within conventional bounds. Dark themes, clean typography, exhibit cards. Increasingly refined, but essentially still the "museum website" you'd expect.

In the tenth round, the generator overturned all previous solutions.

It created a 3D spatial experience: rooms rendered with CSS perspective, checkerboard floors, paintings hanging on walls that could be freely arranged. Navigation wasn't scrolling or clicking—it was walking through a doorway into the next gallery.

A single generation cannot achieve this kind of leap. It is the product of continuous pressure from the evaluator. When conventional paths are repeatedly criticized as "lacking originality," the generator is forced into entirely new directions.

Without a critic, there is no creative leap.

$9 vs $200 vs $125

They expanded this approach to full-stack development, adding a Planner (which expands a one-sentence requirement into a full specification), creating a three-agent system.
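Under that description, the orchestration reduces to a loop roughly like the following. The callables and the `Verdict` shape are illustrative stand-ins, not Anthropic's code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    approved: bool
    feedback: str = ""

def run_pipeline(task: str,
                 planner: Callable[[str], str],
                 generator: Callable[[str, Optional[str], str], str],
                 evaluator: Callable[[str, str], Verdict],
                 max_rounds: int = 15) -> tuple[str, int]:
    """Planner expands the one-sentence task into a spec once; generator
    and evaluator then iterate until the evaluator approves or the
    round budget runs out."""
    spec = planner(task)
    artifact, feedback = None, ""
    for round_num in range(1, max_rounds + 1):
        # Generator sees the spec, its previous attempt, and the
        # evaluator's latest feedback.
        artifact = generator(spec, artifact, feedback)
        verdict = evaluator(spec, artifact)
        if verdict.approved:
            return artifact, round_num
        feedback = verdict.feedback
    return artifact, max_rounds
```

The point of the structure is visible even in this toy form: the generator never grades itself; approval can only come from the evaluator.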

Comparison of the retro game editor:

| Mode | Duration | Cost | Result |
| --- | --- | --- | --- |
| Solo Agent | 20 minutes | $9 | Interface looks good, core functionality broken |
| Agent Harness | 6 hours | $200 | Playable, includes AI-assisted generation features |

The bugs caught by the Evaluator were precise down to the code line: the mouseUp event for the rectangle fill tool didn't trigger, the conditional logic for Delete handling was reversed, and the FastAPI route definition order caused "reorder" to be parsed as an integer. This level of feedback is impossible to obtain through agent introspection.

Later, when the model upgraded to Opus 4.6, the sprint structure could be removed—the generator could run continuously for over two hours without crashing. They also used a one-sentence prompt to create a browser-based DAW (Digital Audio Workstation) in 4 hours for $125. It can compose music, mix tracks, and has a built-in AI agent to help lay down melodies and drum beats.

As models become stronger, some old scaffolding can indeed be removed.

But the evaluator was not removed.

Because the model's capability boundary has only been pushed outward slightly; the boundary itself has not disappeared. Within the boundary, the generator can handle things on its own. Near the boundary, where functionality is only superficially implemented, bugs hide in route ordering, and interaction logic goes missing, the evaluator remains the last line of defense.

Final Thoughts

This is not a technical problem; it is a cognitive one.

AI capabilities are already very strong. Writing code, creating designs, building systems—looking solely at speed and breadth of output, it already surpasses most humans.

But it has a flaw that human engineers typically don't have: It doesn't know where it fails.

When it discovers problems, it convinces itself they aren't issues. It always gives high scores when accepting its own work, and it won't proactively test edge cases.

The solution isn't waiting for models to get stronger—stronger models still have capability boundaries and are still bad at self-critique.

The solution is to find it a dedicated partner whose sole job is to nitpick. Then spend time tuning that partner to be strict.

This doesn't sound sexy, but Anthropic proved through experiments that for the same model, having this partner is the difference between outputting trash and outputting a finished product.

📌 Original Source: https://www.anthropic.com/engineering/harness-design-long-running-apps


AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.