When it comes to long, complex tasks like having an Agent autonomously write complex front-end interfaces and full-stack applications, prompt engineering alone quickly hits a ceiling. The most popular or cutting-edge term for this right now is:
Harness
The word "Harness" literally refers to the gear put on a horse that lets humans direct its power and steer it. In this context, it refers to the working environment built around the AI, and how that environment ensures its output is reliable.
Anthropic has just published an engineering blog post revealing their latest breakthroughs in front-end design and long-running autonomous software engineering. To enable Claude to escape mediocre aesthetics and construct complete applications without intervention, engineers drew inspiration from Generative Adversarial Networks (GANs) to design a multi-agent scaffolding architecture for collaborative work.
Blog:
https://www.anthropic.com/engineering/harness-design-long-running-apps
The evolution of this architecture reveals how AI engineers should dynamically adjust their development strategies as model capabilities continuously iterate.
Behind this lies a completely new multi-agent architecture. Let's take a look at how Anthropic practices "Harness" internally.
Two Problems, One Approach
Over the past few months, Anthropic researchers have been pushing in two directions: enabling Claude to generate high-quality front-end designs, and enabling it to build complete applications without human intervention.
These two tasks seem entirely different—the former tests aesthetic judgment, while the latter tests logical correctness—but they hit the same wall.
Relying solely on prompt engineering, performance in both directions reached a bottleneck.
The breakthrough came from a classic AI concept: Generative Adversarial Networks (GANs). Researchers extracted the core structure—a generator plus an evaluator—and transplanted it into the agent system.
AI Self-Evaluation Is a Trap
Before this, there were two persistent failure modes.
The first is context anxiety. As the conversation window fills up, the model gradually loses coherence. Worse still, some models will wrap up tasks prematurely when they approach what they perceive as the context limit, finishing the job hastily.
The solution is context resetting: completely clearing the context window and launching a new agent, while passing the previous agent's state and next steps via structured handover files. This differs from compression—compression summarizes based on the existing conversation, keeping the agent continuous, so context anxiety remains. Resetting provides a clean slate, though the cost is that handover files must be sufficiently complete.
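The handover mechanism can be sketched roughly as follows. The file name and fields here are illustrative, not Anthropic's actual format; the point is that the new agent is seeded from a structured file rather than from the old conversation:

```python
import json
from pathlib import Path

def write_handover(path: Path, state: dict) -> None:
    """Outgoing agent: persist everything the next agent needs (what is
    done, what is in flight, concrete next steps) before its context
    window is discarded."""
    path.write_text(json.dumps(state, indent=2))

def start_fresh_agent(path: Path) -> str:
    """Incoming agent: build the opening prompt for a brand-new session
    (empty context window) from the handover file alone."""
    state = json.loads(path.read_text())
    return (
        "You are continuing a long-running build.\n"
        f"Completed: {state['completed']}\n"
        f"Current state: {state['current_state']}\n"
        f"Next steps: {state['next_steps']}\n"
    )

handover = Path("handover.json")
write_handover(handover, {
    "completed": ["auth endpoints", "user schema"],
    "current_state": "tests passing; payments half-wired",
    "next_steps": ["finish payment webhook", "add retry logic"],
})
prompt = start_fresh_agent(handover)
```

Note the trade-off the post describes: unlike compression, nothing survives the reset except what the outgoing agent wrote down, so an incomplete handover file silently loses work.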
In testing, Claude Sonnet 4.5's context anxiety was so significant that compression alone couldn't solve it, making context resetting a core design element of the architecture. This solved the fundamental problem but introduced additional orchestration complexity, token overhead, and latency.
The second problem is more subtle: AI self-evaluation is unreliable.
When asked to evaluate their own work, agents almost always give confident positive feedback—even when human observers can immediately see the quality is mediocre. This issue is particularly pronounced in subjective tasks like design, where there are no verifiable binary judgment criteria. Whether a layout is polished or feels original rests entirely on subjective perception, and agents are reliably biased toward positivity when scoring their own output.
Even in tasks with verifiable results, agents sometimes exhibit poor judgment.
The solution is separation: split the agent that does the work from the agent that judges it.
It is far easier to prompt a separate evaluator to maintain a skeptical attitude than to make a generator critique its own work. Once external feedback exists, the generator has a concrete target for iteration.
Front-End Design: Making the Subjective Scoreable
Researchers first validated this approach in the field of front-end design.
Without any intervention, Claude tends to generate safe, predictable layouts—technically functional but visually unsurprising.
There are two core insights. First, while aesthetics cannot be fully quantified, they can be improved using scoring criteria encoded with design principles. The question shifts from "Is this design beautiful?" to "Does this design adhere to our design principles?"—the latter gives the model a concrete basis for scoring. Second, separating generation from scoring allows the construction of a feedback loop that drives the generator toward stronger output.
Researchers designed four scoring dimensions for the generator and evaluator:
Design Quality: Does the whole have a sense of cohesion? Can elements like color, typography, layout, and imagery blend to create a unique visual atmosphere and identity?
Originality: Are there customized decisions, or is it relying on templates, library defaults, or repeating AI-generated patterns? Human designers should be able to identify deliberate creative choices. Unmodified off-the-shelf components or typical AI-generated traits—like purple gradients on white cards—lose points directly here.
Craftsmanship: Technical execution level, including typographic hierarchy, spacing consistency, color harmony, and contrast. This is a test of capability, not creativity. Most reasonable implementations pass by default; losing points here means the foundation has failed.
Functionality: Usability independent of aesthetics. Can users understand the interface functions, find main action entry points, and complete tasks without guessing?
Design Quality and Originality are weighted higher than Craftsmanship and Functionality—because Claude already performs decently on craft and function, but often produces mediocre work on design sense and originality. The scoring criteria explicitly penalize highly generic "AI slop" patterns, pushing the model toward bolder aesthetic risks through weighting.
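In code, a rubric like this reduces to a weighted score. The weights below are illustrative, since the post only says the first two dimensions weigh more than the last two:

```python
# Illustrative weights: Design Quality and Originality weigh more than
# Craftsmanship and Functionality, but exact values are not published.
WEIGHTS = {
    "design_quality": 0.3,
    "originality": 0.3,
    "craftsmanship": 0.2,
    "functionality": 0.2,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0-10 scale)."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# A hypothetical mid-run score: strong craft and function, but generic
# design. The weighting ensures the generic look drags the total down.
round_scores = {"design_quality": 6.0, "originality": 5.0,
                "craftsmanship": 8.0, "functionality": 8.0}
overall_score(round_scores)  # 6.5 under these illustrative weights
```

The shape of the rubric does the real work: an unweighted average would let solid craftsmanship mask "AI slop" aesthetics, which is exactly the failure mode the weighting is designed to punish.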
The entire loop is built on the Claude Agent SDK. The generator first creates an HTML/CSS/JS front-end based on user prompts. The evaluator, equipped with Playwright MCP tools, can interact directly with the running page rather than just scoring static screenshots. The evaluator autonomously navigates the page, takes screenshots, studies implementation details carefully, then scores each dimension and writes detailed critiques. This feedback flows back to the generator as input for the next iteration.
Each generation runs for 5 to 15 iterations, with each iteration typically pushing the generator toward more distinctive directions. A full run can take up to four hours.
Researchers also instructed the generator to make a strategic judgment after each evaluation: if scores are trending well, continue refining in the current direction; if the path is a dead end, pivot completely to a different aesthetic.
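Putting the pieces together, the generate-evaluate-pivot loop has roughly this shape. The `toy_generate` and `toy_evaluate` callables are stand-ins of my own; in the real system each would be a Claude Agent SDK session, with the evaluator driving the live page through Playwright MCP:

```python
def run_loop(generate, evaluate, iterations=10, plateau=2):
    """Generator-evaluator loop with a plateau-triggered pivot signal."""
    design, feedback = None, None
    history = []
    for _ in range(iterations):
        design = generate(design, feedback)
        score, feedback = evaluate(design)
        history.append(score)
        # Strategic judgment: if scores have stopped trending up over
        # the last few rounds, tell the generator to change direction.
        if len(history) > plateau and history[-1] <= history[-1 - plateau]:
            feedback += "\nScores have plateaued: pivot to a new aesthetic."
    return design, history

def toy_generate(prev, feedback):
    # Stand-in: a pivot signal produces a bigger jump than normal feedback.
    step = 2 if feedback and "pivot" in feedback else 1
    return (prev or 0) + step

def toy_evaluate(design):
    score = min(design, 8)  # quality saturates eventually
    return score, f"critique for score {score}"

final_design, history = run_loop(toy_generate, toy_evaluate, iterations=12)
```

The plateau check is my reading of "make a strategic judgment after each evaluation"; the post does not specify how the pivot decision is encoded.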
The result: the evaluator's scores continuously improve during iteration until they plateau, though room for improvement remains. Some generation processes involve gradual optimization, while others undergo sharp aesthetic shifts between iterations.
A typical case: researchers asked the model to create a website for a Dutch art museum. By round nine, it produced a clean, dark-themed landing page, visually refined but within expected bounds. In round ten, it completely overhauled the proposal, reimagining the website as a spatial experience: using CSS perspective to render a checkerboard floor, with paintings hanging freely on walls, navigating between galleries by walking through doorways rather than standard scrolling or clicking. This was a creative leap never seen in single-pass generation.
Expanding to Full-Stack Programming
After validating the effects on front-end design, researchers migrated this pattern to full-stack development.
The generator-evaluator loop naturally corresponds to code review and QA phases in the software development lifecycle.
In early long-running architectures, Sonnet 4.5's context anxiety was a core limitation. By Opus 4.5, this behavior largely disappeared on its own, allowing context resetting to be removed from the architecture. The entire build process runs as a continuous session, with the Claude Agent SDK's automatic compression handling context growth.
The final architecture consists of three agents:
Planner: Previous architectures required users to provide detailed specifications beforehand. Researchers wanted to automate this step, having the planner take a 1-to-4 sentence prompt and expand it into a complete product specification. The planner is required to be ambitious in scope, focusing on product context and high-level technical design rather than specific implementation details—because if the planner errs on granular details, errors cascade downstream. Researchers also require the planner to proactively weave AI capabilities into the product specification.
Generator: Adopts the early architecture's approach of implementing one feature at a time, advancing in sprints. Each sprint ends with self-evaluation before handing off to QA. The tech stack includes React, Vite, FastAPI, and SQLite (later upgraded to PostgreSQL), equipped with git for version control.
Evaluator: Early architecture applications looked good but had real bugs in actual use. The evaluator, equipped with Playwright MCP, clicks through the application like a real user, testing UI functionality, API endpoints, and database state. It then scores against the sprint contract and a set of criteria—including product depth, functionality, visual design, and code quality. Each dimension has a hard threshold; if any fall below it, the sprint is deemed a failure, and the generator receives detailed feedback on the issues.
Before each sprint begins, the generator and evaluator first negotiate a "sprint contract": before writing any code, they agree on what "done" means for that piece of work. Product specifications are intentionally kept high-level; this step aims to bridge user stories and testable implementations. The generator proposes what to build and how to verify success; the evaluator reviews the proposal, and the two iterate until consensus is reached. Agents communicate via files: one agent writes a file, the other reads and responds within that file or a new one.
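A minimal sketch of the file-based contract negotiation, with illustrative file names and fields (in the real system the back-and-forth can take several rounds before consensus):

```python
import json
from pathlib import Path

def propose_contract(path: Path, sprint: int, criteria: list[str]) -> None:
    """Generator: before writing any code, record what 'done' means
    for this sprint as concrete, testable criteria."""
    path.write_text(json.dumps({
        "sprint": sprint,
        "criteria": criteria,
        "status": "proposed",
    }, indent=2))

def review_contract(path: Path, extra_criteria: list[str]) -> dict:
    """Evaluator: read the proposal, tighten it with additional
    criteria, and mark the contract agreed."""
    contract = json.loads(path.read_text())
    contract["criteria"].extend(extra_criteria)
    contract["status"] = "agreed"
    path.write_text(json.dumps(contract, indent=2))
    return contract

path = Path("sprint_03_contract.json")
propose_contract(path, 3, ["rectangle fill tool drag-fills an area"])
contract = review_contract(path, ["fill works when drag ends outside canvas"])
```

The agreed criteria then become the checklist the evaluator walks through with Playwright at the end of the sprint, which is what makes its bug reports specific enough to act on.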
Single Agent vs. Full Architecture: How Big Is the Gap?
Researchers used the same prompt—create a 2D retro game-making tool supporting a level editor, sprite editor, entity behaviors, and a playable test mode—to run both a single agent and the full architecture.
The single agent ran for 20 minutes, costing $9. The full architecture ran for 6 hours, costing $200.
The architecture's cost was over 20 times that of the single agent, but the gap in output quality was obvious.
Single agent version: The interface looked passable upon opening, but problems emerged upon careful clicking. The layout wasted space; fixed-height panels left most of the viewport empty. The workflow was rigid with no guiding prompts. The game itself was broken—entities appeared on screen but did not respond to any input. In the code, the connection between entity definitions and the game runtime was severed.
Full architecture version: The planner expanded the one-sentence prompt into a specification of 16 features across 10 sprints, covering sprite animation systems, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export functionality with shareable links—a much broader scope than the single agent version attempted.
The gap was most obvious in play mode: the full architecture version was actually playable—entities could move, and the game ran. Physics were a bit rough, with characters overlapping platforms after jumping, but the core functionality worked, which the single agent version failed to achieve entirely.
The evaluator played a critical role throughout. In each sprint, it iterated through the contract's test criteria, operating the running application via Playwright and logging any deviations from expected behavior as bugs. Sprint 3's contract alone had 27 criteria; the evaluator's findings were specific enough to act upon directly.
For example: The rectangle fill tool should support drag-filling a rectangular area—the evaluator found the tool only placed tiles at the drag start and end points; the fillRectangle function existed but wasn't triggering correctly on mouseUp. Users should be able to select and delete placed entity spawn points—the evaluator found the Delete key handler required both selection and selectedEntityId to be set, but clicking an entity only set selectedEntityId; the condition should be changed to require either one.
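The second bug boils down to a single compound condition. Sketched in Python (the actual frontend is JavaScript, and the names below mirror the ones in the report):

```python
def can_delete(selection, selected_entity_id):
    """Should the Delete key remove the clicked entity spawn point?

    The buggy version required BOTH values to be set:
        return selection is not None and selected_entity_id is not None
    but clicking an entity only populated selected_entity_id, so the
    handler never fired. The evaluator's fix: either one is enough.
    """
    return selection is not None or selected_entity_id is not None
```

It is a one-token fix, but exactly the kind of silent dead code a self-grading generator tends to wave through.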
However, tuning the evaluator to this level was not easy. In early runs, it would find reasonable issues, then convince itself they weren't serious and let them pass. It also tended towards shallow testing rather than probing edge cases, often missing subtler bugs. The tuning process involved: reading the evaluator's logs, finding examples where its judgment differed from researchers', and updating QA prompts to resolve these biases. After several development cycles, the evaluator's scoring reached a level researchers deemed reasonable.
Simplifying the Architecture: As Models Get Stronger, Scaffolding Should Reduce
The first version of the architecture worked well but was bulky, slow, and expensive.
The next step was simplifying the architecture without reducing performance. This also reflects a broader principle: every component in an architecture encodes an assumption that the model cannot complete a certain task independently. These assumptions are worth continuous testing—not only because they might be wrong, but because they become obsolete rapidly as models improve.
Researchers initially aggressively cut the architecture while trying some new ideas, but couldn't reproduce the original performance and struggled to identify which parts were truly critical. They then switched to a more systematic approach: removing one component at a time and observing the impact on final results.
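The one-component-at-a-time ablation is easy to express as a loop; `quality` below is a toy stand-in for a full build-plus-QA run, with made-up numbers:

```python
def ablate(components, run_pipeline):
    """Remove one component at a time and measure the drop relative to
    the full architecture; large drops mark the critical components."""
    baseline = run_pipeline(components)
    impact = {}
    for c in components:
        reduced = [x for x in components if x != c]
        impact[c] = baseline - run_pipeline(reduced)
    return impact

# Toy stand-in: pretend each component contributes a fixed amount of
# final-app quality. A real run is hours of build plus a QA pass.
contrib = {"planner": 3.0, "sprints": 0.5, "evaluator": 2.0}

def quality(comps):
    return sum(contrib[c] for c in comps)

impact = ablate(list(contrib), quality)
# Under this toy model, cutting sprints costs little while cutting the
# planner costs a lot, which matches what the researchers found.
```

The contrast with the initial approach is the point: cutting several things at once conflates their effects, while single-component removal yields an attributable impact per component.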
Meanwhile, Opus 4.6 was released, providing further motivation to reduce architectural complexity. Official notes state: Opus 4.6 plans more carefully, can sustain agent tasks longer, runs more reliably in larger codebases, and has better code review and debugging capabilities to catch errors itself. Long-context retrieval has also improved significantly. These are exactly the capabilities the architecture was originally built to compensate for.
Researchers first removed the sprint structure. Sprints served to break work into chunks to keep the model coherent; Opus 4.6 can handle the work directly without such decomposition. The Planner and Evaluator were retained: removing the Planner caused the Generator to underestimate scope, starting construction immediately after receiving the raw prompt and producing applications with far fewer features than those planned by the Planner.
After removing sprints, the Evaluator was changed to perform a one-time scan at the end of the entire run, rather than scoring after each sprint.
This changed the Evaluator's cost-benefit calculus: its value depends on which side of the model's capability boundary a task falls on. On 4.5, the boundary was close; the build itself was at the upper limit of what the Generator could do independently, so the Evaluator could find meaningful issues throughout the build. On 4.6, the model's raw capability increased, pushing the boundary outward. Tasks that previously required Evaluator intervention to implement coherently now often fall within the Generator's capability range, making the Evaluator redundant overhead for those tasks. For parts still on the edge of the Generator's capabilities, however, the Evaluator still brings real improvements.
Whether to use the Evaluator is therefore not a fixed yes/no decision; it depends on whether the task exceeds what the current model can reliably complete alone.
Building a DAW Music Software with One Sentence
To validate the updated architecture, researchers used this prompt:
Build a fully functional DAW (Digital Audio Workstation) in the browser using the Web Audio API.
The entire run took about 4 hours and cost $124.
Time and cost breakdown by phase:
| Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 minutes | $0.46 |
| Build (Round 1) | 2 hours 7 minutes | $71.08 |
| QA (Round 1) | 8.8 minutes | $3.24 |
| Build (Round 2) | 1 hour 2 minutes | $36.89 |
| QA (Round 2) | 6.8 minutes | $3.09 |
| Build (Round 3) | 10.9 minutes | $5.88 |
| QA (Round 3) | 9.6 minutes | $4.06 |
| Total | 3 hours 50 minutes | $124.70 |
The build phase ran independently and coherently for over two hours, without needing the sprint decomposition used in the 4.5 era. The Planner expanded a one-line prompt into a full specification; the Generator built, connected, and tested the application before handing off to QA.
QA still found real gaps. First-round feedback noted: This is a strong application with excellent design fidelity, solid AI agents, and a good backend. The main failure point was functional completeness—while the app looked impressive and AI integration worked well, several core DAW features were present only as placeholders without interactive depth: clips couldn't be dragged on the timeline, there was no instrument UI panel, and no visual effects editor. These weren't edge cases but core interactions for DAW usability.
Second-round feedback found more gaps: recording functionality was still a placeholder implementation, clip edge dragging and clip splitting were unimplemented, and effects visualization was a numeric slider rather than a graphical interface.
The final application included all core components of a functional music production program: a working arrangement view, mixer, and transport controls. Researchers pieced together a short song snippet via pure prompting: the agent set tempo and key, arranged the melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives existed, and the agent could autonomously drive them to create a simple production piece end-to-end using tools.
This is still some distance from professional music production software, and Claude itself cannot truly hear sound, which diminishes the effectiveness of QA feedback on musical taste. But the direction is clear.
Principles of Architecture Design
As models continue to improve, the effectiveness of scaffolding will change accordingly.
Some problems will solve themselves with the arrival of the next model; developers can choose to wait. On the other hand, the stronger the model, the larger the space for using architecture to achieve complex tasks that exceed the model's baseline capabilities.
Several principles to take away: always experiment with the model you are building against, read its logs on real problems, and tune until its judgment matches yours. For complex tasks, decomposing the work and applying specialized agents to each sub-problem can open real room for improvement. Whenever a new model is released, it is worth re-examining existing architectures—stripping away components that are no longer critical and exploring the possibilities new capabilities bring.
Interesting architectural combinations will not decrease as models improve; they will just shift. For AI engineers, continuously searching for the next novel combination is the core work.