Engineering Practice: When Codex Becomes the Primary Developer, What Are Human Engineers Doing?
Author: Ryan Lopopolo | February 11, 2026
Original Source: https://openai.com/zh-Hans-CN/index/harness-engineering/
Today, I re-read this article and wanted to share it with everyone.
Over the past five months, a team at OpenAI did something that sounds a bit crazy: they delivered an internal beta version of a software product from scratch, without a single line of code being written by human hands.
This was not a toy project. The product has real internal daily active users and external alpha testers, and it has gone through complete cycles of delivery, deployment, failure, and repair. Every single line of code, from application logic, tests, CI configuration, documentation, and observability to internal tools, was authored by Codex. It is estimated that the work took roughly one tenth of the time manual coding would have required.
Humans steer the ship; agents do the work.
This was a deliberate constraint chosen by the team—precisely to figure out what software engineering looks like when engineers stop writing code and instead focus on designing environments, clarifying intent, and building feedback loops.
Starting from an Empty Repository
In late August 2025, the first commit pushed an empty Git repository.
The initial architecture—repository structure, CI configuration, formatting rules, package manager settings, and application framework—was all generated by the Codex CLI in conjunction with GPT-5, guided by a set of existing templates. Even the AGENTS.md file, which tells the agent "how to work in this repository," was written by Codex itself.
Five months later, this repository has accumulated approximately one million lines of code, covering application logic, infrastructure, toolchains, documentation, and internal developer tools. During this period, about 1,500 Pull Requests were merged, driven by only three engineers orchestrating Codex. On average, each person handled 3.5 PRs per day—and as the team expanded to seven, this throughput continued to increase. More importantly, this wasn't just for the sake of numbers: the product is already running with hundreds of beta users, including heavy users who rely on it every day.
Throughout the entire development process, humans never directly contributed any code. This became the team's core creed: no hand-written code.
The Role of Engineers Has Changed
With no hand-coding involved, engineers' attention shifted entirely to system design, architectural decisions, and leverage.
Early progress was slower than expected, but not because Codex lacked capability; rather, the environment descriptions were not clear enough. Lacking the tools, abstraction layers, and internal structures needed to advance high-level goals, the agents naturally got stuck. At this point, the engineers' main task became "helping the agents get the job done."
In practice, this is a depth-first approach: breaking big goals into small modules (design, coding, review, testing, etc.), prompting the agent to build them one by one, and then using these modules to unlock more complex tasks. When hitting a roadblock, the solution is almost never "try again," but rather asking oneself: What capability is missing? How can we make this capability clear and executable for the agent?
Human-system interaction relies almost entirely on prompts: engineers describe the task, run the agent, and wait for it to open a PR. To move the PR through the pipeline, Codex is made to self-review its changes locally, request other agents to perform specialized reviews, and respond to feedback from humans or other agents, looping until all reviewers are satisfied (essentially a Ralph Wiggum loop). Codex directly calls standard development tools (gh, local scripts, embedded skills) to gather context, eliminating the need for humans to manually paste content into the CLI.
Humans can review PRs, but it's not mandatory. Over time, the vast majority of review work has switched to an agent-to-agent mode.
Making the Application Itself "Readable"
As code output increased, manual QA became the new bottleneck. Since human time and attention are fixed, the team has been finding ways to make the application's UI, logs, metrics, and other elements directly readable by Codex, thereby expanding the agent's autonomy.
For instance, the application supports launching based on git worktrees, allowing Codex to run a separate instance for each change. Additionally, by integrating the Chrome DevTools Protocol into the agent runtime and encapsulating skills for handling DOM snapshots, screenshots, and navigation, Codex can now reproduce bugs, verify fixes, and reason directly about UI behavior.
Observability tools underwent the same treatment. Logs, metrics, and trace data are exposed to Codex via a local observability stack that is temporary for each worktree and destroyed along with the logs and metrics once the task is complete. Codex can query logs using LogQL and metrics using PromQL. With this context, prompts like "ensure the service starts within 800ms" or "no span in these four key user journeys exceeds two seconds" actually become actionable.
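To make this concrete, prompts like those bottom out in queries the agent can actually run. A hypothetical sketch (the service, metric, and label names are invented for illustration; only the LogQL and PromQL operators are standard):

```
# LogQL: recent startup-related errors from the instance under test
{service="app", level="error"} |= "startup"

# PromQL: p95 startup latency, checked against the "starts within 800ms" target
histogram_quantile(0.95,
  sum(rate(app_startup_duration_seconds_bucket[5m])) by (le)) < 0.8

# PromQL: slowest span in one key user journey, against a two-second budget
max(trace_span_duration_seconds{journey="onboarding"}) < 2
```

With queries like these available inside the worktree-local stack, "verify the fix" can mean checking a numeric threshold rather than eyeballing output.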
It is common for a single Codex run to work continuously on a task for over six hours, often executing while humans are asleep.
Treating the Code Repository as a "System of Record"
Context management is one of the biggest challenges for agents to operate effectively on large, complex tasks. One of the earliest lessons the team learned was direct: Codex should be given a map, not a 1,000-page manual.
They tried a "single massive AGENTS.md" approach, and as expected, it failed for several reasons: context is a scarce resource, and a giant instruction file crowds out tasks, code, and relevant documentation; when everything is marked as "important," the agent ends up doing pattern matching locally rather than navigating consciously; furthermore, such files rot quickly, becoming graveyards of obsolete rules that quietly become sources of trouble once humans stop maintaining them.
Therefore, AGENTS.md is no longer an encyclopedia; it is a directory.
The repository's knowledge base is stored in a structured docs/ directory, serving as the system of record. A short AGENTS.md (about 100 lines) is injected into the context, primarily acting as a map pointing to deeper sources of truth elsewhere.
AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
Design documents have indexes and tables of contents, including verification status and a set of core beliefs. Architecture documents provide a top-level map of domains and package layering. A quality score document rates each product domain and architectural layer, continuously tracking gaps.
Plans are treated as first-class citizens. Minor changes use lightweight, temporary plans, while complex work is recorded in execution plans with progress and decision logs, all committed to the repository. Active plans, completed plans, and known technical debt are version-controlled and centrally stored, allowing agents to operate independently without relying on external context.
This achieves progressive disclosure: the agent starts from a small, stable entry point and is guided to find the information needed for the next step, rather than being overwhelmed immediately.
These rules are strictly enforced. Dedicated linters and CI jobs verify that the knowledge base is up to date, cross-referenced, and structurally correct. A regularly scheduled "documentation gardener" agent scans for content that no longer reflects actual code behavior and opens PRs to fix it.
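One check such a linter might run is that every relative link in the knowledge base resolves to a real file. A minimal sketch (the function names are illustrative, not the team's actual tooling):

```typescript
// Matches Markdown links like [text](target.md) or [text](dir/file.md#anchor)
// and captures the path portion of the target.
const LINK_RE = /\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)/g;

/** Extract relative link targets from Markdown source, skipping full URLs. */
function relativeLinks(markdown: string): string[] {
  const targets: string[] = [];
  for (const match of markdown.matchAll(LINK_RE)) {
    const target = match[1];
    // Ignore absolute URLs such as https://... ; keep repo-relative paths.
    if (!/^[a-z]+:\/\//.test(target)) targets.push(target);
  }
  return targets;
}

/** Return the link targets that do not appear in the set of known files. */
function brokenLinks(markdown: string, knownFiles: Set<string>): string[] {
  return relativeLinks(markdown).filter((t) => !knownFiles.has(t));
}
```

A CI job would feed every `docs/` file through a check like this and fail the build on any non-empty result, which is exactly the kind of mechanical enforcement that keeps a map-style AGENTS.md trustworthy.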
Optimizing for Agent Readability
As the codebase grew, Codex's design decision framework also evolved.
Since the entire repository is generated by agents, the team first optimized for Codex's readability. Just as engineering teams strive to make code friendly to new human hires, human engineers aimed to enable agents to deduce the complete business domain directly from the repository.
From the agent's perspective, anything it cannot access within its runtime context simply does not exist. Knowledge residing in Google Docs, chat logs, or human brains is invisible to the system. The only things it sees are local, version-controlled artifacts within the repository—code, Markdown, schemas, and executable plans.
The team realized they needed to push more and more context into the repository. That Slack discussion where the team reached consensus on an architectural pattern? If the agent can't find it, its understanding of the matter is the same as a new hire who started three months late—it knows nothing.
This framework also clarified many trade-offs. The team tends to choose dependencies and abstractions that can be fully internalized within repository reasoning. For agents, "boring" technologies, with their composability, stable APIs, and heavy representation in training data, are often easier to model. Sometimes, having the agent re-implement a subset of functionality is much cheaper than working around opaque upstream behavior in a public library. For example, instead of introducing a generic p-limit style package, the team used their own map helper with a concurrency option: tightly integrated with OpenTelemetry instrumentation, 100% test coverage, and behavior fully under their control.
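A helper of that shape is small enough to own outright. The sketch below is an assumption about what such a function might look like, not the team's actual code, and it omits the OpenTelemetry instrumentation the article mentions:

```typescript
/**
 * Map over items with at most `limit` calls to `fn` in flight at once.
 * A minimal, dependency-free alternative to a p-limit style package.
 */
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until the
  // input is exhausted, so concurrency never exceeds the worker count.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```

Because the whole implementation lives in the repository, an agent can read it, extend it, and reason about its failure modes without modeling an external package's behavior.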
Replacing Micromanagement with Architectural Constraints
Documentation alone cannot sustain the coherence of a codebase entirely generated by agents. By enforcing invariants rather than micromanaging the implementation process, agents can deliver quickly without shaking the foundation.
For instance, Codex is required to resolve data shapes at boundaries, but it is not specified which library to use (the model seems to prefer Zod, but it is not mandatory).
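"Resolve data shapes at boundaries" means unknown input is parsed into a typed value once, at the edge, and everything past that point works with the checked type. A dependency-free sketch of the idea (the `Settings` shape is invented; with Zod the same boundary would be a schema's `parse` call):

```typescript
// A hypothetical settings payload arriving from outside the trusted core.
interface Settings {
  theme: "light" | "dark";
  fontSize: number;
}

/** Parse untrusted input into Settings at the boundary, or throw. */
function parseSettings(input: unknown): Settings {
  if (typeof input !== "object" || input === null) {
    throw new Error("settings must be an object");
  }
  const record = input as Record<string, unknown>;
  const { theme, fontSize } = record;
  if (theme !== "light" && theme !== "dark") {
    throw new Error('settings.theme must be "light" or "dark"');
  }
  if (typeof fontSize !== "number" || !Number.isFinite(fontSize)) {
    throw new Error("settings.fontSize must be a finite number");
  }
  // Past this point the rest of the code sees a fully typed value.
  return { theme, fontSize };
}
```

The invariant being enforced is the parse-at-the-boundary pattern itself, not any particular validation library.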
Agents are most efficient in environments with strict boundaries and predictable structures, so the team built the application around a strict architectural model. Each business domain is divided into a fixed set of layers, with dependency directions strictly verified, allowing only a limited number of edges. These constraints are mechanically enforced through custom linters (also generated by Codex) and structural tests.
The rules are as follows: Within each business domain (e.g., application settings), code can only depend "forward" on a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (authentication, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. No other method is allowed, and this is enforced via automation.
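The forward-only rule is mechanical enough to express in a few lines. A sketch of the check a structural test might apply (the layer names come from the article; the checker itself, including whether same-layer imports are allowed, is an assumption):

```typescript
// Fixed layer order within a business domain, earliest to latest.
const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"] as const;
type Layer = (typeof LAYERS)[number];

/**
 * True iff a module in layer `from` may depend on a module in layer `to`:
 * dependencies may only point at the same layer or an earlier one, so
 * e.g. UI may import Service, but Config may never import Runtime.
 */
function isAllowedDependency(from: Layer, to: Layer): boolean {
  return LAYERS.indexOf(to) <= LAYERS.indexOf(from);
}
```

A custom linter would resolve each import to a layer and fail the build on any edge this predicate rejects, which is what makes the constraint enforceable rather than advisory.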
This type of architecture is usually seriously pursued only when a team scales to hundreds of people. For coding agents, it is an early prerequisite: with constraints, speed does not drop, and architecture does not drift.
In practice, a small set of "taste invariants" serves as a supplement. For example, structured logging, naming conventions, file size limits, and platform-specific reliability requirements are statically enforced via custom lints. Since these lints are customized, fix instructions can be directly injected into error messages, making them easier for agents to understand.
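Embedding the fix instruction in the error message is what makes these lints agent-friendly: the failure itself tells the next run what to do. A hypothetical sketch of one such rule (the rule, the message, and the `logger` API it recommends are all illustrative):

```typescript
interface LintFinding {
  line: number;
  message: string;
}

/** Flag bare console.log calls; the message carries its own fix recipe. */
function lintStructuredLogging(source: string): LintFinding[] {
  const findings: LintFinding[] = [];
  source.split("\n").forEach((text, i) => {
    if (/\bconsole\.log\(/.test(text)) {
      findings.push({
        line: i + 1,
        message:
          "console.log is not allowed; use the structured logger instead, " +
          'e.g. logger.info({ event: "..." }, "message"), imported from ' +
          "the shared telemetry toolkit.",
      });
    }
  });
  return findings;
}
```

For a human reviewer this message is merely convenient; for an agent looping on CI output, it is the difference between guessing and a one-shot fix.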
In human-centric workflows, these rules might seem pedantic. With agents, they become multipliers: once encoded, they apply instantly everywhere.
The team also clearly delineated where to govern and where not to. Similar to leading a large engineering platform organization: boundaries are enforced centrally, while autonomy is allowed locally. Within boundaries, agents have significant freedom in how they express solutions.
Generated code does not always align with human style preferences, and that's okay. As long as the output is correct, maintainable, and clear and readable for future agent runs, it passes.
Human taste continues to feed back into the system. Review comments, refactoring PRs, and user-facing bugs are recorded as documentation updates or directly encoded into tools. When documentation is insufficient, rules are converted into code.
Throughput Changes the Merge Philosophy
As Codex's throughput increased, many traditional engineering norms became obsolete.
The repository operates with minimal blocking merge gates. PR lifecycles are short. Occasional test failures are usually resolved by subsequent re-runs rather than indefinitely stalling progress. In a system where agent throughput far exceeds human attention, the cost of correction is low, while the cost of waiting is high.
Doing this in a low-throughput environment would be irresponsible. Here, it is often the correct choice.
What "Agent-Generated" Really Means
When we say the entire codebase is generated by Codex agents, we mean the entire codebase in the truest sense.
Agent-produced artifacts include: product code and tests, CI configuration and release tools, internal developer tools, documentation and design history, evaluation frameworks, review comments and replies, scripts managing the repository itself, and production dashboard definition files.
Humans are always present, but the level of abstraction for their work is completely different from the past. The team prioritizes work, translates user feedback into acceptance criteria, and validates results. When an agent gets stuck, it is seen as a signal: identify what is missing—tools, guidance and constraints, documentation—feed it back into the repository, and let Codex write the fix itself.
Agents can directly use standard development tools, pull review feedback, reply inline, push updates, and often squash-merge their own PRs.
Autonomy is Continuously Improving
As more development steps were encoded directly into the system (testing, validation, review, feedback handling, and recovery), the repository recently crossed a significant threshold: Codex can now drive a new feature end-to-end.
Given a prompt, the agent can now complete the entire process: verify the current state of the codebase, reproduce a reported bug, record a video demonstrating the fault, implement a fix, run the application to verify the fix, record a second video showing the solution, open a PR, respond to agent and human feedback, detect and fix build failures, hand over to humans only when judgment is needed, and finally merge the change.
This capability relies heavily on this repository's specific structure and tools and should not be assumed to generalize unconditionally to other scenarios—at least not yet.
Entropy and Garbage Collection
Fully autonomous agents also bring new problems. Codex replicates existing patterns in the repository—including those that are less than ideal. Over time, this inevitably leads to drift.
Initially, humans handled this manually—the team used to spend 20% of their time every Friday cleaning up "AI residue." This was clearly not scalable.
Later, the team began encoding what they call "golden principles" directly into the repository and established a recurring cleanup process. These principles are opinionated but mechanically checkable rules aimed at keeping the codebase readable and consistent for future agent runs. For example: prefer shared toolkits over hand-written helper functions so invariants can be managed centrally; and never probe data "YOLO-style", but validate at boundaries or rely on typed SDKs so agents do not build on guessed structures.
The team regularly runs a batch of background Codex tasks to scan for deviations, update quality scores, and initiate targeted refactoring PRs. Most can be reviewed and automatically merged within a minute.
This mechanism is similar to garbage collection. Technical debt is like a high-interest loan: paying small amounts continuously is far better than letting it pile up and tackling it all at once. Once human taste is captured in the system, it continuously acts on every line of code. This allows bad patterns to be discovered and resolved daily, rather than spreading through the codebase for weeks.
Things We Are Still Learning
So far, this strategy has performed well in OpenAI's internal releases and adoption. Building real products for real users helps anchor the team's efforts in reality and guides them toward long-term maintainability.
What remains unclear is: how will architectural coherence evolve over time in a fully agent-generated system? In which areas can human judgment play the greatest role, and how can we encode this judgment to make it even more effective? As model capabilities continue to enhance, what will this system evolve into?
One thing is clear: building software still requires discipline, but the discipline is increasingly reflected in the supporting structures rather than the code itself. Tools, abstractions, and feedback loops that maintain codebase coherence are becoming increasingly important.
The thorniest current challenges center on designing environments, feedback loops, and control systems that help agents build and maintain complex, reliable software at scale.
As agents like Codex occupy an increasing proportion of the software lifecycle, these questions will only become more critical.