Edited by Yunzhao
As many of you may have sensed, since February, the buzz around "Context Engineering" and "Vibe Coding" has given way to a new term: "harness engineering."
The person who popularized this name is Ryan Lopopolo, a core engineer on OpenAI's Frontier team.
He recently published a long-form article that became a hot topic, arguably the definitive work on "harness engineering."
In this article, Ryan reveals that the newly formed OpenAI Frontier team has become OpenAI's largest Codex user. The entire team started with just three people and, after five months of extreme experimentation, ended up running a codebase of one million lines—with zero lines written by humans.
Even more fascinating for "dark factory" enthusiasts, the entire codebase had no human code review before merging.
So, how did the OpenAI Frontier team achieve zero manual coding through "harness engineering"?
Just yesterday, on the latest episode of the Latent Space podcast, Ryan finally came forward with an in-depth account of this mysterious OpenAI project!
He said the core secret to implementing this "ghost library," called Symphony, is: when AI fails, don't immediately reach for a better prompt; instead, ask what capability, context, or structure is missing.
Therefore, the final result is that this "ghost library" itself contains no actual code, yet provides all the context, specifications, and workflows needed for AI agents to autonomously generate one million lines of code. This shift in mindset allowed them to increase development speed by 10x. The team no longer even reviews code; once the AI finishes writing, it can be merged directly.
Of course, the podcast also detailed the seven-layer architecture design of this ghost library. Interested readers can scroll down for more details. What's worth noting is Ryan's personal pessimism about MCP: It's already dead!
Because it forces a massive injection of tokens into the context, affects autocompaction, and agents might even forget how to use these tools.
Additionally, as a core engineer in OpenAI's newest team, Ryan offered several rigorous judgments about the future of software.
First, Ryan believes that future software must first be readable by Agents. "If software is full of implicit context, Agents cannot work effectively."
How can this readability be judged?
Ryan revealed that, to pursue the fastest possible inner loop, his team enforced a "one-minute build discipline." If the build time exceeds one minute, they intervene and refactor to keep the agent's feedback loop short enough. In this mode, the code becomes highly modular, observable, and "token-efficient."
This also means that much of the vocabulary and many of the practices of the traditional software development paradigm are changing. For example, Ryan points out that open-source software dependencies might disappear: in his view, low-to-medium-complexity software of a few thousand lines can be entirely rewritten by AI and internalized by the model.
For another example, the traditional definition of a software bug is being redefined. In the agent era, "a bug is agent-written code that is inconsistent with a non-functional requirement that hasn't been written down yet."
Furthermore, the traditional MVC pattern has also been redefined. The AI-native version is Model-View-Claw—Claw is the harness.
Secondly, it's worth mentioning that OpenAI's "AI-native" development process is also full of "humor."
Ryan revealed that "Humor is part of AGI." They even teach Agents to understand company culture, including generating memes and interacting on Slack. In their view, a sense of humor is an important test of intelligence, and the latest 5.4 model is already quite amazing at "meme-ing."
Another question I'm sure many of you are curious about is: What exactly does OpenAI's new Frontier team want to do? Is OpenAI going all-in on the toB (business-to-business) track?
Ryan also candidly shared the team's "grand plan." Ryan stated that Frontier's ultimate vision is: to help enterprises deploy Agents safely and at scale. The product is essentially a system for "distributing Specs," capable of integrating with enterprise IM, security tools, and workflow tools.
Ryan also revealed that the Codex app has surpassed 2 million weekly active users and is growing at 25% per week. The figure signals OpenAI's ambition to become a genuinely AI-native platform for enterprises in the toB sector.
Additionally, the host observed an important signal of OpenAI's expansion: "decentralizing from San Francisco." Ryan explained that OpenAI is indeed expanding to Seattle, New York, and London. "I know people who turned down jobs at OpenAI because they didn't want to move to San Francisco," Ryan joked. "Now they have no choice."
Ryan himself was one of the first engineering hires at the Seattle office, which has a "Mad Men-style office vibe," while the new Bellevue office is "very green, gold fixtures, very Pacific Northwest."
Clearly, this newly established Bellevue, Seattle office is becoming a key pivot point for OpenAI's toB vision.
Also worth mentioning: Ryan says he is now addicted to this kind of "harness engineering." He admits he has fed the agent past the point of no return, even coding with his laptop half-closed on a plane.
"It's hard to stop poking the machine because it makes me want to feed it. This continuous feedback loop, watching the AI work autonomously and constantly generate code, seems to have a special attraction."
Ryan even gave a staggering number: the team uses 1 billion tokens daily (approximately $2,000-$3,000). He says if you don't do this, it's "negligence." Being stingy with tokens is being stingy with efficiency.
Due to space constraints, I won't expand further here. Overall, it was another alpha-saturated sharing session.
Below are the key insights organized by the editor.
The Definitive Work on Harness Engineering
Host:
Alright, let's get to it. We're in the studio with Ryan Lopopolo from OpenAI. Welcome. Thank you for coming to San Francisco and for taking the time to be on our show. You wrote a viral article on harness engineering that is likely to become the defining piece for this emerging field.
Ryan Lopopolo: Thanks. It's actually quite interesting—in a way, it feels like we defined how the discussion in this field happens.
Host: Let's add some context first. This is your first time on a podcast, right? Can you talk about your background? Which team are you on? What do you do?
Ryan Lopopolo: I'm on the Frontier team at OpenAI, mainly doing frontier product exploration and new product development. Frontier is an enterprise platform, core to letting agents be deployed safely and at scale in enterprises with good governance. My team's job is to explore how to package models as products and sell them as solutions to enterprise customers.
Host: Let me add your background: Snowflake, Stripe, Citadel, right?
Ryan Lopopolo: Yeah, basically the same type of clients.
Host: But I actually didn't expect your background at first. When I looked at your Twitter, I felt the complete opposite—all "all-in on AI coding" style, like strapping a computer to a Waymo. Then I look at your resume, and you're very "serious" in traditional enterprise environments. It's an interesting combination.
Ryan Lopopolo: Haha, being an "AI maximalist" is actually quite fun. If you want to live up to that persona, OpenAI is the best place for it. And there are no rate limits internally, so I can "sprint" as you said.
Host: So you're a "special squad" within the Frontier team.
Ryan Lopopolo: Right, we were given some free rein to experiment, which is why I initially set a rather extreme constraint: write absolutely no code myself. My thinking was, if we're building agents that can be deployed to enterprises, they should be able to do everything I can do. In the past 6 to 8 months using these coding models and harnesses, I truly feel the model capability is sufficient, and the harness is mature enough to be almost isomorphic to me in ability. So I started from the "no writing code" constraint, forcing myself to complete work through agents.
Basic Architecture Approach: If the Model Can't Do It, Break It Down
Goal: Compress Build Time to 1 Minute
Host: To give some background, this is basically the core of your article: you spent 5 months developing an internal tool, wrote 0 lines of code yourselves, but the final codebase exceeded 1 million lines, and you say efficiency was 10x faster than manual work.
Ryan Lopopolo: Yeah, that was basically the thought process. We started with very early Codex CLI and Codex mini models, which were much weaker than now. But that was actually a good constraint. When you ask the model to implement a feature and it can't piece it together, the frustration is very real. So we formed a methodology: When the model can't do it, break the task down, build smaller modules, then combine them into larger goals. Honestly, the first month and a half was very inefficient, probably 10x slower than me writing code by hand. But because we paid that cost, we eventually built an "assembly line" that lets agents complete the entire development process, outperforming any single engineer.
Later we went through GPT-5.1, 5.2, 5.3, 5.4, and other model versions, and each generation behaved differently, forcing us to adapt the codebase to model changes. An interesting example: in 5.2, the Codex harness didn't have a background shell, so we could rely on blocking scripts to execute long tasks. But in 5.3, with a background shell, the model became less "patient" and unwilling to wait on blocking tasks. So we had to refactor the entire build system to compress build time to under 1 minute. In a multi-person, opinionated codebase, this is almost impossible. But because our sole goal was to let agents run efficiently, we switched from Makefile to Bazel, to Turbo, to NX in one week, finally settling on NX because the build speed was fast enough.
Host: You switched from Turbo to NX, that's interesting; many people do the opposite.
Ryan Lopopolo: Honestly, I don't have much experience with frontend repo architecture. You can talk to Josh about this; he's the one who set up that system. I know the NX team and am familiar with Turbo (from Jared Palmer), so the comparison is interesting. But our core goal was actually simple: make the build fast. Our app is a React + Electron monolith, requiring build time under 1 minute.
Host: I'm not that familiar with "background shell."
Ryan Lopopolo: Simply put, Codex can start tasks in the background and then continue doing other things, like running a build while reviewing code. This improves overall time efficiency.
Host: Why exactly 1 minute, not 5 minutes?
Ryan Lopopolo: We want the inner loop as fast as possible; 1 minute is just a memorable target we could achieve. If it exceeds that, we treat it as a signal: stop, break down the task, optimize the build graph, then let the agent continue working.
Host: It sounds like a "ratchet mechanism" where you enforce build time discipline, otherwise it keeps expanding.
Ryan Lopopolo: Exactly. Traditional platform teams allow build times to slowly lengthen until they become unacceptable, then spend weeks optimizing. But now tokens are cheap, and models are highly parallel, so we can continuously "prune" the system to maintain these invariants. This makes the codebase more stable and controllable and lets us rely on more determinism during development.
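This ratchet could be enforced mechanically. Below is a minimal sketch of such a gate in Python; the 60-second threshold matches the "one-minute discipline" from the conversation, but the function name and return shape are illustrative, not the team's actual tooling:

```python
import subprocess
import time

BUILD_BUDGET_SECONDS = 60  # the "one-minute discipline" threshold


def check_build_budget(build_cmd, budget=BUILD_BUDGET_SECONDS):
    """Run the build and report whether it stayed inside the budget.

    Exceeding the budget is treated as a signal to stop, break the
    task down, and prune the build graph, not as a flaky failure to
    retry later.
    """
    start = time.monotonic()
    result = subprocess.run(build_cmd, shell=True)
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return False, elapsed, "build failed"
    if elapsed > budget:
        return False, elapsed, f"build took {elapsed:.1f}s, over the {budget}s budget"
    return True, elapsed, "ok"
```

Run as a CI step or pre-merge hook, a failed budget check becomes the explicit trigger for the refactoring intervention Ryan describes, rather than something a platform team notices weeks later.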
OpenAI Doesn't Really Rely on Humans for Code Review Anymore
Host: You mentioned in your article that humans actually became the bottleneck. At first, your team had only three people. You produced nearly a million lines of code and 1,500 PRs—what was the thinking behind this?
Code itself is somewhat "disposable," but you did a lot of review. You mentioned in the article that everything should be "prompt-ized"; anything an agent can't see is basically garbage and shouldn't exist. So overall, how did you build this system? And after humans became the bottleneck, how do humans still participate, like at the PR review level?
Ryan Lopopolo: Actually, we don't rely much on "humans doing code review" anymore. Most human review now happens after merge. Merge itself isn't really a review; it's more like a ritual for peace of mind. Fundamentally, models can be infinitely parallel; as long as I'm willing to invest GPU and tokens, it can infinitely scale capacity.
The only truly scarce resource is the team's synchronous human attention. There are only so many hours in a day; we need to eat and want to sleep, though it's hard to stop "poking the machine" because you can't help but want to keep feeding it.
So you have to step back and think in systems terms: where are agents making mistakes? Where is my time going? How do I avoid spending that time again? Then distill that experience into automation, solving part of the SDLC problem.
At first, we had to watch the code very carefully, because agents didn't have good enough "building blocks" to generate modular, decomposable, reliable, observable systems, let alone a working frontend. To avoid sitting in front of a terminal all day handling only one or two issues, we invested heavily in observability for the model, which is the diagram you saw in the article.
Host: Let's talk about this tracing system. Did the trace come first or the application?
Ryan Lopopolo: At first there was only the application; then came Vector for logging, then metrics, then an API. It took me about half an afternoon to set up. We deliberately chose high-level, high-productivity tools; the ecosystem is quite good now. For example, we heavily use a tool called MI to easily pull a whole set of Go-written VictoriaMetrics components into the local dev environment; add some Python glue code to run these services and it's usable. A key design decision is that we try to "invert" the whole system: not set up the environment first and then put agents in it, but use the agent as the entry point. Start Codex directly, then, through skills and scripts, let it start the entire dev environment itself when needed, and configure environment variables so the local app points to the services it brings up.
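A minimal sketch of that inversion, with the agent as entry point: a bootstrap skill starts services on demand and exports the environment variables the local app reads. The service names, commands, ports, and variable names here are all invented for illustration:

```python
import os
import subprocess

# Hypothetical local services an agent can bring up on demand.
SERVICES = {
    "metrics": {"cmd": ["victoria-metrics"], "env": "METRICS_URL",
                "url": "http://localhost:8428"},
    "logs": {"cmd": ["vector", "--config", "vector.toml"], "env": "LOGS_URL",
             "url": "http://localhost:9000"},
}


def bootstrap(env=None, runner=subprocess.Popen):
    """Start each service and point the local app at it via env vars.

    The agent invokes this only when a task actually needs the
    environment, instead of a human pre-provisioning it. `runner`
    is injectable so the skill can be exercised without the real
    binaries installed.
    """
    env = env if env is not None else os.environ
    procs = []
    for name, svc in SERVICES.items():
        procs.append(runner(svc["cmd"]))
        env[svc["env"]] = svc["url"]  # the local app reads these at startup
    return procs
```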
I think this is the fundamental difference between current reasoning models and the 4.1, 4o era models. Before, models didn't "think"; you had to lock them in a box with clear state machines. But now, we let the model and harness become the system itself, giving it enough options and context to make its own decisions.
Host: It sounds like many things are evolving toward "scaffold." But interestingly, with reasoning models, it seems like you don't need such heavy scaffolds anymore. You use structures like spec.md, very short agent.md.
Ryan Lopopolo: Right, we did define a structure, like a master directory (about 100 lines), with various "skills" underneath. The benefit is you can inject new content into the codebase at very low cost while guiding both human and agent behavior.
Host: In a sense, you "reinvented agent skills" from scratch.
Ryan Lopopolo: Indeed, because when we started, this concept didn't exist. We have a master directory file, then various small skills, like core beliefs.md, tech debt tracker, etc.
The tech debt tracker and quality score are interesting; they're essentially a minimal scaffold—a markdown table acting as a Codex hook. It checks our defined business logic, evaluates if it meets guardrails, then generates follow-up tasks for itself. Before systems like Jira, we just recorded these follow-ups in markdown, then could periodically start an agent to "pay down debt."
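A minimal sketch of the markdown-table-as-hook idea: parse the table, apply a guardrail, and emit follow-up tasks an agent can be pointed at during a periodic "pay down debt" run. The column names and scoring threshold are invented for illustration:

```python
def parse_debt_table(markdown):
    """Parse a markdown table of tech-debt items into follow-up tasks.

    Rows scoring below the guardrail become tasks; everything else
    is considered healthy and skipped.
    """
    rows = [line for line in markdown.strip().splitlines() if line.startswith("|")]
    header = [c.strip() for c in rows[0].strip("|").split("|")]
    tasks = []
    for line in rows[2:]:  # skip the header and separator rows
        cells = dict(zip(header, (c.strip() for c in line.strip("|").split("|"))))
        if int(cells["Score"]) < 7:  # guardrail threshold (illustrative)
            tasks.append(f"Pay down: {cells['Area']}: {cells['Note']}")
    return tasks
```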
Here's a key insight: models "crave text." Much of what we do is essentially injecting more text into the system. For example, an online alert triggered due to missing timeout. I can directly @Codex in Slack and say "I want to add a timeout, also please update our reliability docs requiring all network requests must have timeouts." This way I fixed a bug and permanently encoded "what is good practice" into the system. This knowledge passes through context to subsequent coding agents.
You can also generate tests based on this text, or drive code review agents, narrowing the code's "acceptable space."
Leave the Model Some Judgment Space to Prevent Over-Compliance
Host: But here's a problem: you think you're making a "long-term correct" rule, but actually you might be ignoring exceptions, and later have to roll back.
Ryan Lopopolo: Indeed, models sometimes "over-comply." So in design, we leave them some judgment space. For example, the quality score tool isn't mandatory every time; the model decides when to call it.
At the prompt level, we also allow agents to "push back." When we first introduced the code review agent, the flow was: Codex generates code locally → pushes a PR → triggers the review agent → it writes comments → we require Codex to respond to those comments. The initial problem was that the coding agent was too "obedient," getting led around by the reviewer, so the system didn't converge. So we adjusted both sides' prompts: the review agent is asked to lean toward passing (only raising P2-or-lower issues; P0 is the level that would destroy the whole codebase), while the coding agent is allowed to reject or defer review comments. That matches reality: some review comments are just FYI, not requests for an immediate fix. Without this context, agents mechanically execute all instructions.
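The two-sided adjustment amounts to a severity contract between the reviewer and the coder. A toy version (the P0-P2 scale matches the transcript; the triage function and comment shape are invented):

```python
BLOCKING_LEVELS = {"P0", "P1"}  # P0 would destroy the codebase; P2+ is advisory


def triage_review(comments):
    """Split review comments into must-fix and deferrable.

    Comments at P2 or below are treated as FYI: the coding agent may
    acknowledge and defer them instead of mechanically complying,
    which is what keeps the review loop from diverging.
    """
    must_fix = [c for c in comments if c["severity"] in BLOCKING_LEVELS]
    deferred = [c for c in comments if c["severity"] not in BLOCKING_LEVELS]
    return must_fix, deferred
```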
Host: Let me confirm a point: these agents can auto-merge, right? This is something many people can't accept. And your scope is almost full-stack—product code, tests, CI, release tools, internal tools, docs, review comments, even repo management scripts—all written by agents.
Ryan Lopopolo: Yes, basically everything. And they run in parallel.
Host: Is there an "emergency brake"? Like anyone on the team can one-click stop everything?
Ryan Lopopolo: Because we're doing native apps, not infrastructure, we don't do continuous deployment. Release branch cutting still requires human involvement, and must pass manually approved smoke tests before going live. In other words, there's still human gatekeeping at the release stage.
Host: So you're not building those "99.999% availability" infrastructure systems.
Ryan Lopopolo: Right, that's the case. And to emphasize: all this was done in a "brand new repo." This doesn't mean it can be directly applied to all production environments.
GPT-5.4 is the First Model to Fuse Top-Tier Coding and Reasoning
Already Letting Codex Write Its Own Blog
Host: At first, the onboarding month was basically "working backwards"; you had to constantly adapt to the system. But now you're very automated. I'm curious: what proportion of humans are still in the loop? What bottlenecks do you hope to further automate? And how do you think model capabilities will develop, further replacing humans? For example, we just got GPT-5.4, a very strong model.
Ryan Lopopolo: Yes, this is the first model to fuse top-tier coding ability and reasoning ability—both Codex-level code ability and general reasoning, plus computer use support. Now I can even let Codex write blogs directly, whereas before I had to switch between chat and coding... maybe I'll be out of a job soon (laughs).
Host: That makes me think, you could use 5.4 to make a completely AI-driven newsletter.
Ryan Lopopolo: Right, that's actually an example of "closing the loop." Like the dashboard you mentioned, we let Codex write the Grafana dashboard JSON, publish them, and respond to alerts. That is, when an alert triggers, it knows exactly which dashboard, which alert, and which log line in the codebase corresponds, because all that information is unified.
Host: So it must "own everything."
Ryan Lopopolo: Exactly, this is crucial. This means if a failure occurs that didn't trigger an alert, it can also find gaps in the monitoring system based on existing dashboards, metrics, and logs, and fix them all at once. It's like a full-stack engineer pushing features from backend all the way to frontend.
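The "owns everything" property can be modeled as a single registry linking each metric to its alert, dashboard, and emitting code location. A toy sketch (all entries and field names invented): the same table that routes a firing alert to the right panel also exposes monitoring gaps.

```python
# Toy unified registry: every metric is declared once, tying together
# its alert, dashboard panel, and emitting code location.
REGISTRY = [
    {"metric": "request_timeouts", "alert": "timeout_rate_high",
     "dashboard": "reliability", "source": "net/client.py:42"},
    {"metric": "queue_depth", "alert": None,  # gap: no alert defined yet
     "dashboard": "throughput", "source": "worker/pool.py:17"},
]


def locate(alert_name):
    """Given a firing alert, return the dashboard and code line to
    inspect; possible only because all three live in one place."""
    for entry in REGISTRY:
        if entry["alert"] == alert_name:
            return entry["dashboard"], entry["source"]
    return None


def monitoring_gaps():
    """Metrics with no alert: coverage holes an agent can go fix."""
    return [e["metric"] for e in REGISTRY if e["alert"] is None]
```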
"No Obsession with Code Details Anymore"
The Key is "System Primitives"
Host: It sounds like a lot of work you did is making software fit "how models write," not "human readability." That is, shifting from human-legible to agent-legible. What does this mean for larger teams? Like inside OpenAI, or the whole software engineering industry—does everyone need to switch? After all, this is a very radical change.
Ryan Lopopolo: My mindset is: I've "detached" from specific execution. I don't have many opinions about code details; it's more like managing a 500-person team—in that case, you can't dive deep into every PR detail. So we use post-merge review as an analogy: I just sample some code, inferring team problems, bottlenecks, where support is needed, where things are smooth, then adjust my focus.
I don't have much obsession with "how code is specifically written," but I'm very concerned about "system primitives." For example, we have a command-based class to encapsulate reusable business logic, with built-in tracing, metrics, observability. These primitives are key—as long as code uses these primitives, it naturally has leverage. So the focus isn't code structure, but whether the right abstractions are used.
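A rough Python analogue of such a primitive (the real system is a TypeScript/Electron app; the class name and the metrics/trace stand-ins below are invented): business logic subclasses one base, and tracing and metrics come for free, which is the "leverage" Ryan describes.

```python
import time

METRICS = {}  # name -> list of durations; stand-in for a real metrics backend
TRACES = []   # stand-in for a real trace exporter


class Command:
    """Base primitive: subclasses implement run(); tracing and metrics
    are built in, so agents only write the business logic."""

    name = "command"

    def run(self, **kwargs):
        raise NotImplementedError

    def __call__(self, **kwargs):
        start = time.monotonic()
        try:
            result = self.run(**kwargs)
            TRACES.append({"command": self.name, "ok": True})
            return result
        except Exception:
            TRACES.append({"command": self.name, "ok": False})
            raise
        finally:
            METRICS.setdefault(self.name, []).append(time.monotonic() - start)


class Double(Command):
    """Trivial example of business logic riding on the primitive."""
    name = "double"

    def run(self, value):
        return value * 2
```

Any code routed through the primitive is automatically observable, so the question during review shifts from "is this code well written?" to "does it use the right abstraction?"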
AI-Era MVC: Model-View-Claw
Host: This goes back to the systems thinking in your article, like how to enforce architecture, how to encode "engineering taste."
Ryan Lopopolo: Yes. And as models improve, they get better at proposing abstractions to help themselves unlock problems. This lets me stand higher and think about what's truly blocking team delivery.
Host: Your project is essentially a million-line Electron app that also manages its own services, kind of like BFF (backend for frontend).
Ryan Lopopolo: Right, we do have a backend, but it's deployed in the cloud. Inside Electron, there are main and renderer processes, naturally forming an MVC-like structure, and we apply the same strict layering.
Here's a fun joke: traditional MVC is Model-View-Controller, but I think the AI-native version is Model-View-Claw—Claw is the harness.
Host: That's a clever way to put it.
Coding Agents Don't Just Write Code—They Do Everything
Ryan Lopopolo: I do think Codex + harness as an AI product-building approach has huge exploration space. Models are improving fast at coding; each generation significantly increases the task complexity they can handle. If you can "compress" product problems into code problems, then solving them with the Codex harness is very natural: it handles all the infrastructure, and you just drive it with prompts.
And this is a very "understandable" capability extension for engineers: you just give the model those scripts you already know how to write.
Host: In other words, coding agents don't just write code; they "swallow" all knowledge work. Many think non-coding tasks need separate agents, but actually you're expanding upward from coding agents.
Ryan Lopopolo: Right, essentially you just define tasks as code problems, everything becomes a coding agent.
Revealing the "Fully Delegated" Skill Used Internally
Host: Then a practical question: are systems like tickets and PRs going to be completely restructured? Because Git itself is very unfriendly to multi-agent setups.
Ryan Lopopolo: We heavily use worktrees, but even so, merge conflicts still exist. But models are actually very good at resolving conflicts. And when I'm not staring at the terminal synchronously, those conflicts are almost "negligible" to me.
We have a skill called "dollar land" that guides Codex through the complete PR lifecycle: create the PR, wait for human and agent review, wait for CI to pass, fix flaky tests, handle conflicts, re-merge, enter the merge queue, until it hits the main branch. This is what "full delegation" means. For humans, this is a very heavy process, but agents can handle it completely; I basically just need to keep my computer on.
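The lifecycle that skill walks through can be thought of as a small state machine. A toy sketch (the states paraphrase the transcript's list; the event names and transition table are invented):

```python
def drive_pr(events):
    """Advance a PR through its lifecycle from a stream of events,
    returning the states visited. The agent keeps looping (fix CI,
    resolve conflicts, re-enter the queue) until the PR lands."""
    transitions = {
        ("created", "review_passed"): "approved",
        ("approved", "ci_green"): "merge_queue",
        ("approved", "ci_flaky"): "fixing_ci",
        ("fixing_ci", "ci_green"): "merge_queue",
        ("merge_queue", "conflict"): "resolving_conflict",
        ("resolving_conflict", "ci_green"): "merge_queue",
        ("merge_queue", "landed"): "merged",
    }
    state = "created"
    visited = [state]
    for event in events:
        state = transitions.get((state, event), state)  # ignore irrelevant events
        visited.append(state)
    return visited
```

Every loop in this graph (flaky CI, merge conflicts) is work a human would experience as draining context switches but an agent just grinds through.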
I used to have strong control tendencies, but now I actually think: in many things it really does better than me—given enough context.
Pay Attention to Every Agent BUG
Host: Is there anything you feel wasn't clear in the article but people are discussing?
Ryan Lopopolo: One thing I might not have explained clearly: the docs, tests, and review agents we write are essentially injecting "non-functional requirements" (like high availability, high quality, maintainability) into the model's context. We either write them as docs or indicate correct practices through lint errors. The essence of the whole system is making all the implicit knowledge in engineers' heads about "what is good" explicit, so agents can learn.
So we pay special attention to agent bugs—because every bug means there's some "not yet written specification." This is actually the driving force for system evolution.
Host: So what people misunderstood before was?
Ryan Lopopolo: It wasn't a misunderstanding, exactly. Someone happened to point this out, and I realized: right, that's the core of what I really wanted to express.
Feeding Everything to GPT-5.4 is Also Very Valuable
Host: I see, interesting. A fascinating phenomenon is many people directly throw your article link to Pi or Codex and say "turn my repo into this." It's like achieving "complete recursion." And it actually works well?
Ryan Lopopolo: Yes, surprisingly well. I actually tried it with 5.4 yesterday. I was traveling to give a talk and didn't have much time, so I thought: can I quickly build a similar scaffold based on this article? I did a version, then took a small side project (completely non-production-grade stuff I'd thrown together with voice TTS) and asked: if I want to fully automate this into such a system, how should I modify it? This process is very valuable, because it doesn't just help you modify code; it's more like "analyzing" your system. You feed all the code, the context, and the article to it together, and it walks you through the problems and improvement directions step by step.
Open Source Software Dependencies May Disappear
AI Will Internally Rewrite
Host: I want to mention another point: your board chair Bret Taylor also responded to your article. He said software dependencies might disappear, and in the future could be directly "vendored." What do you think?
Ryan Lopopolo: I basically agree. But the reality is you still pay for services like Datadog and Temporal. With current model capabilities, dependencies we can "internalize" are roughly low to medium complexity.
Host: What does "medium complexity" specifically refer to?
Ryan Lopopolo: Probably dependencies of a few thousand lines of code that we can easily rewrite in an afternoon. And a key point: you don't actually need all its features. By internalizing dependencies, you can strip away all generic logic and keep only the parts you truly need.
Host: I keep saying this is "the end of plugins."
Ryan Lopopolo: Indeed. Because open-source projects introduce lots of redundancy and complexity for generality. But if you implement it yourself, you only need the minimal set.
There's another practical benefit: when we deploy Codex security review in the repo, it can directly modify these "internalized" dependencies without going through traditional flow—file PR, wait for upstream release, pull down, handle compatibility. The friction cost of this whole flow is much lower. With tokens being cheap, code itself becomes "cheap."
Host: But the counterargument is also clear, like large-scale testing and security. Systems like Linux and MySQL rely on "crowdsourced review" to ensure quality. If you rewrite yourself, you'll likely repeat mistakes others already made.
Ryan Lopopolo: Exactly. Once you internalize dependencies, you're back to "starting from zero," needing to re-establish confidence in code quality.
All Internal AI Tools Are Written by Codex
Let Agents Complete Tasks—Humans Are the Bottleneck
Host: Back to what you said at the start: the whole system, including internal tools, was written by Codex, right? Even visualization tools?
Ryan Lopopolo: Right, I now do AI-related internal tools basically by prompting. A few days ago I showed someone, they asked how long it took, I said—I actually didn't spend time (laughs).
Here's an interesting example: after deploying the app to the first internal users, we hit performance issues, so we had them export a trace (a tar archive) to the on-call engineer. He used Codex to make a beautiful local tool (a Next.js app) to drag-and-drop this file and visualize the whole trace. Very well done, took an afternoon. But later we realized this was completely unnecessary: you can just throw the tar archive at Codex, let it analyze it, and get results in minutes.
So optimizing debugging workflows for "human readability" is actually the wrong starting point. It unnecessarily involves humans where agents could complete the job directly.
Host: This really requires fighting intuition—what we used to do is actually inefficient here.
Ryan Lopopolo: Yes. For example, traditionally you'd deploy Jaeger to see traces, but now you don't even need to see traces, because you won't fix code yourself.
Host: So the core is: you need a complete "self-contained system stack" and let agents fully control it.
Ryan Lopopolo: Right, this is crucial. And we'll share more about this later.
Revealing Ghost Library Build Process:
Three Codex Instances Constantly Looping
Host: We'll talk about Symphony later. You now distribute software using "spec"—some call this "ghost libraries," which is a cool term.
Ryan Lopopolo: Yes, this approach drastically reduces software distribution costs. You just define a spec, and coding agents can reconstruct the system locally based on that spec.
The process is interesting too: we extract the scaffold from the original repo and create a new repo, then let Codex generate a spec based on the original repo; then it starts a new Codex instance to implement that spec; then another Codex compares the implementation with the original code, continuously refining the spec to make them more consistent. This loops until the spec can reproduce the whole system with high fidelity.
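The three-instance loop can be sketched as a convergence driver. Here the three Codex roles are stubbed as plain callables; everything below is an illustrative harness shape, not OpenAI's code:

```python
def converge_spec(original, write_spec, implement, compare, max_rounds=10):
    """Drive the loop: spec <- write_spec(original, diffs);
    impl <- implement(spec); diffs <- compare(original, impl).

    write_spec plays Codex #1 (spec author), implement plays Codex #2
    (a fresh implementer that sees only the spec), compare plays
    Codex #3 (the comparator). Stops when compare reports no
    differences, i.e. the spec reproduces the system with high fidelity.
    """
    diffs = None
    for round_num in range(1, max_rounds + 1):
        spec = write_spec(original, diffs)
        impl = implement(spec)
        diffs = compare(original, impl)
        if not diffs:
            return spec, round_num
    raise RuntimeError("spec did not converge within max_rounds")
```

The key property is that the implementer never sees the original code, only the spec, so any surviving differences are by definition things the spec failed to say.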
Humans Are Better for "Complex + Brand New" Problems; Leave the Rest to Agents
Host: And in this process, you basically didn't introduce human bias.
Ryan Lopopolo: Right. When humans write specs, they often bring their own ideas, like "I think it should be done this way," but actually agents can find better solutions themselves.
I've been thinking about a question lately: can an agent write a spec it can't implement itself? That is, can it imagine systems beyond its own capabilities?
I think this can be viewed with a two-dimensional coordinate: problems are "simple/complex" and "existing/new." For "complex + brand new" problems, humans are still needed; but other quadrants can already be solved.
This means humans can spend time on the most valuable places—those truly unknown territories, or system designs needing deep refactoring.
Host: This is actually pushing humans to higher levels of abstraction.
Ryan Lopopolo: Exactly, that's what I want to do.
The Birth of Ghost Libraries
Host: Let's officially talk about Symphony. You chose Elixir, interesting.
Ryan Lopopolo: Right, but Elixir was actually a result of the model's choice. It chose Elixir because its process model (like supervision, GenServer) is very suitable for our task orchestration approach. We're essentially starting a small "daemon" for each task and driving it to completion.
This means the model gets many capabilities "for free," like concurrency, recovery, etc. I even went to learn some Elixir and BEAM knowledge. Although most people don't need this scale of concurrency, it does provide a good mental model.
Host: How was Symphony born?
Ryan Lopopolo: In late December last year, each engineer was doing about 3.5 PRs per day. By early January, with 5.2 model online, without extra optimization, this number jumped to 5-10 PRs per person per day.
Ryan Lopopolo: I don't know if you feel the same, but this frequent context switching is very draining. By the end of the day, I'm basically exhausted. So the question comes back to: where does human time go? Mostly to switching between different terminal windows (tmux panes) to push agents forward.
So we did another thing: figure out how to remove humans from this loop. This led to Symphony—essentially a "crazy" sprint with the goal of not needing to sit in front of the terminal all the time. We tried many approaches, like dev boxes, auto-spawning agents, etc. The ideal state is simple: I open my computer twice a day, click "yes / no," then go lie on the beach.
One change this brings: I'm no longer sensitive to "latency," nor obsessed with code itself. I almost didn't participate in the "creation process" of code, so if what's generated is garbage, I can throw it away without hesitation. In Symphony there's a "rework" state: when PR is submitted and handed to human review, this review should be very lightweight—either can merge or can't. If not, send back to rework, then the Elixir service directly deletes the whole worktree and PR, starting from scratch.
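The rework flow Ryan describes can be sketched as a tiny state machine. This is a hypothetical Python illustration, not the team's actual Elixir implementation; the state names, the task dictionary shape, and the helpers are all assumptions.

```python
from enum import Enum, auto

class TaskState(Enum):
    RUNNING = auto()
    IN_REVIEW = auto()
    MERGED = auto()
    REWORK = auto()

def review(pr_mergeable: bool) -> TaskState:
    # Review is intentionally binary: either the PR can merge, or it
    # goes back to rework. There is no line-by-line human review.
    return TaskState.MERGED if pr_mergeable else TaskState.REWORK

def on_rework(task: dict) -> dict:
    # On rework the generated artifacts are disposable: delete the
    # worktree and the PR, record why the output was rejected, and
    # restart the task from scratch with that reason as new context.
    task["attempts"] += 1
    task["context"].append(task.pop("rejection_reason", "unspecified"))
    task["state"] = TaskState.RUNNING
    return task

task = {"attempts": 0, "context": [], "rejection_reason": "flaky tests",
        "state": TaskState.IN_REVIEW}
task["state"] = review(pr_mergeable=False)
if task["state"] is TaskState.REWORK:
    task = on_rework(task)
```

The point of the sketch is the asymmetry: rejection is cheap because nothing human-authored is lost, and the only thing carried forward is the reason for rejection.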
At that point the key question is: why was it garbage? What did the agent get wrong? Fix those issues first, then run the task again.
Personally Pessimistic About MCP
Host: Why aren't these capabilities in the Codex App?
Ryan Lopopolo: Our team has been "running ahead of the product," exploring AI-first workflows as aggressively as possible. Many of the things we build later land in the official product: the Codex App, skills, automation capabilities. We've been deeply involved in all of them, but our advantage is that we aren't bound by the product rhythm: we can experiment quickly, then distill the results into scalable solutions.
This approach is interesting but also chaotic. I often have no idea what the real current state of the codebase is, because I'm not in the loop.
For example, the team once connected Playwright directly into the Electron app through MCP. I'm actually pessimistic about MCP, because it forces a huge number of tokens into the context, interferes with autocompaction, and agents may even forget how to use the tools. In practice, the Playwright calls I need might amount to just three kinds.
Later someone wrote a local daemon that wraps Playwright and exposes a minimal CLI. I had no idea this had happened; from my perspective, I just ran Codex and it got stronger.
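A minimal sketch of that pattern, in hypothetical Python: instead of handing the agent a full MCP tool schema, expose only the few operations it actually uses. The `browser` command name and the three subcommands are assumptions for illustration, not the team's actual tool.

```python
import argparse

# Hypothetical minimal CLI over a local browser daemon: instead of an
# MCP tool schema (a large token payload injected into the agent's
# context), only the handful of operations agents actually use.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="browser")
    sub = parser.add_subparsers(dest="command", required=True)
    goto = sub.add_parser("goto", help="navigate to a URL")
    goto.add_argument("url")
    click = sub.add_parser("click", help="click an element")
    click.add_argument("selector")
    text = sub.add_parser("text", help="read text from an element")
    text.add_argument("selector")
    return parser

# The agent invokes it like any other shell command.
args = build_parser().parse_args(["click", "#submit"])
```

The entire surface the agent has to remember is three subcommands and their `--help` text, rather than a tool manifest that competes with the task for context space.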
So at the human level we have to spend a lot of time on synchronous information sharing. Our daily standup takes 45 minutes, because we need to "broadcast" the current state of the system.
"10,000 Engineer Scale" Code Architecture
Host: This is fine for single person + multi-agent, but multi-human + multi-agent combinations become very complex.
Ryan Lopopolo: Exactly, and that's why we adopted a "10,000-person-scale" design for the code architecture.
Host: What does "10,000 person scale" mean?
Ryan Lopopolo: Our repo is split into roughly 500 npm packages. For a 7-person team, that architecture is severely "over-engineered." But if you view each person as 10 to 50 agents, the scale makes sense, and deep decomposition, module isolation, and interface boundaries become very important.
OpenAI Should Build a "Slack"
Host: You use Linear for issue management?
Ryan Lopopolo: Right, and we also use Slack heavily. For low-complexity fix tasks, we trigger Codex directly from Slack and sync the resulting knowledge back into the codebase.
Honestly, my biggest takeaway is that OpenAI should build a "Slack." If AI is going to do economically valuable work, it has to collaborate naturally with humans, and that requires new collaboration tools.
The Whole Codebase Has Only 6 Skills
Host: Codex has now gone from model to CLI to App and can run multiple agents in parallel, but team collaboration is still missing. How do you think future tools will evolve? Will each team build its own, or will universal solutions emerge?
Ryan Lopopolo: There's no universal answer yet, but I have a leaning: keep the code structure and the process structure consistent. Code itself is context; it is the prompt. If every module is structured differently, agents have to switch context constantly and efficiency drops.
The same logic applies to skills. Our whole codebase has only six skills. If some dev process isn't covered, our first reaction isn't to add a new skill but to fold it into an existing one. The benefit: changing agent behavior is cheaper than changing human behavior.
Layer 0: Skill Distillation Mechanism
Host: Do you let agents change their own behavior? Like self-optimization?
Ryan Lopopolo: Yes. We have a "skill distillation" mechanism. For example, you can have Codex review its own session log and then ask: how should I use you better? What new skills are needed? It's a kind of self-reflection.
But more importantly, we aggregate all of this data (everyone's sessions, PR comments, failed builds) and run agents daily to analyze: how can we do better overall? Then we feed those improvements back into the codebase.
In other words, everyone's experience automatically becomes team capability.
PR comments and build failures are really signals that the agent lacked context at some point. Our job is to put that missing information back into the system.
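That aggregation step can be sketched as a small signal counter. This is a hypothetical Python illustration, not the team's actual pipeline; the module names, the tuple shape, and the `aggregate_signals` helper are all assumptions.

```python
from collections import Counter

def aggregate_signals(sessions, pr_comments, failed_builds):
    # Each signal is evidence the agent lacked context somewhere.
    # Group signals by the module they touched so a daily analysis
    # agent can propose targeted fixes (a doc, a lint, a skill tweak).
    counts = Counter()
    for module, _detail in sessions + pr_comments + failed_builds:
        counts[module] += 1
    return counts.most_common()

signals = aggregate_signals(
    sessions=[("billing", "retried flaky migration 3x")],
    pr_comments=[("billing", "wrong invoice rounding"),
                 ("auth", "missing test")],
    failed_builds=[("billing", "type error in invoice.ts")],
)
```

Ranking by module surfaces where context is most consistently missing, which is where a distilled improvement pays off for every future agent run.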
Host: I do something similar. Every time I finish a task with AI tools, I ask: could I do better next time? It's really a kind of meta reflection.
Ryan Lopopolo: Exactly. Essentially you can view Symphony as a multi-level "reflection system," with this as a kind of "layer zero." The six layers above it are: policy, configuration, coordination, execution, integration, and observability.
We've discussed several of these, but layer zero is more about asking: is our current way of working running well? Can we improve it? For example, can it modify its own workflow.md? I'm not sure.
Host: Yes, of course.
Ryan Lopopolo: And the system can even create tickets for itself, because we gave it full permissions. Yes, it can create a "create ticket" ticket for itself, and even write into the ticket whatever follow-up work it needs. Self-modification.
So, don't lock agents in boxes. Give them full access in their domain.
Host: You just said "don't put agents in boxes"; my first thought was that you actually should put them in a box. It's just that the box has to provide everything they need.
Ryan Lopopolo: Right, context and tools.
Want Agents to Form Complete Loops—Reduce Cloud Dependencies
Host: Exactly. But as developers we're used to calling all sorts of external systems. Here you use open-source tools like Prometheus, running locally, to close the loop, right?
Ryan Lopopolo: Right. I think you should minimize cloud dependencies as much as possible, and think seriously about what agents can access. What can they see? Is that information fed back into the loop? At the most basic level, if you let an agent see its own call traces, it can judge where it went wrong. But do you actually feed those back? You need clear visibility into inputs and outputs: can the agent access those outputs?
Agents can self-improve in many ways. Essentially it's all text, right? My job is figuring out how to flow text from one agent to another. Interestingly, when this AI wave started, Andrej Karpathy said "the hottest new programming language is English." It has now come true.
A lot of software was originally designed for humans, with GUIs. But now we're seeing CLIs everywhere: almost every tool has one, and agents use them well.
The key questions next: will we get better visual capabilities? Better small sandboxes? For now, though, this approach works very well. Models love using tools and reading text, so just giving them a CLI and letting them run with it works for almost everything.
How OpenAI Handles Non-Text Content Internally: Hybrid Approach
Ryan Lopopolo: Right. We're also converting some non-text content into this form to improve model performance. For example, we want agents to "see" the UI, but not perceive it visually the way humans do. The agent won't see a red box; it sees semantics like "red box button" and understands them in latent space.
So if we really want it to understand layout, it's sometimes easier to rasterize the image into ASCII and feed that in. The two approaches can also be used together, further improving the model's understanding of what it's operating on.
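The rasterization idea can be illustrated with a tiny Python sketch that assumes nothing about the team's actual tooling: map each grayscale pixel onto a density ramp so page structure survives as plain text the model can read.

```python
def to_ascii(pixels, charset=" .:-=+*#%@"):
    # Map grayscale intensity (0 = black, 255 = white) onto a density
    # ramp: dark pixels become dense glyphs, light pixels become
    # whitespace, so the layout survives as plain text.
    step = 256 / len(charset)
    return "\n".join(
        "".join(charset[min(int((255 - p) / step), len(charset) - 1)]
                for p in row)
        for row in pixels
    )

# A hypothetical 3x8 pixel patch: a dark button on a light background.
patch = [
    [250] * 8,
    [250, 30, 30, 30, 30, 30, 30, 250],
    [250] * 8,
]
art = to_ascii(patch)
```

Even at this crude resolution, the text output preserves where the button sits relative to the background, which is the kind of layout signal the model can reason about in-context.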
Coordination Layer Is Especially Hard to Get Right
Host: Want to discuss a few more layers? Any you're particularly interested in?
Ryan Lopopolo: I think the "coordination layer" is especially hard to get right.
Host: Let's talk about that.
Ryan Lopopolo: This is also where Temporal is core. The key is that when we convert a spec into an Elixir implementation, the model can "take shortcuts," because it has a set of primitives available in a runtime with native process supervision. I think that's an elegant way to map spec to implementation.
It's like full-stack web development, where you lean toward a TypeScript repo so the frontend and backend can share types and reduce complexity. It's like GraphQL back in the day.
And since no human is involved here, whether I personally know Elixir has no bearing on whether we choose the most suitable tool, which is actually quite crazy.
Integration Layer: "MCP is Already Dead"
Host: Interesting. I wonder whether different languages will perform differently in this paradigm. Some might be slower, some more bug-prone, like needing occasional server restarts.
Ryan Lopopolo: Possibly. I think the observability layer is relatively mature. As for the integration layer... MCP is already "dead." But overall it's an interesting layered system that you can traverse up and down; it gives developers working in the system a common language.
The policy layer is also cool. You don't need much code to ensure the system waits for CI to pass; it's essentially your "institutional knowledge." You just give the agent the GitHub CLI, add "CI must pass," and that's enough.
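A minimal sketch of that policy check, in hypothetical Python. The `gh pr checks` command is real GitHub CLI and exits non-zero when required checks are failing or still pending; the `ci_passed` wrapper and its injectable `runner` parameter are assumptions for illustration (the runner lets the policy be exercised without a live `gh` installation).

```python
import subprocess

def checks_command(pr_number: int) -> list:
    # The whole "CI must pass" policy reduces to one GitHub CLI call:
    # `gh pr checks` exits non-zero while any check is failing or
    # pending, so the exit code alone answers the policy question.
    return ["gh", "pr", "checks", str(pr_number)]

def ci_passed(pr_number: int, runner=subprocess.run) -> bool:
    result = runner(checks_command(pr_number),
                    capture_output=True, text=True)
    return result.returncode == 0
```

Because the answer is an exit code rather than a log to parse, the policy costs the agent almost no tokens to enforce.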
CLI's Advantage Is "Token Efficiency"
Host: Do you think CLI maintainers need to specially optimize for agents?
Ryan Lopopolo: Not really. The GitHub CLI certainly wasn't designed with this usage in mind, but it already works great.
A CLI's advantage is token efficiency, and it's easy to optimize further. Look at Buildkite or Jenkins logs: usually a pile of output. To help human developers, a Dev Productivity team will often write code that extracts the key exceptions from the logs and puts them at the top of the page.
CLIs should be designed the same way. You don't need to tell the agent that every file is formatted; it only cares whether everything is formatted, so it can decide whether to run a write operation.
Similarly, when running scripts across packages with pnpm, the output is huge, but most of it is just test logs. We eventually wrote a wrapper layer to suppress the irrelevant output and keep only the failures.
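A hypothetical Python sketch of such a wrapper (the failure markers and log format are assumptions; real test runners vary): keep only the lines that signal a failure and collapse everything else into a single summary.

```python
def keep_failures(log: str) -> str:
    # Drop the bulk of passing-test output so the agent spends its
    # tokens on the lines that actually require action.
    markers = ("FAIL", "ERROR")
    lines = [ln for ln in log.splitlines()
             if any(m in ln for m in markers)]
    return "\n".join(lines) if lines else "all checks passed"

# Hypothetical raw runner output: mostly noise, two useful lines.
raw = "\n".join([
    "PASS packages/auth/login.test.ts",
    "PASS packages/auth/logout.test.ts",
    "FAIL packages/billing/invoice.test.ts",
    "  ERROR expected 42, received 41",
])
summary = keep_failures(raw)
```

The fully-green case compresses to a single line, which mirrors Ryan's point: the agent only needs to know "is it formatted," not the evidence for every file.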
Host: Right, and you can pipe the error stream separately.
Ryan Lopopolo: Right, though that's getting into engineering details. I've maintained CLIs before, so this resonates with me.
Host: Anything else to add?
Ryan Lopopolo: This spec is long and makes many strong assumptions. But the key is that you can use it as-is, or adapt it to your own needs.
Essentially, software is more flexible when it can adapt to its deployment environment. The Linear and GitHub integrations written into the spec aren't mandatory; you can swap in Jira or Bitbucket.
What matters is that it defines a few core things clearly, like ID formats and the agent-loop mechanics, so you can quickly stand up a complete system and then evolve it gradually.
We never intended it to be a static, unmodifiable spec; it's more of a blueprint that gets you running first and lets you optimize later. And much of it is really prompts, just very long ones. Agents are fundamentally good at following instructions, so the clearer the instructions, the better the reliability.
With Symphony, for example, we don't want humans staring at agents while they work, so we define success criteria very strictly, which raises deployment success rates and reduces support tickets.
Run 4 Codex in Parallel—Design System to Auto-Stay on Right Path
Host: This goes back to "disposability." Before, a Codex task might run for two hours and you'd stare at it, afraid it would go off track. Now you can just run four in parallel.
Ryan Lopopolo: Right, that's what I love most about the Codex App: straight 4x parallelism. No problem: one of them will be right, and it might even be better.
Don't overthink it. My earliest example is actually Deep Research. When it first launched, I asked it an LLM question; it decided it was a legal question and spent an hour writing a completely off-track report. My instinct was "I need to watch this thing," but that instinct is wrong: you shouldn't watch it. You should design the system to stay on the right path automatically, not babysit it yourself. When you get a bad Deep Research result, you realize you need to adjust the prompt, right? That's your guardrail feeding back into the system, further aligning agent execution. The same thinking applies here.
God-Level CLI: Remove Humans from the Loop
Host: By the way, what's Symphony's customer feedback?
Ryan Lopopolo: I don't think there are real "customers" yet, since it's something we use internally. As long as you're satisfied, you're the customer.
Host: What's the external view?
Ryan Lopopolo: Everyone's very excited about this way of distributing software and ideas at low cost. For us as users, productivity went up another 5x. It shows there's a sustainable pattern: remove humans from the loop while building trust in the output.
The video shown here, for example, is exactly what we expect coding agents to attach when they create a PR; it's part of building trust. Fundamentally, the most interesting thing about this system is that it makes agents feel more like teammates collaborating with you. I won't stare at every ticket you handle in a week, and I don't want a full screen recording of you working in Cursor or Claude Code. I just want you to prove, in whatever way suits you, that the code is reliable and mergeable, compressing the whole process into a result that I, as the reviewer, can understand.
This is actually quite natural. And you can do this, because systems like Codex are really strong.
Host: Right, like FFmpeg is a "god-level CLI tool."
Ryan Lopopolo: Right, like a Swiss Army knife. I used to say every FFmpeg flag could be a mini SaaS: put a UI on it, make it a service, and people who can't use FFmpeg will pay.
When we started these experiments, there was a strong futuristic feel: windows constantly popping up on screen, files appearing on the desktop, the system controlling your computer and doing genuinely productive work. My job was basically keeping the machine awake, occasionally moving the mouse.
Host: Many office workers actually do this, buy "mouse movers."
Ryan Lopopolo: Exactly.
How OpenAI Internally Uses Codex Spark Model
Host: I want to ask: now that models are strong enough that you can asynchronously throw a bunch of agents at a problem, what do you think of the Spark models (note: a lightweight version of the Codex model)? Take 5.3 Spark: it's better for quick small fixes, like changing one line or switching a color. I don't want to open an IDE, but I can have it help. But am I still the bottleneck? Why not hand this off to the system to handle automatically too?
Ryan Lopopolo: Spark really is a completely different model category: different architecture, no support for complex reasoning, but extremely fast. Honestly, I haven't fully figured out how to use it. I initially gave it tasks meant for high-reasoning models, and it burned a lot of context before actually writing any code.
By the way, something like 5.4's million-token context is crucial for agents: you can run longer before compacting the context, and more tokens means better results.
As for Spark, I think your intuition is right: it's good for quick prototypes, exploring ideas, and updating docs. For us, it's very good at converting feedback into lints (like ESLint rules), because we have mature infrastructure there. It handles those tasks well, unblocks people quickly, and is also good for "self-healing" code maintenance.
Current Models: Can't Yet Generate a Usable Product in One Go
Complex Task Refactoring Still Needs Human Intervention
Host: You're actually pushing models to their limits. What can't models do well now?
Ryan Lopopolo: They still can't go from a brand-new product idea to a usable prototype in one shot (zero to one). That's where I spend most of my time intervening: turning an idea with no existing interface into an operable product.
Another hard area is complex refactoring. Models improve with every update, but the most complex refactoring tasks still need my frequent intervention; I even have to build extra tooling to break down monolithic systems.
But I expect these to keep improving. In just one month we've gone from handling only low-complexity tasks to handling large-scale, low-complexity tasks. That's why you shouldn't underestimate models: they keep expanding into higher-complexity spaces.
So the right strategy isn't to fight model capability but to design systems around it, letting the model take on increasingly complex work while you gradually retreat to higher-level problems.
Models and Existing Agent Products Solve Two Different Problems
Host: It sounds like the task types differ, too. Codex is good at understanding existing codebases, while companies like Lovable, Bolt, and Replit solve the zero-to-one scaffolding problem: turning ideas directly into products.
Ryan Lopopolo: Right, and those are two different problems. Models are making step-change improvements here, but they're still distinct from today's software-engineering agents.
I often say models are roughly "isomorphic" to me in capability; the only difference is that I have to figure out how to turn what's in my head into context the model can understand. And I'm not good at these "white space" projects myself either.
So during agent execution I often realize halfway through what's missing, which is why I need synchronous interaction. But a better harness or scaffold that guides me and narrows the possibility space (strongly constrained frameworks, provided templates) can give models more of the non-functional-requirement context and keep the results from diverging.
OpenAI's ToB Product: Frontier
Host: Let's talk about Frontier to wrap up.
Ryan Lopopolo: Sure, though I can't detail the roadmap here. Frontier is the platform we're building to drive enterprise AI transformation, for companies large and small. The core goal: let enterprises easily deploy agents that are highly observable, secure, controllable, and identifiable.
It needs to plug into enterprise IM systems, integrate security tools, and connect to workflow tools. Essentially, you're distributing specs. We expect some harness components here; the Agent SDK is the core, giving startups and enterprise developers an out-of-the-box execution environment that fully exploits model capabilities: shell tools, a Codex-style execution environment, file attachments, containers, and so on.
Our goal is to make these capabilities good enough while combining them safely. For example, one cool thing about GPT OSS's safeguard model is its built-in ability to interface with "security specs," and security specs are often enterprise-customized.
We have a responsibility to help these companies add controls to their enterprise agents, preventing data exfiltration according to their own concerns, like recognizing internal codenames. So the key is providing enough hooks to make the platform customizable while keeping the defaults as out-of-the-box as possible. That's the space we're exploring.
Frontier's Buyers and Two Types of Users
Host: Right, this is actually what companies like Snowflake, Brex, and Stripe urgently need.
I want to go back to your demo video, which showcases a large-scale agent-management scenario well: you give users a control panel, and when running multiple agents you can drill down into individual instances to see what they're doing. But who is this product's user? The CEO, CTO, or CIO?
Ryan Lopopolo: My personal view is that this product has one kind of buyer but two kinds of users. The first are the employees who actually use these agents to boost productivity; they see the surfaces where agents appear, the available connectors, and so on.
The control panel, on the other hand, is more for IT, GRC (governance, risk, and compliance), AI innovation offices, and security teams: the people responsible for safely deploying agents into employees' work environments while meeting regulatory and customer compliance commitments.
So it's more like an iceberg: employees see what's above the waterline, and there's a whole layer underneath.
Host: Right, each UI layer corresponds to different agent abstraction levels.
Ryan Lopopolo: Exactly. And being able to drill down to the level of an individual agent's execution trajectory is crucial, not just for security but for the people responsible for developing "skills."
We previously shipped an internal data agent that uses a lot of Frontier's capabilities, letting agents understand our data ontology so they know what's actually in the data warehouse.
Host: Like a semantic layer? I've touched this area a little; honestly, even humans struggle to unify the definitions. How is revenue calculated? What counts as an active user? In one company, five data scientists might have five different definitions. There's even internal politics: marketing says "I contributed this," sales says "that's mine," and it adds up to over 100%.
Ryan Lopopolo: Right, exactly.
Host: And in startups, everything counts as ARR (laughs). This is indeed interesting.
Ryan Lopopolo: Right. We actually wrote blogs about this.
Feed Agents Business Data, Even Company Culture
Host: Then people can go read those. In any case, the data-as-feedback layer is crucial: you have to solve that problem first to close the product feedback loop.
Ryan Lopopolo: Exactly. For agents to understand the business, they have to know what revenue is, how users are segmented, what the product lines are. As we mentioned with the harness codebase, there's a core_beliefs.md covering who the team is, what the product is, the target customers, the pilot customers, and the vision for the next 12 months, all of which is important context when building software.
Host: So these need to be fed to agents too?
Ryan Lopopolo: Right.
Host: And these change dynamically, right? Not a static spec.
Ryan Lopopolo: Right, it keeps iterating.
Another potentially "mind-blowing" bit: we even gave agents a "skill" for generating deep-fried memes and for Slack reaction culture, because through Slack + ChatGPT + Codex I can have agents send messages on my behalf.
Humor is actually part of AGI.
GPT-5.4 Already Knows How to Meme
Humor is Part of AGI
Host: Is it funny?
Ryan Lopopolo: Pretty good. Humor is actually a hard capability, because you have to compress a lot of context into very few words. That's why 5.4-class models are a big upgrade for us, at least in meme-ing.
Host: Got it, conclusion: 5.4 can make us better at memes (laughs).
You can try letting Codex review your agent logs, then roast you.
Ryan Lopopolo: Haha, we can try.
Frontier Targets Scaled Agent Deployment
Host: Back to the last question. I think what you're doing is a model every company should adopt, whether or not they use your product. My first reaction was: every company needs this.
Ryan Lopopolo: Right.
Host: It sounds "boring" (security, compliance), but if you really want to manage agents at scale, these things are necessary. Your dashboard is like my initial understanding of Temporal: a panel showing all the long-running processes in the company.
Ryan Lopopolo: Right, exactly. It will be highly customized, since every company has different priorities, but I think many companies will build out this layer in the future.
Host: I'm now fully a Frontier "believer" (laughs). I saw Frontier first, then the harness and Symphony, and finally realized: this is how you deliver these capabilities.
Ryan Lopopolo: Right. We assemble a series of "building blocks" into these agents, and the blocks themselves are part of the product. For example, you can control agent behavior and revoke permissions when a model goes rogue, all provided through Frontier.
Companies have many different roles, and each can see the information it needs on this platform, driving large-scale agent deployment.
Host: This reminds me of OpenAI's "AGI five stages"—one stage is "AI organizations." This is basically that direction.
Ryan Lopopolo: Yes. One thing our team is doing now is collecting Codex agent execution trajectories, distilling them into a team-level knowledge base, and feeding that back into the codebase. But this doesn't have to be tied to Codex; I hope ChatGPT can also learn our meme culture, product logic, and ways of working, so that when I ask it something, it has complete context.
I'm very excited Frontier can achieve this.
OpenAI Internally Debating: Train These Capabilities Directly into Models
Host: What does the model team think when they see how you use all this? You have a lot of usage data and trajectories.
Ryan Lopopolo: There is indeed a core tension: should we keep strengthening the harness (the systems-engineering layer), or train these capabilities directly into the models so that they do these things by default?
I think the "success" of our working style means models will gradually develop better taste, because we can point them in the right directions. And the things we've built won't lower agent performance: essentially they just make agents run tests, and running tests is already part of writing reliable software.
If we built a whole ROS-style scaffold around Codex to forcibly constrain its output, that extra harness might eventually be deprecated. But if we can build these guardrails directly on top of Codex's native output, that is, code itself, then there's no friction: the model keeps evolving unimpeded, and it's good engineering practice besides. That's the key.
Host: I've discussed similar issues with research scientists, like on-policy vs. off-policy in reinforcement learning. You mean we should build an on-policy harness, one that augments within the model's own distribution, rather than an off-distribution external system?
Ryan Lopopolo: Exactly.
One Version Per Month, Codex Team Insanely Shipping
Host: Interesting. Anything else we haven't discussed but you think worth adding?
Ryan Lopopolo: I'm constantly excited to benefit from the Codex team's continuous "high-intensity iteration." Their core engineering culture is "ship relentlessly," and they really live it. From 5.3 to Spark to 5.4 felt like about a month. Amazing speed.
Host: Right, a month ago was 5.3, yesterday already 5.4. So next month 5.5?
Ryan Lopopolo: Haha, I can't say that, but prediction market folks might get excited.
Host: But interestingly, this tracks with growth; they say 2 million users now. In a way, though, you no longer care just about Codex itself but about the bigger picture: coding is just the entry point; the real goal is all "knowledge work."
Ryan Lopopolo: Exactly, that's the core direction, and it's what our team is working hard to support: getting the self-hosted harness running. The next step is actually "doing things."
OpenAI Expanding Beyond San Francisco to Seattle, London
Host: Anything else you want to say? You're in Seattle, right?
Ryan Lopopolo: Right, we have a new office in Bellevue that opened the day before this recording. Great space; I'm happy to be building the future in Washington State.
What I want to say is, there's still a lot of work to do serving enterprise customers at Frontier, and we're hiring. If you haven't tried the Codex App, I suggest downloading it. We just hit 2 million weekly active users and are growing 25% week over week, which is very fast. Come join us.
Host: One observation: OpenAI used to be very San Francisco-centric; many people passed on joining because they didn't want to move to SF. Now you're expanding to London and Seattle. Will that change the company culture?
Ryan Lopopolo: I was one of the first engineers in the Seattle office, so it feels very natural to me. It's a direction I've been pushing, and it's developing well. We've established stable product lines there, alongside plenty of zero-to-one innovation projects.
This is actually our core way of doing "applied AI": quick sprints, constant exploration, seeing where models can truly land.
We also have an office in New York, with a decent-sized engineering team.
Host: Got it; that's basically my "AI pilgrimage map": wherever engineers are being hired, I'll go.
Ryan Lopopolo: The New York office is nice too (formerly REI's office). But space in New York is limited; it's hard to have large offices like Seattle's. The Seattle office has a kind of "Mad Men" style, very beautiful; the new Bellevue office is green-themed with metal accents, a very Pacific Northwest feel.
Host: Indeed, some people just like the New York vibe.
Ryan Lopopolo: Right. Our workspace team does great work; I'm lucky to work here.
Host: OK, thank you very much for sharing today. Very solid content, and you really are "shipping insanely."
Ryan Lopopolo: Great talking, have a nice Friday.
Host: Happy Friday.
Alright, that's the end of the article. Happy weekend to all the big shots!