Just now, the OpenAI developer team published a blog post that directly dissects the "kernel" of Codex CLI.
The article comprehensively explains the entire process from user input to final response, including core details such as prompt construction, model inference, tool calls, and context management.
This article is worth reading for developers who want to understand the underlying mechanisms of AI programming assistants, engineers building their own Agent systems, and anyone curious about Codex.
Full Text Below
Have you ever wondered what happens between sending a command to Codex and receiving the result?
Each round of dialogue goes through: assembling input, running inference, executing tools, and feeding the results back into the context.
This loop continues until the task is complete.
This is the first part of OpenAI's Codex technical disclosure series, and OpenAI has promised to release more content in the future.
What is the Agent Loop?
The Agent Loop is the core logic of Codex CLI, responsible for coordinating interactions between the user, the model, and the tools.
Simply put, the process is as follows:
The Agent receives the user's input and assembles it into instructions for the model (i.e., the prompt).
The next step is inference: sending the prompt to the model to generate a response. During inference, the text prompt is first converted into a sequence of tokens (integer indices), and then the model samples based on these tokens to produce a new token sequence.
The output tokens are then converted back to text, which is the model's response. Since tokens are generated one by one, many LLM applications support streaming output. You can see the answer appearing word by word.
After inference, the model either (1) directly gives the final answer or (2) requests the execution of a tool call (e.g., "run the ls command and tell me the result"). If it's the latter, the Agent executes the tool, appends the output to the original prompt, and then queries the model again.
This process will loop until the model no longer requests tool calls but instead produces a message for the user (referred to as an assistant message in OpenAI's terminology).
This message may directly answer the user's question or may ask the user a follow-up question.
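The loop described above can be sketched in a few lines of Python. Note that `call_model` and `run_tool` are hypothetical stand-ins for the Responses API call and local tool execution, not Codex's actual internals:

```python
# Minimal agent-loop sketch. call_model and run_tool are hypothetical
# stand-ins for the Responses API call and local tool execution.

def agent_loop(user_message, call_model, run_tool):
    # The prompt is a growing list of entries (messages and tool events).
    prompt = [{"role": "user", "content": user_message}]
    while True:
        response = call_model(prompt)          # inference
        prompt.append(response)                # model output joins the context
        if response["type"] == "function_call":
            output = run_tool(response)        # execute the requested tool
            prompt.append({"type": "function_call_output",
                           "call_id": response["call_id"],
                           "output": output})
        else:
            return response                    # assistant message ends the turn
```

The key property is that every iteration appends to `prompt` rather than replacing it, which is what makes the conversation history grow.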
From user input to Agent response, this process is called one "turn" of dialogue (in Codex, the whole conversation is called a "thread").
Although it's one turn of dialogue, it may contain multiple iterations of "model inference → tool call".
Whenever a user sends a new message to an existing conversation, the previous conversation history (including previous messages and tool calls) will be part of the new prompt.
This means the longer the conversation, the longer the prompt becomes.
This is crucial because every model has a context window, which is the maximum number of tokens that can be processed in a single inference. Note that this window includes both input and output tokens.
An Agent may call hundreds of tools in a single conversation, easily filling up the context.
Therefore, context window management is one of the important responsibilities of an Agent.
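A rough budget check illustrates the idea. The window size and the four-characters-per-token ratio below are ballpark assumptions, not the real tokenizer:

```python
# Rough context-budget check. Real agents use the model's tokenizer;
# the 4-characters-per-token ratio here is only a ballpark heuristic.

CONTEXT_WINDOW = 272_000   # example window size; varies by model
RESERVED_OUTPUT = 16_000   # leave room for the model's own output tokens

def estimate_tokens(entries):
    return sum(len(str(e)) // 4 for e in entries)

def needs_compaction(entries, threshold=0.8):
    budget = CONTEXT_WINDOW - RESERVED_OUTPUT
    return estimate_tokens(entries) > threshold * budget
```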
Model Inference
Codex CLI runs model inference by sending HTTP requests to the Responses API.
The Responses API endpoint used by Codex CLI is configurable, so it can be compatible with any endpoint that implements the Responses API:
- When logging in with ChatGPT, the endpoint is https://chatgpt.com/backend-api/codex/responses
- When authenticating with an API key, the endpoint is https://api.openai.com/v1/responses
- When running gpt-oss with the --oss parameter (with ollama 0.13.4+ or LM Studio 0.3.39+), it defaults to the local http://localhost:11434/v1/responses
- It can also use a Responses API hosted by cloud service providers such as Azure
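For example, pointing Codex at a local Ollama endpoint might look like this in ~/.codex/config.toml (key names based on the Codex configuration docs; verify against your CLI version):

```toml
# Illustrative ~/.codex/config.toml pointing Codex CLI at a local
# Responses-compatible endpoint.
model = "gpt-oss:20b"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
```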
Building the Initial Prompt
As a user, you don't need to write the complete prompt yourself.
You only need to specify various inputs in the request, and the Responses API server will decide how to assemble this information into a prompt that the model can understand.
You can think of the prompt as a "list".
Each entry in the initial prompt has a role, indicating the authority of that content, from highest to lowest: system, developer, user, assistant.
The JSON accepted by the Responses API has many parameters, but the three most critical ones are:
- instructions: a system (or developer) message inserted into the model's context
- tools: a list of tools that the model can call
- input: a list of text, image, or file inputs sent to the model
In Codex, the instructions field comes from the model_instructions_file in ~/.codex/config.toml. If not configured, it uses the model's built-in base_instructions.
Different model instruction files are packaged in the CLI (e.g., gpt-5.2-codex_prompt.md).
The tools field is a list of tool definitions, including tools built into Codex CLI, tools provided by the Responses API, and tools configured by the user through the MCP server:
```json
[
  // Codex's built-in shell tool for executing commands locally
  {
    "type": "function",
    "name": "shell",
    "description": "Runs a shell command and returns its output...",
    "strict": false,
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "array", "description": "The command to execute", ...},
        "workdir": {"description": "The working directory...", ...},
        "timeout_ms": {"description": "The timeout for the command...", ...},
        ...
      },
      "required": ["command"]
    }
  },
  // Codex's built-in plan tool
  {
    "type": "function",
    "name": "update_plan",
    "description": "Updates the task plan...",
    "strict": false,
    "parameters": {
      "type": "object",
      "properties": {"plan": ..., "explanation": ...},
      "required": ["plan"]
    }
  },
  // Web search tool provided by the Responses API
  {
    "type": "web_search",
    "external_web_access": false
  },
  // User-configured MCP server, e.g., weather query
  {
    "type": "function",
    "name": "mcp__weather__get-forecast",
    "description": "Get weather alerts for a US state",
    "strict": false,
    "parameters": {
      "type": "object",
      "properties": {"latitude": {...}, "longitude": {...}},
      "required": ["latitude", "longitude"]
    }
  }
]
```

The input field is a list of entries. Before adding the user message, Codex inserts the following:
1. A message with role=developer describing the sandbox environment. This applies only to the shell tool built into Codex; tools provided by MCP servers are not protected by the Codex sandbox and must implement their own security mechanisms.
This message is generated using a template, with core content coming from Markdown files packaged in the CLI (such as workspace_write.md and on_request.md):
```
<permissions instructions>
- Sandbox explanation, describing file permissions and network access
- When to request user permission to execute shell commands
- List of folders Codex can write to (if any)
</permissions instructions>
```

2. (Optional) A message with role=developer whose content is the developer_instructions from the user's config.toml.
3. (Optional) A message with role=user, i.e., the "user instructions". This is not from a single file but aggregated from multiple sources, with more specific instructions appearing later:
- Content from AGENTS.override.md and AGENTS.md in the $CODEX_HOME directory
- From the Git/project root directory down to the current directory, each folder is searched (subject to a 32 KiB limit) for AGENTS.override.md, AGENTS.md, or files specified by project_doc_fallback_filenames
- If skills are configured:
  - A brief description of the skills
  - Metadata for each skill
  - Instructions on how to use the skills
4. A message with role=user describing the local environment where the Agent is currently running, including the current working directory and the user's shell:

```
<environment_context>
  <cwd>/Users/mbolin/code/codex5</cwd>
  <shell>zsh</shell>
</environment_context>
```

Once the above entries are assembled, Codex appends the user message to input to start the conversation.
Each input element is a JSON object containing type, role, and content:
```json
{
  "type": "message",
  "role": "user",
  "content": [
    {
      "type": "input_text",
      "text": "Add an architecture diagram to the README.md"
    }
  ]
}
```

After Codex builds the complete JSON, it sends an HTTP POST request to the Responses API (with the Authorization header and other headers and parameters specified in the configuration).
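Building that request is straightforward with the Python standard library. The model name and the OPENAI_API_KEY environment variable below are illustrative assumptions, not Codex's actual configuration:

```python
# Sketch of the HTTP request Codex sends each turn, using only the
# Python standard library.
import json, os, urllib.request

def build_responses_request(instructions, tools, input_items,
                            url="https://api.openai.com/v1/responses"):
    body = json.dumps({
        "model": "gpt-5.2-codex",   # example; Codex sets this from config
        "instructions": instructions,
        "tools": tools,
        "input": input_items,
        "stream": True,             # request an SSE stream
    }).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST", headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + os.environ.get("OPENAI_API_KEY", ""),
    })

# The caller passes the request to urllib.request.urlopen() and reads
# the Server-Sent Events stream from the response body.
```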
When the OpenAI Responses API server receives the request, it assembles the prompt from the JSON, placing the system message, tools, and instructions before the input entries.
The order of these first three items is determined by the server, not the client.
However, among these three, only the content of the system message is also controlled by the server; tools and instructions are determined by the client. The input from the JSON then follows, completing the prompt.
With the prompt, the model can be sampled.
First Round of Dialogue
The HTTP request sent to the Responses API starts the first "round" of the Codex conversation. The server returns the response in the form of a Server-Sent Events (SSE) stream. The data of each event is a JSON, and the type starts with response, which may look like this:
```
data: {"type":"response.reasoning_summary_text.delta","delta":"ah ", ...}
data: {"type":"response.reasoning_summary_text.delta","delta":"ha!", ...}
data: {"type":"response.reasoning_summary_text.done", "item_id":...}
data: {"type":"response.output_item.added", "item":{...}}
data: {"type":"response.output_text.delta", "delta":"forty-", ...}
data: {"type":"response.output_text.delta", "delta":"two!", ...}
data: {"type":"response.completed","response":{...}}
```

Codex consumes this event stream and republishes it as internal event objects for client use. Events like response.output_text.delta support streaming display in the UI, while events like response.output_item.added are converted into objects and appended to the input of subsequent Responses API calls.
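A minimal consumer of such a stream might look like this (a sketch, not Codex's actual event handling; only two event types are dispatched):

```python
# Sketch of consuming the SSE stream: each "data:" line carries a JSON
# event; deltas drive the streaming UI, completed items extend the
# input for the next request.
import json

def consume_sse(lines, on_delta, next_input):
    for line in lines:
        if not line.startswith("data:"):
            continue  # ignore comments and blank keep-alive lines
        event = json.loads(line[len("data:"):].strip())
        kind = event["type"]
        if kind == "response.output_text.delta":
            on_delta(event["delta"])          # stream text to the UI
        elif kind == "response.output_item.done":
            next_input.append(event["item"])  # feed back into next request
```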
Assuming the first request returns two response.output_item.done events: one type=reasoning and one type=function_call. When we query the model again, these events must be reflected in the input:
```json
[
  /* ... the original 5 entries in the input array ... */
  {
    "type": "reasoning",
    "summary": [
      {
        "type": "summary_text",
        "text": "**Adding an architecture diagram for README.md**\n\nI need to..."
      }
    ],
    "encrypted_content": "gAAAAABpaDWNMxMeLw..."
  },
  {
    "type": "function_call",
    "name": "shell",
    "arguments": "{\"command\":\"cat README.md\",\"workdir\":\"/Users/mbolin/code/codex5\"}",
    "call_id": "call_8675309..."
  },
  {
    "type": "function_call_output",
    "call_id": "call_8675309...",
    "output": "<p align=\"center\"><code>npm i -g @openai/codex</code>..."
  }
]
```

The prompt for subsequent queries is extended with these new entries.
Note that the old prompt is an exact prefix of the new prompt. This is intentional so that subsequent requests can utilize prompt caching to significantly improve efficiency (which will be discussed later).
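This append-only discipline can be stated as a tiny invariant: each request's input must be an exact extension of the previous one. A sketch of the check:

```python
# Prompt-caching invariant: appending entries preserves the cached
# prefix; editing an earlier entry breaks it.

def is_exact_prefix(old_input, new_input):
    return new_input[:len(old_input)] == old_input
```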
Looking back at our initial Agent Loop diagram, there may be many iterations between inference and tool calls. The prompt will continue to grow until we finally receive an assistant message, marking the end of this round:
```
data: {"type":"response.output_text.done","text": "I added a diagram to explain...", ...}
data: {"type":"response.completed","response":{...}}
```

In Codex CLI, we display the assistant message to the user and focus the input box, indicating that it's now the user's turn to continue the conversation. If the user replies, the assistant message from the previous round and the user's new message must be appended to the input of the new request:
```json
[
  /* ... all entries from the last Responses API request ... */
  {
    "type": "message",
    "role": "assistant",
    "content": [
      {
        "type": "output_text",
        "text": "I added a diagram to explain the client/server architecture."
      }
    ]
  },
  {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_text",
        "text": "That's not bad, but the diagram is missing the bike shed."
      }
    ]
  }
]
```

Because the conversation continues, the length of the input sent to the Responses API keeps growing.
Performance Considerations
You might ask: "Wait, isn't the amount of JSON sent by this Agent Loop throughout the conversation growing quadratically?"
That's right. Although the Responses API supports an optional previous_response_id to alleviate this problem, Codex currently does not use it, mainly to keep the request completely stateless and to support Zero Data Retention (ZDR) configuration.
Avoiding the use of previous_response_id simplifies the implementation of the Responses API provider because each request is stateless. This also makes it easier to support ZDR clients: storing the data required to support previous_response_id would contradict ZDR. ZDR clients can still benefit from the proprietary inference messages from previous rounds because the relevant encrypted_content can be decrypted on the server side. (OpenAI retains the decryption keys for ZDR clients but does not retain their data.)
Generally, sampling the model costs far more than network transmission costs, so sampling is the primary target for efficiency optimization. This is why prompt caching is so important: it allows us to reuse the computation from previous inference calls.
When the cache hits, sampling the model is linear rather than quadratic.
OpenAI's prompt caching documentation explains this:
Cache hits only work for exact prefix matches within the prompt. To achieve cache benefits, place static content (such as instructions and examples) at the beginning of the prompt and variable content (such as user-specific information) at the end. Images and tools must also be completely consistent across different requests.
Considering this, what operations might cause Codex's "cache miss"?
- Changing the available tools midway through the conversation
- Changing the target model of the Responses API request (this actually changes the third item of the original prompt because it contains model-specific instructions)
- Changing the sandbox configuration, approval mode, or current working directory
The Codex team must be cautious when introducing new features that may affect prompt caching. For example, there was a bug when initially supporting MCP tools: inconsistent tool enumeration order led to cache misses. MCP tools are particularly tricky because the MCP server can dynamically change the tool list via notifications/tools/list_changed. Responding to this notification midway through a long conversation could lead to expensive cache misses.
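One way to guard against such ordering bugs (a sketch of the general technique, not necessarily how the Codex team fixed it) is to serialize the tool list deterministically before it ever reaches the request:

```python
# Serialize the tool list in a stable order regardless of how MCP
# servers enumerate their tools, so the request prefix stays identical
# across turns.
import json

def stable_tools(tools):
    # Sort by name (built-in tools without one sort first by type).
    return sorted(tools, key=lambda t: (t.get("name", ""), t["type"]))

def tools_fingerprint(tools):
    # Identical fingerprints across requests keep the cached prefix valid.
    return json.dumps(stable_tools(tools), sort_keys=True)
```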
Whenever possible, we handle configuration changes midway through a conversation by appending new messages to the end of the input rather than modifying previous messages:
- If the sandbox configuration or approval mode changes, we insert a new role=developer message in the same format as the original <permissions instructions>
- If the current working directory changes, we insert a new role=user message in the same format as the original <environment_context>
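The append-only handling of a working-directory change can be sketched as follows (the message shape mirrors the <environment_context> example shown earlier; the helper names are hypothetical):

```python
# On a mid-conversation cwd change, append a fresh environment_context
# message rather than rewriting the earlier one, which would invalidate
# the cached prefix.

def environment_context_message(cwd, shell):
    text = (
        "<environment_context>\n"
        f"  <cwd>{cwd}</cwd>\n"
        f"  <shell>{shell}</shell>\n"
        "</environment_context>"
    )
    return {"type": "message", "role": "user",
            "content": [{"type": "input_text", "text": text}]}

def on_cwd_change(input_items, new_cwd, shell):
    input_items.append(environment_context_message(new_cwd, shell))
```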
We do our best to ensure cache hits to improve performance. There is another key resource that needs to be managed: the context window.
Our general strategy to avoid exhausting the context window is: when the number of tokens exceeds a certain threshold, compress the conversation. Specifically, we replace the input with a smaller list of entries that represent the original conversation, allowing the Agent to continue working with an understanding of what has already happened.
The early compression implementation required the user to manually execute the /compact command, which would query the Responses API using the existing conversation plus custom summary instructions, and then use the returned assistant message as the new input for subsequent conversation rounds.
Later, the Responses API evolved a dedicated /responses/compact endpoint to perform compression more efficiently. It returns a list of entries that can replace the previous input to continue the conversation while freeing up the context window. This list contains a special type=compaction entry with an opaque encrypted_content, preserving the model's implicit understanding of the original conversation. Now, when the auto_compact_limit is exceeded, Codex automatically uses this endpoint to compress the conversation.
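The trigger logic reduces to a simple threshold check. In this sketch, `compact()` is a hypothetical stand-in for the /responses/compact call:

```python
# Threshold-triggered compaction: when the token count crosses
# auto_compact_limit, replace the input with the smaller entry list
# returned by a compaction call.

def maybe_compact(input_items, token_count, auto_compact_limit, compact):
    if token_count <= auto_compact_limit:
        return input_items
    # The returned list includes a type=compaction entry whose
    # encrypted_content preserves the model's memory of the turn so far.
    return compact(input_items)
```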
Future Plans
OpenAI introduced Codex's Agent Loop and explained in detail how Codex builds and manages context when querying the model. It also shared practical considerations and best practices that apply to anyone building an Agent Loop on the Responses API.
Although the Agent Loop is the foundation of Codex, this is just the beginning. In subsequent articles, they will delve into the CLI architecture, the implementation of tool usage, and Codex's sandbox model.
Related Links:
Original blog: https://openai.com/index/unrolling-the-codex-agent-loop/
Codex CLI open source repository: https://github.com/openai/codex
Codex developer documentation: https://developers.openai.com/codex/cli
Responses API documentation: https://platform.openai.com/docs/api-reference/responses