Today, another round of major news has exploded... Can't a person get some sleep!
Within the past hour, OpenAI released GPT-5.3-Codex, and Anthropic released Claude Opus 4.6 (1 million token context), which tops benchmarks at an unchanged price.
Two heavy bombs landed almost simultaneously.
Agents built on these models are about to take off.
On the same day Anthropic released Claude Opus 4.6, OpenAI followed up with GPT-5.3-Codex, claiming it to be the strongest agentic coding model to date.
Sam Altman himself posted on X immediately:
GPT-5.3-Codex is here!
Top-tier coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld). Real-time guidance during task execution, providing live updates throughout the process. Faster! Token consumption for the same task is less than half of 5.2-Codex, and speed per token is over 25% faster! Strong computer use capabilities as well.
So, what makes GPT-5.3-Codex so strong? Let's dive deeper.
Helping Train Itself
GPT-5.3-Codex has a very "unconventional" feature: it participated in its own creation process.
During training, the OpenAI team used early versions of GPT-5.3-Codex to debug its own training runs, manage its own deployment, and diagnose evaluation results. In other words, this model helped "give birth" to itself.
OpenAI's research team used Codex to monitor and debug the training process for this release.
It could not only troubleshoot infrastructure issues but also track pattern changes during training, perform in-depth analysis of interaction quality, propose fixes, and even build visualization applications for researchers to precisely understand differences in model behavior.
The engineering team also used Codex to optimize and adapt the runtime environment for GPT-5.3-Codex.
When edge cases affecting users arose, team members directly had Codex locate bugs in context rendering and investigate the root causes of low cache hit rates. During the release period, GPT-5.3-Codex also helped the team dynamically scale GPU clusters to handle traffic peaks, maintaining stable latency.
A data scientist used GPT-5.3-Codex to build a new data pipeline and create visualizations far richer than standard dashboard tools, then had Codex analyze the results, extracting key insights from thousands of data points within three minutes.
Dominating All Benchmarks
GPT-5.3-Codex set new records on multiple benchmarks:
On SWE-Bench Pro, a rigorous evaluation of real-world software engineering ability, it scored 56.8%. Unlike SWE-Bench Verified, which tests only Python, SWE-Bench Pro covers four programming languages, is more resistant to data contamination, and is closer to industrial scenarios. GPT-5.2-Codex scored 56.4%; GPT-5.2 scored 55.6%.
On Terminal-Bench 2.0 it reached 77.3%, far exceeding GPT-5.2-Codex's 64.0%. This benchmark measures the terminal-operation capabilities coding agents need.
On OSWorld-Verified it achieved 64.7%, versus just 38.2% for GPT-5.2-Codex. OSWorld is an agentic computer-use benchmark covering productivity tasks in a visual desktop environment, so this is a dramatic jump.
On GDPval it achieved a win-or-tie rate of 70.9%, on par with GPT-5.2. GDPval, an evaluation OpenAI released in 2025, measures model performance on knowledge-work tasks across 44 professions, including creating presentations and processing spreadsheets.
On the Cybersecurity CTF Challenge it reached 77.6%, versus GPT-5.2-Codex's 67.4%.
On SWE-Lancer IC Diamond it achieved 81.4%, exceeding GPT-5.2-Codex's 76.0%.
Notably, GPT-5.3-Codex consumed fewer tokens to complete these tasks than any previous model. Stronger and more efficient - that's real capability.
More Than Just Writing Code
GPT-5.3-Codex's positioning is no longer just a code generation tool.
OpenAI describes the shift as moving from a coding agent to an agent that can do almost everything developers and professionals do on a computer.
Software engineers, designers, product managers, and data scientists do far more than just write code.
GPT-5.3-Codex is designed to support all work in the software lifecycle: debugging, deployment, monitoring, writing PRDs, editing copy, user research, testing, metrics analysis, etc. Its agentic capabilities even extend beyond the software domain, helping you create slides and analyze data in spreadsheets.
OpenAI combined cutting-edge coding capabilities, aesthetic improvements, and compression capabilities to create a model that can build highly functional complex games and applications from scratch within days.
To test long-running agentic capabilities, they had GPT-5.3-Codex build two games: a second version of a racing game and a diving game, using only generic follow-up prompts like "fix the bug" or "improve the game." GPT-5.3-Codex autonomously iterated through millions of tokens of interaction.
In web development, GPT-5.3-Codex also better understands user intent than its predecessor.
Simple or under-specified prompts now default to websites with more complete functionality and more sensible defaults, giving you a stronger starting point for implementing ideas. For example, when both generations were asked to build a landing page, GPT-5.3-Codex automatically displayed the annual plan as an equivalent discounted monthly price to make the saving clearer, and built an auto-rotating carousel of user testimonials instead of placing just one.
The out-of-the-box completeness of its output is significantly higher.
Working While Talking
As model capabilities grow stronger, the bottleneck has shifted from "what agents can do" to "how humans can conveniently interact with, guide, and supervise multiple parallel-working agents."
GPT-5.3-Codex made a key improvement in this regard: interactive collaboration.
Previously, you gave Codex a task and waited for the final result. Now it's different. GPT-5.3-Codex provides frequent updates during the work process, keeping you informed of key decisions and progress in real-time.
You can ask questions, discuss plans, and adjust directions at any time without losing context.
It tells you what it's doing, responds to your feedback, and involves you from start to finish.
It's more like collaborating with a colleague than giving commands to a machine.
This feature can be enabled in the Codex app via Settings > General > Follow-up behavior.
First "High Capability" Safety Rating
GPT-5.3-Codex is the first model under OpenAI's Preparedness Framework to be rated as "High Capability" for cybersecurity-related tasks, and also their first model directly trained to identify software vulnerabilities.
Although there is no conclusive evidence that it can fully automate cyberattacks end to end, OpenAI has taken preventive measures, deploying its most comprehensive cybersecurity safety stack to date: safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines incorporating threat intelligence.
Because cybersecurity is inherently dual-use, OpenAI has adopted an "evidence-based, iterative" approach, accelerating defenders' ability to discover and fix vulnerabilities while slowing down misuse.
Specific measures include:
Launching the Trusted Access for Cyber pilot program to accelerate cybersecurity defense research.
Expanding the private beta of Aardvark, a security research agent and the first product in the Codex Security suite.
Collaborating with open-source maintainers to provide free code repository scanning for widely used projects like Next.js. Last week, a security researcher used Codex to discover and disclose a vulnerability in Next.js.
Building on the $1 million cybersecurity grant program initiated in 2023, OpenAI has also committed to investing $10 million in API credits to accelerate cyber defense, particularly for open-source software and critical infrastructure systems.
Availability
GPT-5.3-Codex is now available to all ChatGPT paid users, covering all platforms where Codex is available: apps, CLI, IDE extensions, and the web. API access is being rolled out securely.
In terms of speed, it is 25% faster than GPT-5.2-Codex, and token consumption is less than half of its predecessor.
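Taken at face value, those two claims compound. A back-of-envelope sketch (our assumption, not OpenAI's statement: the per-token speedup and the token reduction compose multiplicatively) puts the same task at roughly 2.5x faster end to end:

```python
# Back-of-envelope: combine the two efficiency claims above.
# Assumption (ours): the effects multiply; real workloads may differ.
tokens_ratio = 0.5        # "less than half" the tokens of GPT-5.2-Codex
per_token_speedup = 1.25  # "25% faster" per token

# Relative wall-clock time = (tokens used) / (tokens per second),
# both expressed relative to GPT-5.2-Codex.
time_ratio = tokens_ratio / per_token_speedup
print(f"relative task time: {time_ratio:.2f}x")      # 0.40x
print(f"end-to-end speedup: {1 / time_ratio:.1f}x")  # 2.5x
```

If the claims hold independently, the headline "25% faster" understates the practical effect: most of the win comes from emitting fewer tokens, not from raw generation speed.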
GPT-5.3-Codex was co-designed, trained, and deployed with the NVIDIA GB200 NVL72 system.
Direction Has Changed
OpenAI stated at the end of the article:
GPT-5.3-Codex moves Codex from "writing code" to "using code as a tool to operate computers and complete work end-to-end."
Initially focused on becoming the best coding agent, it has now evolved into a more general-purpose computer collaborator, expanding the boundaries of who can build and what can be done with Codex.
On the same day, Anthropic released Opus 4.6 and OpenAI released GPT-5.3-Codex. The arms race between the two on the agentic-coding track has reached fever pitch.
It's also already usable in the CLI.
And the direction is becoming clearer: It's not about making the model write more code, but about making the model use code to accomplish everything.
Another notable point: GPT-5.3-Codex was officially released today, just hours after the AI agent platform Frontier.
What does this shortened release cycle mean?
OpenAI has released 5 major versions/updates in the past 6 months, compared to only 7 versions in total in the previous 15 months.
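Using only the figures quoted above, a quick illustrative calculation of the cadence shift:

```python
# Release cadence from the figures cited in this article.
recent = 5 / 6    # major versions per month over the last 6 months
earlier = 7 / 15  # per month over the previous 15 months

print(f"recent:  {recent:.2f} releases/month")   # 0.83
print(f"earlier: {earlier:.2f} releases/month")  # 0.47
print(f"cadence speedup: {recent / earlier:.1f}x")  # 1.8x
```

That is, by these counts OpenAI is shipping major versions nearly twice as fast as before.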
By OpenAI's own account, its increasingly complex models are being built with more and more AI-generated code. The faster cadence, then, either reflects genuine development speedups from functional AI-written code, or releases pushed out faster under competitive pressure and backstopped by heavier automated quality assurance.
This time, GPT-5.3-Codex participated in its own training process...
Interesting.
Related Links:
OpenAI Official Blog: https://openai.com/index/introducing-gpt-5-3-codex/
GPT-5.3-Codex System Card: https://openai.com/index/gpt-5-3-codex-system-card/
Trusted Access for Cyber: https://openai.com/index/trusted-access-for-cyber/
Cybersecurity Grant Program: https://openai.com/index/openai-cybersecurity-grant-program/