Breaking Away from NVIDIA's Ecosystem: OpenAI Releases New Programming Model GPT-5.3-Codex-Spark, Speed Reaches 1000 Tokens per Second

Just now, OpenAI released a new programming model that runs on a chip the size of a dinner plate and can output over 1000 tokens per second.


This model is called GPT-5.3-Codex-Spark, a lightweight version of GPT-5.3-Codex, designed specifically for real-time programming.


Sam Altman himself teased it before the release: "It sparks joy for me."


The secret to its speed is not NVIDIA's GPUs, but Cerebras's wafer-scale chip.

Fast

Traditional AI inference distributes the model across a stack of GPUs, which requires frequent communication between chips, and every round of communication adds latency.
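The cost of that inter-chip traffic can be sketched with a toy latency model. All numbers below are hypothetical, chosen only to illustrate the shape of the trade-off, not measured Cerebras or NVIDIA figures:

```python
# Toy per-token latency model: every layer costs compute time, and every
# chip boundary in a pipelined multi-GPU setup adds one transfer hop.
# A single-chip layout pays only the compute term.

def decode_latency_ms(layers: int, compute_per_layer_ms: float,
                      num_chips: int, hop_ms: float) -> float:
    """Per-token decode latency: compute for all layers plus one hop per chip boundary."""
    return layers * compute_per_layer_ms + (num_chips - 1) * hop_ms

# Illustrative values only:
single_chip = decode_latency_ms(layers=80, compute_per_layer_ms=0.01,
                                num_chips=1, hop_ms=0.05)
eight_gpus = decode_latency_ms(layers=80, compute_per_layer_ms=0.01,
                               num_chips=8, hop_ms=0.05)

print(f"single chip: {single_chip:.2f} ms/token")
print(f"8-GPU stack: {eight_gpus:.2f} ms/token")
```

Under these made-up numbers, the seven chip boundaries add almost half again to the per-token latency, which is exactly the term a single-wafer design removes.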

Cerebras's Wafer Scale Engine 3 (WSE-3) takes a completely different approach: it turns an entire silicon wafer into a single chip.

How big is this chip?

About the size of a dinner plate.

And it has 4 trillion transistors.

It has the largest on-chip memory of any current AI processor, which eliminates the communication overhead of a multi-chip setup: the whole model runs on a single chip, with no need to shuttle data back and forth between devices.

The result: inference speed exceeds 1,000 tokens per second, roughly 15 times faster than traditional GPU inference.
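Plugging the quoted figures into a back-of-the-envelope calculation shows what that speedup means for a single completion. The 500-token completion size is an assumption for illustration; the rates come from the article's 1,000 tokens/s and "15x" claims:

```python
# How long it takes to stream one completion at different decode rates,
# using the figures quoted above (illustrative only).

def seconds_to_stream(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second

completion_tokens = 500           # assumed size of a medium code suggestion
wafer_rate = 1000.0               # tokens/s, as reported for Codex-Spark
gpu_rate = wafer_rate / 15        # ~67 tokens/s, per the "15x faster" claim

print(f"wafer-scale: {seconds_to_stream(completion_tokens, wafer_rate):.1f} s")
print(f"gpu stack:   {seconds_to_stream(completion_tokens, gpu_rate):.1f} s")
```

That is the difference between a half-second response and a seven-and-a-half-second wait: the former feels instantaneous, the latter breaks the flow of typing.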

For programming scenarios, this means the model writes code in step with your typing.

Near real-time feedback, the code flows out from your fingertips.

Not Just Fast

Codex-Spark is not just a "small model that runs fast".


On the two mainstream software-engineering agent benchmarks, SWE-Bench Pro and Terminal-Bench 2.0, Codex-Spark outperformed GPT-5.1-Codex-mini while completing tasks in a fraction of the time.

Fast and powerful!

OpenAI positions it as a productivity tool for everyday programming: rapid prototyping, real-time collaboration, and instant iteration.

You can interrupt and redirect it at any point while it writes code, and it responds almost instantly.

The larger and more powerful GPT-5.3-Codex is responsible for handling complex tasks that require deep reasoning and long execution times.

OpenAI's vision is to let the two models complement each other: Spark handles speed, Codex handles depth.

OpenAI's Chip Ambition

This is the first milestone in the collaboration between OpenAI and Cerebras.

In January this year, OpenAI announced a multi-year collaboration plan with Cerebras, worth over $10 billion.

And Cerebras has just completed a funding round of over $1 billion, with a valuation of about $23 billion, and is considering an IPO.

The significance of this collaboration is not just a new model.

This is the first time OpenAI has moved inference away from the NVIDIA ecosystem at scale.

Until now, almost all large-model inference has run on NVIDIA GPUs. Codex-Spark proves one thing: for specific scenarios like programming, specialized chips can lift the experience to a completely different level.

Cerebras CTO and co-founder Sean Lie said:

What excites us most is exploring what fast inference can bring with OpenAI and the developer community—new interaction modes, new usage scenarios, fundamentally different model experiences. This preview version is just the beginning.


Cerebras stated that by 2026, it will extend this ultra-fast inference capability to the largest frontier models.

How to Use

Currently, GPT-5.3-Codex-Spark is available as a research preview to ChatGPT Pro users, through the following channels:

  • The Codex app

  • The Codex CLI

  • The VS Code extension


Sam Altman admitted that there are still some limitations at release, but the team will "rapidly improve".

