GPT-5.4 Released Late at Night: The Chosen Model for OpenClaw Has Arrived

It was 2 AM, and I was just about to go to sleep.

Then GPT-5.4 dropped out of nowhere.

Image

I was suddenly so excited that I couldn't sleep.

Really, it's not that I make a big fuss about everything every day. I rarely use expressions like "so excited I can't sleep."

This is because I've been waiting for the official version of GPT-5.3 or GPT-5.4 to serve as the primary model for my OpenClaw.

The reason is simple: for the past thirty years, the modern world has essentially run on code. Almost everything we see in computing and on the internet is built on it.

So you can think of coding ability as, in many ways, one of the sturdiest legs holding up Agent capability.

In my understanding, an excellent Agent foundation model generally needs to be strong in three areas:

Coding ability, world knowledge, and multimodal understanding.

When you achieve SOTA in all three, you are almost inevitably the most powerful Agent model. Of course, another important factor is price.

In the past, Claude Opus 4.6 was almost synonymous with Agent models because its coding and world knowledge were both strong. Although its multimodal capabilities couldn't compare to Seed 2.0 and Gemini 3.1 Pro, in some scenarios, it was sufficient because current Agents don't interact that much with physical reality—that's already the domain of embodied intelligence.

And GPT-5.3-Codex, which I really liked before, had truly strong coding abilities. When executing tasks, it was precise and responsive.

But the biggest problem was that this thing was a programming-specialized model. Its world knowledge was garbage, even worse than GPT-5.2. So OpenAI had no choice at the time; to compete with Claude, they could only release it with a Codex suffix.

So you'll find that in terms of planning capabilities, it couldn't compare to Claude Opus 4.6 at all. But the biggest issue was actually the world knowledge problem, which led to...

It spoke in riddles. The things it said—really, coming from a non-programmer background, reading that was incredibly difficult.

For example, I previously asked it to review one of my AI hotspot website projects, mainly to review my documentation standards and my entire codebase.

Then, the documentation this guy wrote... holy cow...

Image

Now compare it with what Claude Opus 4.6 wrote.

Image

The comparison should be crystal clear...

Because this thing didn't speak human language and its world knowledge was lacking, it was fine to use within Codex, but if you connected it to your OpenClaw as a default model, you'd know what a disaster it was. This guy had almost no human touch; when it spoke, I wanted to punch it.

So I tried it and immediately abandoned it. I went back to using Claude Opus 4.6 and Sonnet 4.6 in my OpenClaw for scenario-based calls.

So why was I looking forward to GPT-5.4?

Because Claude is great in every way, but... it's expensive!!!

It's really, really expensive!!!!!!

And because Anthropic, those idiots, blocked OpenClaw, the Claude Max Plan subscription I paid for can't be used for OpenClaw at all. It can only be used in Claude Code. If you want to use it on OpenClaw, you have to use an API key directly.

But everyone knows how expensive Claude's API is. It's simply not something our poor team can afford. Small-scale usage is fine, but large-scale usage would bankrupt the company directly.

Previously there was another workaround: using a plugin to reverse-proxy Claude quota out of Google's Antigravity and into OpenClaw.

Image

But later Google started banning accounts in bulk, making that unusable too.

My Google account was also banned during the New Year, and I was forced to use AI to write a tearful email to Google.

Image

I said I was wrong and I wouldn't do it again.

Later Google unbanned me, but the reverse proxy definitely couldn't be used anymore.

But OpenAI is different. When Claude was crazily banning OpenCode accounts, OpenAI stepped up and said they wouldn't ban anyone—everyone should use it to their heart's content.

Image

OpenAI is the only one of the big three with such a supportive attitude, allowing third-party tools to use Codex quota.

Naturally, OpenClaw is no exception. That makes this one of the few top-tier models you can use there by simply logging in, while the others require an API key.

Image

Really, OpenAI is truly a saint this time.

They're also crazily adding quota to Codex.

Image

So, using Claude in OpenClaw is great, but you can't use subscription quota—only API, which is ridiculously expensive.

OpenAI's models can use subscription quota, but GPT-5.2's coding isn't good, and GPT-5.3-Codex doesn't speak human language.

See, it's as awkward as it gets.

But this time, GPT-5.4 is here!!!

Finally, this shortcoming has been addressed!

Coding ability on par with GPT-5.3-Codex, world knowledge stronger than GPT-5.2, and can use subscription quota—$20 gets you a fantastic experience.

Tell me, if this isn't the chosen model for OpenClaw, then what is? Hmm?

Starting today, everyone using OpenClaw, switch your default model to GPT-5.4. Really, trust me.

Returning to GPT-5.4, as usual, let's look at the benchmarks first.

Image

Very satisfying.

Let's look at the most critical ones first.

GDPval: 83.0%

This measures AI performance in real work tasks, including knowledge work across 44 professions such as finance and law.

GPT-5.4 Thinking scored 83.0%, Claude Opus 4.6 scored 78.0%, and GPT-5.3 Codex scored 70.9%.

In real business scenarios, GPT-5.4 doesn't just write code—it can discuss business, finance, law, and various professional fields with you.

And it does so in human language, not in riddles.

SWE-Bench Pro: 57.7%

This measures AI's ability to solve real software engineering problems, not just Python, but four programming languages.

GPT-5.4 Thinking scored 57.7%, and GPT-5.3 Codex scored 56.8%.

Basically the same.

This is exactly the result I wanted to see.

Coding ability preserved at GPT-5.3 Codex level, and world knowledge has caught up.

OSWorld-Verified: 75.0%. This measures AI's ability to operate computers—having AI use mouse clicks, keyboard inputs, and switch between different applications like a human to complete various tasks.

GPT-5.4 Thinking scored 75.0%, surpassing Claude Opus 4.6's 72.7%, and maintained parity with GPT-5.3-Codex.

Moreover, GPT-5.4 operates computers at an incredibly fast speed.

Image

Watch this unaccelerated video for a more intuitive view.

ToolAthon: 54.6%

This measures AI's ability to use tools, which is one of the core indicators of Agent capability.

GPT-5.4 Thinking scored 54.6%, while Claude Sonnet 4.6 scored 44.8%.

A difference of nearly 10 points.

As for academic knowledge and the like, there's no comparison with GPT-5.3-Codex: OpenAI knew its weakness there themselves, so they didn't even run those benchmarks at the time.

Image

In short, translated into plain language:

GPT-5.4 = GPT-5.3 Codex's coding ability + world knowledge stronger than GPT-5.2 + stronger tool usage capabilities + super cheap Codex quota.

Combined, these four elements make for a perfect OpenClaw chosen foundation model.

Then there are several great feature updates:

1. 1 million token context window.

This is a major upgrade for GPT-5.4.

The previous GPT-5.3 context window was 400,000 tokens. GPT-5.4 more than doubled it to 1 million.

This is incredibly important for Agents.

Because when an Agent executes tasks, it needs to maintain understanding of the entire task context. If the context window isn't large enough, the Agent will forget things as it works—what was said earlier won't be remembered later.

1 million tokens is basically sufficient for the vast majority of Agent tasks.

Of course, OpenAI isn't stupid. They say that after exceeding 270,000 tokens, your quota counts as double.

Image

However, since Codex gives such a generous quota, even at 2x, it's still fine.
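To make the accounting concrete: the 270,000-token threshold is from OpenAI's note above, but the exact formula below is my assumption about what "counts as double" means, not a published spec.

```python
def billed_tokens(context_tokens: int, threshold: int = 270_000) -> int:
    """Tokens up to the threshold count once; tokens beyond it count double.
    (Assumed accounting -- OpenAI only says usage past 270k 'counts as double'.)"""
    if context_tokens <= threshold:
        return context_tokens
    return threshold + 2 * (context_tokens - threshold)

# A full 1,000,000-token request under this reading:
print(billed_tokens(1_000_000))  # 270000 + 2 * 730000 = 1730000
```

So under this reading, maxing out the window costs you roughly 1.73x the tokens, which is exactly why the generous Codex quota still absorbs it comfortably.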

2. Native computer use capability.

This is another major selling point of GPT-5.4.

OpenAI says GPT-5.4 is their first mainline model with built-in native computer use capability.

It performs excellently in writing code to operate computers through libraries like Playwright, and can also issue mouse and keyboard commands based on screenshots.

That means code and vision working together. I feel that once this is integrated into the crayfish (OpenClaw), it can truly use vision to natively control most software on your computer. Just thinking about it is exciting.
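As a rough sketch of what "code and vision working together" looks like in an agent loop: a screenshot goes to the model, an action comes back, and the harness turns it into an input event. The model call and action format below are my illustrative assumptions, not OpenAI's actual interface.

```python
# Sketch of the code+vision computer-use loop.
# `ask_model_for_action` is a stand-in for a real GPT-5.4 vision call.

def ask_model_for_action(screenshot_png: bytes) -> dict:
    """Stand-in for the model: in reality you'd send the screenshot to
    GPT-5.4 and get an action back. Hardcoded here so the sketch runs."""
    return {"type": "click", "x": 640, "y": 360}

def dispatch(action: dict, log: list) -> None:
    """Translate a model action into an input event (logged, not executed)."""
    if action["type"] == "click":
        log.append(f"mouse.click({action['x']}, {action['y']})")
    elif action["type"] == "type":
        log.append(f"keyboard.type({action['text']!r})")

events: list[str] = []
dispatch(ask_model_for_action(b"<png bytes>"), events)
print(events)  # ['mouse.click(640, 360)']
```

The "code" half is that the model can also just emit a Playwright selector or a shell command instead of pixel coordinates, and the harness picks whichever path is more reliable.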

Based on this, they also released a new skill called playwright-interactive.

It allows Codex to debug Web and Electron applications using both code and visual methods simultaneously.

Image

The URL is here, you can install it yourself.

https://github.com/openai/skills/tree/main/skills/.curated/playwright-interactive

3. Support for tool search.

Previously, when a model was given tools, all tool definitions would be pre-included in the prompt.

For systems with large numbers of tools, this could add thousands or even tens of thousands of tokens to each request. Most of the time, this was meaningless, causing costs to rise and responses to slow down, while filling the context with information the model might never use.

So this time they also support tool search, meaning GPT-5.4 no longer directly receives complete tool definitions, but instead receives a lightweight list of available tools along with tool search functionality.

When the model needs to use a certain tool, it can look up that tool's definition and append it to the conversation at that time.

This is very similar to the Skills progressive presentation approach, and the purpose is simple: optimizing context engineering.

After testing, OpenAI found that the tool search configuration reduced total token usage by 47% while maintaining the same accuracy rate. This is really impressive.
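OpenAI hasn't published the mechanism in detail, but the pattern described above can be sketched roughly as follows. Every name here is mine and purely illustrative, not OpenAI's actual API.

```python
# Sketch of the tool-search pattern: the model sees a lightweight list of
# tool names up front, and full definitions are fetched only on demand.

TOOL_REGISTRY = {
    "get_weather": {
        "description": "Look up the current weather for a city.",
        "parameters": {"city": {"type": "string"}},
    },
    "send_email": {
        "description": "Send an email to a recipient.",
        "parameters": {"to": {"type": "string"}, "body": {"type": "string"}},
    },
}

def lightweight_tool_list() -> list[str]:
    """What goes into the prompt up front: names only, not full schemas."""
    return sorted(TOOL_REGISTRY)

def search_tools(query: str) -> dict:
    """Called when the model needs a tool: return matching full definitions
    so only those get appended to the conversation."""
    q = query.lower()
    return {
        name: spec
        for name, spec in TOOL_REGISTRY.items()
        if q in name or q in spec["description"].lower()
    }

print(lightweight_tool_list())         # ['get_weather', 'send_email']
print(list(search_tools("weather")))   # only get_weather's schema is expanded
```

With hundreds of tools, the prompt carries a short name list instead of every JSON schema, which is where the reported 47% token saving would come from.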

Image

That's about it for GPT-5.4 Thinking.

They also released a GPT-5.4 Pro this time, but I won't go into detail. It's more powerful in every way, but for most people, it's too expensive and not very useful—you need the $200 Pro membership to use it.

The overall API pricing should still be mentioned, although most people will probably use subscription quota.

Image

Compared to GPT-5.2, the price has increased, but it's still much cheaper than Claude Opus 4.6. Claude Opus 4.6 is priced at $5/$25 per million tokens (input/output), while GPT-5.4 is only half of that.
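Taking the article's numbers at face value (Opus 4.6 at $5/$25 per million tokens in/out, and "half of that" read literally as $2.50/$12.50 for GPT-5.4), a quick cost comparison for a typical agent-sized request:

```python
# Cost comparison using the per-million-token prices quoted above.
# GPT-5.4's $2.50/$12.50 is my literal reading of "half of" Opus pricing.
PRICES = {  # (input, output) in USD per 1M tokens
    "claude-opus-4.6": (5.00, 25.00),
    "gpt-5.4": (2.50, 12.50),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A 200k-token-in / 8k-token-out agent turn:
print(round(request_cost("claude-opus-4.6", 200_000, 8_000), 2))  # 1.2
print(round(request_cost("gpt-5.4", 200_000, 8_000), 2))          # 0.6
```

At agent scale, where a single task can burn dozens of such turns, halving the per-turn cost is the difference the article is getting at.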

ChatGPT is already online.

Image

Codex is also supported now. I briefly tested it in Codex myself.

Image

The first thing that hits you is, naturally, the refreshingly human language...

For example, I asked it to scrape the video from OpenAI's official website. Look at its response: "This kind of work is so annoying," "saves me from wasting time with Cloudflare"...

Image

And this one.

Really, I can finally understand Codex's output...

Image

The frontend aesthetics have improved a bit, but still not as good as Opus 4.6 and Gemini.

Image

I briefly tested the writing—it still has that weird habit of loving to use parallel sentences.

Quite strange.

Unfortunately, I waited until after 6 AM, and OpenClaw still doesn't support GPT-5.4 via Codex login.

Image

This means I still haven't had the chance to test GPT-5.4's performance on the crayfish.

But I estimate that after I wake up, the crayfish will probably support it.

Because many users in the community are already urging for it, and early adopters generally report good results.

Waiting for support, I really can't wait.

Another happy night.

If you're also using OpenClaw, remember to switch the default model to GPT-5.4 once OpenClaw supports it.

If you haven't used OpenClaw yet, now is a great time to start.

After all, with this chosen model GPT-5.4, the experience will only be better.

2026 is truly a crazy year.

Time to sleep.

That's all. Since you've made it this far, if you think this is good, feel free to like, watch, and share. If you want to receive push notifications first, you can also give me a star. Thank you for reading my article. See you next time.

>/ Author: Kazik

>/ For submissions or tips, contact email: wzglyay@virxact.com


AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.