In the run-up to the Spring Festival, overseas large models have arrived with a wave of strong, head-to-head releases.
On the morning of February 6th Beijing time, Anthropic and OpenAI successively launched new versions of their foundational large models: Claude Opus 4.6 and GPT-5.3-Codex.
Just yesterday the two companies were sparring over AI advertising; today they have collided again with competing model releases. Without further ado, let's look directly at what the models can do.
Claude Opus 4.6
Claude Opus 4.6 is a major upgrade to Anthropic's flagship AI model. This generation plans more carefully, can sustain longer autonomous workflows, and has surpassed competitors, including GPT-5.2, on key enterprise benchmarks.
The new model's headline feature is a 1-million-token context window, enabling the AI to process and reason over far more information than previous versions. Anthropic has also introduced an "agent team" feature in Claude Code, similar to Kimi K2.5: a research-preview capability that lets multiple AI agents work on different parts of a coding project simultaneously and coordinate autonomously.
Anthropic emphasizes that Opus 4.6 can apply its enhanced capabilities to a range of everyday work tasks, including running financial analysis, conducting research, and using and creating documents, spreadsheets, and presentations. Now in the Cowork environment, Claude can autonomously execute multiple tasks, and Opus 4.6 can apply all these skills on behalf of humans.
Opus 4.6 has performed exceptionally well across evaluations. It achieved the top score on Terminal-Bench 2.0, an agentic coding benchmark, and led all other frontier models on Humanity's Last Exam, a complex multi-disciplinary reasoning test. On GDPval-AA, which evaluates models on economically valuable knowledge work in finance, law, and other fields, Opus 4.6 scored about 144 Elo points above the industry's second-best model (OpenAI's GPT-5.2) and 190 points above its predecessor, Claude Opus 4.5. It also outperformed all other models on BrowseComp, which measures a model's ability to find hard-to-locate information online.
Claude Opus 4.6 is now available on claude.ai, APIs, and all major cloud platforms, with pricing unchanged at $5 / $25 per million tokens.
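Since the model is exposed through an API, a request for it would look roughly like the sketch below. This is a hedged illustration only: the payload follows the general shape of the Claude Messages API, and the model id "claude-opus-4-6" is an assumption inferred from the product name, not confirmed by the source; check Anthropic's documentation for the real identifier.

```python
# Illustrative sketch of a Messages-style request payload for Claude Opus 4.6.
# The model id below is an assumption; consult the official API docs.
def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a request body in the shape of a Messages API call."""
    return {
        "model": "claude-opus-4-6",  # assumed id, not verified
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
```

At the unchanged $5 / $25 per million tokens, the 1-million-token window means a single fully packed request could cost on the order of $5 in input tokens alone, which is worth keeping in mind when using the long context.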
A common weakness of current large models is "context decay": performance degrades once the number of conversation tokens passes a certain threshold. Here Opus 4.6 is markedly better: on the MRCR v2 8-needle 1M variant (a needle-in-a-haystack style test), Opus 4.6 scored 76%, while Sonnet 4.5 scored only 18.5%. This is a qualitative leap in how much context the model can actually use while maintaining peak performance.
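To make the "8-needle" idea concrete, here is a toy sketch of how such an evaluation can be constructed: a few "needle" strings are hidden at random positions in a long stretch of filler text, and the model is scored on what fraction of the needles it can recover. This is an illustration of the general needle-in-a-haystack setup, not the actual MRCR v2 harness, whose conversational format and scoring are more involved.

```python
import random

def build_haystack(needles, filler_words, total_words=1000, seed=0):
    """Hide each needle at a random position inside filler text."""
    rng = random.Random(seed)
    words = [rng.choice(filler_words) for _ in range(total_words)]
    positions = sorted(rng.sample(range(total_words), len(needles)))
    for pos, needle in zip(positions, needles):
        words[pos] = needle
    return " ".join(words)

def needle_recall(answer: str, needles) -> float:
    """Fraction of needles the model's answer reproduces verbatim."""
    return sum(n in answer for n in needles) / len(needles)
```

Scaling `total_words` up toward a million tokens is what makes the benchmark hard: the needles stay the same, but the model must keep them retrievable across the entire window.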
To demonstrate Opus 4.6's agentic capabilities, an Anthropic researcher used 16 agents to build a C compiler in Rust from scratch, setting the task and then essentially letting them run on their own. The final AI-generated code ran to 100,000 lines, could compile the Linux kernel, cost $20,000, involved over 2,000 Claude Code sessions, and took two weeks.
The compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V, passes 99% of GCC's torture tests, can compile FFmpeg, Redis, PostgreSQL, and QEMU, and also passes the developer's ultimate test: compiling and running Doom.
The compiler's code: https://github.com/anthropics/claudes-c-compiler
Although no humans wrote any of the code, the researchers continuously redesigned tests, built CI pipelines when agents interfered with one another, and devised workarounds when all 16 agents got stuck on the same bug.
It seems that in AI-augmented workflows, the human role is shifting from writing code to building environments in which AI can write code.
GPT-5.3-Codex
On OpenAI's side, the release of the new generation model GPT-5.3-Codex followed closely. Sam Altman stated that it possesses the best coding performance currently available, further unleashing the potential of Codex.
GPT-5.3-Codex has set new records on multiple benchmarks: reaching 56.8% on SWE-Bench Pro and 77.3% on Terminal-Bench 2.0, while running faster and consuming fewer tokens than previous versions.
OpenAI stated that this model combines the cutting-edge coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2, with a 25% speed increase. This enables it to handle long-duration tasks that require research, tool use, and complex execution.
It acts like a real colleague: you can guide and interact with GPT-5.3-Codex while it works, without it losing context. With GPT-5.3-Codex, Codex has evolved from an agent that writes and reviews code into one that can perform almost any operation a developer or professional does on a computer.
In addition to more powerful coding capabilities, GPT-5.3-Codex has again made significant progress on the aesthetic front that OpenAI has long emphasized.
In this release, OpenAI had GPT-5.3-Codex build two games: a second version of the racing game launched when Codex was released, and a diving game.
OpenAI stated that GPT-5.3-Codex drew on its web-game development skills and preset generic follow-up prompts (such as "fix bugs" or "improve the game"), iterating autonomously through millions of tokens of development.
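The preset-follow-up-prompt pattern described above can be sketched as a simple loop: the project state is fed back to the model together with a rotating generic instruction. Everything below is hypothetical scaffolding; the `fake_model` stand-in and the loop structure are assumptions for illustration, and the only details taken from the article are the two prompt strings.

```python
# Toy sketch of autonomous iteration via preset follow-up prompts.
# The real Codex agent is far more elaborate; this only shows the loop shape.
FOLLOW_UPS = ["fix bugs", "improve the game"]

def iterate(project: str, model, rounds: int = 4) -> str:
    """Repeatedly hand the project plus a rotating follow-up prompt to the model."""
    for i in range(rounds):
        prompt = FOLLOW_UPS[i % len(FOLLOW_UPS)]
        project = model(prompt, project)
    return project

def fake_model(prompt: str, project: str) -> str:
    """Stand-in for the model: records which instruction it applied."""
    return project + f"\n# applied: {prompt}"
```

The point of the pattern is that generic prompts like "fix bugs" let the agent keep making progress without a human composing a new instruction each round.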
With this release, OpenAI's ambitions for GPT-5.3-Codex go well beyond an intelligent coding model: it is an agent that goes "beyond coding" to serve as a general work assistant.
GPT-5.3-Codex can support all work in the software lifecycle—debugging, deployment, monitoring, writing product requirement documents, editing copy, user research, testing, metrics analysis, and more.
Example: a net-value analysis table produced by GPT-5.3-Codex
OpenAI believes that as model capabilities continue to increase, the gap is no longer just about what agents can do, but rather how easily humans can interact with, guide, and supervise multiple agents working in parallel. Given this, the Codex application makes managing and guiding agents more convenient, and the addition of GPT-5.3-Codex makes its interactivity even stronger.
With the new model, Codex posts frequent updates to keep you informed of key decisions and progress. You can interact in real time without waiting for the final output: asking questions, discussing approaches, and exploring solutions together. GPT-5.3-Codex narrates its progress as it runs, responds to feedback, and keeps you in control from start to finish.
Finally, OpenAI stated that GPT-5.3-Codex's training and deployment used Codex, and many of OpenAI's researchers and engineers have said that their work has fundamentally changed compared to two months ago.
For example, the research team used Codex to monitor and debug this version's training run. It not only accelerated the debugging of infrastructure issues but also helped track patterns throughout the training process, conduct in-depth analysis of interaction quality, propose fixes, and build rich applications that allowed researchers to precisely understand the differences in model behavior compared to previous models.
The engineering team used Codex to optimize and adapt the GPT-5.3-Codex serving stack. When severe user-facing anomalies occurred, team members used Codex to identify context-rendering errors and find the root causes of low cache-hit rates. Throughout the release, GPT-5.3-Codex supported the team by dynamically scaling GPU clusters to absorb traffic peaks and keep latency stable.
During the Alpha test, a researcher wanted to understand how much additional work GPT-5.3-Codex could complete per round and the resulting productivity gains. GPT-5.3-Codex generated several simple regex classifiers to estimate the frequency of user clarification requests, positive and negative feedback, and task progress, then applied these classifiers scalably to all session logs and generated a report with conclusions.
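The regex-classifier approach described above can be sketched in a few lines: each category gets a pattern, and every session message is counted against each pattern. The patterns and log format below are illustrative assumptions, not the classifiers the researcher actually generated.

```python
import re
from collections import Counter

# Hypothetical regex classifiers in the spirit of the experiment described
# above; the patterns and categories are illustrative assumptions.
CLASSIFIERS = {
    "clarification": re.compile(r"\b(could you clarify|which file|do you mean)\b", re.I),
    "positive": re.compile(r"\b(thanks|great|perfect|nice)\b", re.I),
    "negative": re.compile(r"\b(wrong|broken|doesn't work|failed)\b", re.I),
}

def classify_logs(messages):
    """Count how many messages match each classifier (at most once per message)."""
    counts = Counter()
    for msg in messages:
        for label, pattern in CLASSIFIERS.items():
            if pattern.search(msg):
                counts[label] += 1
    return counts
```

Crude as regex matching is, it scales to every session log cheaply, which is exactly what makes it a reasonable first estimate before investing in a model-based classifier.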
GPT-5.3-Codex is already included in ChatGPT's paid plans; API access will follow later.
OpenAI reported that due to improvements in the infrastructure and inference stack, Codex users now run GPT-5.3-Codex 25% faster, enabling faster interactions and quicker results.
Conclusion
Overseas large models have taken the stage in turn, and in the final days before the Spring Festival, competition among domestic large models is bound to intensify, including DeepSeek v4, which may arrive soon.
Are you looking forward to it?
References:
https://www.anthropic.com/news/claude-opus-4-6