Short Conclusion: A Specialist Finally Becomes an All-Rounder
Basic Situation:
Anthropic has its own distinctive understanding of which tasks matter. Since the 2.0 era it has cultivated creative-writing and data-analysis capabilities for white-collar office workers. By the 3.7 era it had successfully mastered the coding technology line, making the Claude series synonymous with Vibe Coding. By the 4.5 era, Sonnet and Opus held an almost unshakable position in coding and data analysis.
OpenAI's offensive has been fierce. The GPT series may not have prioritized coding before, but that does not mean the high-IQ GPT cannot learn to code. Facing GPT's top-down approach, leading with raw intelligence and working downward into applications, Claude's flagship Opus must shoulder the burden, raise its own raw intelligence, and prepare for 2026.
The good news is that Opus 4.6 is comparable to GPT-5.2 in most aspects, with equivalent mathematical and logical intelligence, and even better Agent capabilities.
The bad news is that this requires nearly 2x the cost. Considering that Agent applications are token black holes, the actual cost difference will be even higher.
Logic Scores:
*1 The table shows only a selection of comparable models for contrast; it is not a complete ranking.
*2 For questions and testing methods, see: Large Language Model - Logic Capability Cross-Review 26-01 Monthly Ranking. New question #56 added.
*3 The complete ranking is updated at https://llm2014.github.io/llm_benchmark/
The following comparison focuses on Opus 4.6 in reasoning mode versus GPT-5.2 in reasoning mode; wherever non-reasoning mode is discussed, it is explicitly marked.
Advantages:
Character Processing: Character processing has always been Claude's signature strength, and the 4.6 generation improves it further. Question #41, the scrambled-text parsing task that has stopped many models in their tracks, was passed by Opus in more than half of its runs, a first for any model, and even in non-reasoning mode it passed half. Opus's floor is higher than GPT-5.2's ceiling. On January's new question #55, an obstacle-map problem, full marks require solid character-processing skills; GPT-5.2 was previously the best model here but made several small errors, while Opus earned full marks on one pass and made only one error across the other two. In character processing, Opus typically leads other models by more than eight months.
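To make the task type concrete: the exact #41 question is not public, but a scrambled-text parsing task is commonly built by shuffling the interior letters of each word while keeping the first and last letters fixed. A minimal sketch of such a generator (illustrative only; the function name and shuffling rule are my assumptions, not the benchmark's):

```python
import random

def scramble(text: str, seed: int = 0) -> str:
    # Shuffle the interior letters of each word, keeping the first and
    # last letters in place. Illustrative only; not the actual #41 task.
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) > 3:
            mid = list(word[1:-1])
            rng.shuffle(mid)
            word = word[0] + "".join(mid) + word[-1]
        out.append(word)
    return " ".join(out)

print(scramble("character processing has always been a signature strength"))
```

A model with strong character-level processing can reconstruct the original sentence from such output; models that lean heavily on whole-token patterns tend to stumble on it.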
Calculation: Calculation was originally a weak point for Opus in non-reasoning mode; the 4.5 generation's scores here were even lower than those of domestic models in the same tier. But 4.6 turns this around with significantly improved calculation precision: while it still cannot earn full marks on the relevant questions, it consistently holds a high level with only minor decimal errors. In reasoning mode it naturally achieves stable full marks, and on complex calculation it even outperforms GPT-5.2.
Complex Reasoning: On problems requiring specific thinking skills and problem-solving methodology, such as Sudoku, Sudoku variants, and ARC-AGI analogues, Opus has clearly received targeted training, with markedly improved solving efficiency. Question #49 could previously be answered with full marks only by GPT-5.2; now Opus also earns full marks consistently, and even its non-reasoning mode occasionally does. However, as such questions gradually make up a smaller share of the author's test set, Opus's expected score will decline slightly going forward.
Insight: For a model designed for white-collar office work, data-processing capability is an unavoidable requirement, and that includes data insight and pattern recognition. This was previously GPT's leading area, but Opus is gradually catching up: on the relevant questions, Opus now scores the same as GPT-5.2. However, Opus typically consumes 20%~130% more tokens, so it still trails somewhat in efficiency.
Weaknesses:
Hallucination: Opus's hallucination rate is slightly higher than GPT's. Low hallucination is, after all, one of OpenAI's long-term technical moats, and it is not easy for Anthropic to close the gap. The distribution of hallucinations is not closely tied to context length: even in short texts of only a few thousand characters, Opus has a significant probability of overlooking some text or numbers, leading to errors in the final result, whereas GPT reliably produces correct results across multiple passes. Question #42, an annual-report organization task, has a longer text and requires extracting more information, so Opus scores lower on it. In Agent mode such problems can certainly be mitigated with search tools, but hallucinations also contaminate self-generated context. Accordingly, Opus's stability drops noticeably on problems that demand little intelligence but involve particularly many intermediate steps, and it cannot hold a stable score there.
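The instability on step-heavy problems follows from simple arithmetic: if each intermediate step has an independent slip probability eps, the chance of a fully clean run decays as (1 - eps)^n. A back-of-the-envelope sketch with assumed, not measured, error rates:

```python
# Back-of-the-envelope illustration: the per-step slip rates below are
# assumed values for the sake of the arithmetic, not measured figures
# for any model.
def clean_run_probability(eps: float, steps: int) -> float:
    """Probability that all `steps` independent steps complete without a slip."""
    return (1 - eps) ** steps

for eps in (0.001, 0.005):
    for n in (10, 100, 500):
        p = clean_run_probability(eps, n)
        print(f"eps={eps:.3f}, steps={n:3d}: P(clean run) = {p:.3f}")
```

Even a small gap in per-step reliability compounds: at 500 steps, the difference between a 0.1% and a 0.5% slip rate is the difference between a mostly clean run and an almost certainly broken one, which matches the pattern of scores collapsing specifically on step-heavy problems.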
Cyber Historian Says:
Claude is supplementing intelligence; GPT is supplementing engineering thinking in coding. OpenAI is focusing on model safety, stability, low hallucination, and per-token efficiency, the more fundamental, infrastructure-level work. Anthropic has enabled multi-Agent collaboration, self-evolution, and long context, racing down the main road of having large models replace traditional office software. The two appear to be converging but are in fact diverging. Both top AI companies have clear plans for AI's future; rather than competing redundantly, they are exploring separately, which is fortunate for the times.
Opus 4.6 will also join the coding engineering test; the results will be uploaded to the website in the coming days.