Google Gemini 3.1 Pro Dominates Benchmarks, Tsinghua's Yao Shunyu Strikes! Claude and GPT Forced into a Corner

New Intelligence Yuan Report

Editors: Haokun, Taozi

[New Intelligence Yuan Briefing] Google DeepMind dropped a bombshell late at night, officially unveiling its next-generation flagship model, Gemini 3.1 Pro. On the notoriously difficult ARC-AGI-2 test, it posted the highest score to date and stunned Silicon Valley, doubling its reasoning performance and dethroning Claude Opus 4.6.

Following Gemini 3 Pro, Google DeepMind has finally unleashed its ultimate move!

Just now, the next-generation flagship model Gemini 3.1 Pro made a late-night debut, shattering SOTA records across all domains to become the new AI king.

After working on Deep Think, Tsinghua alumnus Yao Shunyu also took part in the development of Gemini 3.1 Pro.

This time, Gemini 3.1 Pro represents an epic leap in large-scale model reasoning capabilities.

On the extremely rigorous ARC-AGI-2 test, it scored a remarkable 77.1%, more than double the previous-generation Gemini 3 Pro.

Add a near-perfect 98% on ARC-AGI-1, and both the reasoning-heavy Claude Opus 4.6 and the specially tuned GPT-5.2 have been left in the dust.

From the SVG comparison test below, one can intuitively feel the massive generation gap in capabilities between 3.1 Pro and 3 Pro.

In coding and reasoning domains, Gemini 3.1 Pro similarly dominates, comprehensively crushing Sonnet 4.6 and GPT-5.2.

In the AAII comprehensive evaluation, 3.1 Pro topped the chart, leading Claude Opus 4.6 by a full 4 points in total score while costing less than half as much in API calls.

Starting today, Gemini 3.1 Pro is officially available in Gemini and NotebookLM. Developers can get early access through Google AI Studio, Antigravity, and Android Studio.

Now, the AI battlefield in Silicon Valley has fundamentally shifted, with only heavyweight players Google DeepMind and Anthropic left to face off.

OpenAI, previously enjoying the limelight, seems to be gradually losing its initiative on this main battlefield.

Gemini 3.1 Pro Late Night Raid

Comprehensive SOTA Scores Doubled

As Google's most formidable model to date, 3.1 Pro achieves a comprehensive leap beyond 3 Pro.

It not only accepts natively multimodal input but also supports an ultra-long context window of up to 1 million tokens.

In the performance benchmarks most closely watched by the industry, Gemini 3.1 Pro demonstrates breathtaking dominance.

On Humanity's Last Exam (HLE), Gemini 3.1 Pro scored 44.4% without tool assistance, cornering GPT-5.2 (34.5%) and Opus 4.6 (40.0%).

On ARC-AGI-2, Gemini 3.1 Pro scored a staggering 77.1%, leaving Opus 4.6 (68.8%), which had taken the top spot just two days earlier, trailing behind.

Even more shocking is its leap forward in coding and AI agent tasks.

On LiveCodeBench Pro, it posted an Elo score of 2887, leaving its peers in the dust.

On Terminal-Bench 2.0, its 68.5% beat even the code-specialized GPT-5.3-Codex (64.7%).

In APEX-Agents, it achieved a commanding 33.5%, compared to Opus 4.6's 29.8% and GPT-5.2's mere 23.0%.
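To put the head-to-head numbers above side by side, here is a minimal Python sketch that computes Gemini 3.1 Pro's lead over its closest listed rival on each benchmark. It uses only the percentages quoted in this article; the names `SCORES`, `LEADS`, and `lead_over_runner_up` are made up for illustration, and benchmarks with no rival score quoted (such as LiveCodeBench Pro) are left out.

```python
# Scores (percent) exactly as quoted in this article; no other figures are assumed.
SCORES = {
    "Humanity's Last Exam (no tools)": {
        "Gemini 3.1 Pro": 44.4, "Opus 4.6": 40.0, "GPT-5.2": 34.5,
    },
    "ARC-AGI-2": {"Gemini 3.1 Pro": 77.1, "Opus 4.6": 68.8},
    "Terminal-Bench 2.0": {"Gemini 3.1 Pro": 68.5, "GPT-5.3-Codex": 64.7},
    "APEX-Agents": {"Gemini 3.1 Pro": 33.5, "Opus 4.6": 29.8, "GPT-5.2": 23.0},
}


def lead_over_runner_up(results, model="Gemini 3.1 Pro"):
    """Return (closest rival, lead in percentage points) for `model`."""
    rivals = {name: score for name, score in results.items() if name != model}
    runner_up = max(rivals, key=rivals.get)
    # Round to one decimal to match the precision of the quoted scores.
    return runner_up, round(results[model] - rivals[runner_up], 1)


LEADS = {bench: lead_over_runner_up(results) for bench, results in SCORES.items()}

for bench, (rival, lead) in LEADS.items():
    print(f"{bench}: +{lead} points over {rival}")
```

By this arithmetic, the widest of the four gaps is on ARC-AGI-2 (8.3 points over Opus 4.6) and the narrowest is on APEX-Agents (3.7 points over Opus 4.6).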

Beyond hardcore reasoning, Gemini 3.1 Pro also shows its muscle in processing lengthy texts.

On the MRCR v2 128k long-context test, it scored 84.9%.

More striking still, it is the only model with a result at the full 1M-token setting, scoring 26.3%, while rivals GPT-5.2 and Opus 4.6 simply show "not supported" at that length.

More importantly, compared to the previous generation, 3.1 Pro has significantly reduced hallucination rates.

Hand-Built, God-Tier Applications: This Is the Killer AI

What 3.1 Pro brings is not just benchmark crushing but a comprehensive evolution in logical reasoning capabilities.

It can now crack extremely tricky logic puzzles, and in practical applications it shows a stunning ability to reshape productivity.

Whether transforming obscure concepts into intuitive diagrams, condensing massive data into clear charts, or turning wild creativity into reality, 3.1 Pro handles them all with ease.

Code-Based Animation

With just a simple text prompt, 3.1 Pro can directly generate SVG animations that can be seamlessly embedded into web pages.

Best of all, because these animations are built purely from code, they stay perfectly sharp at any scale, and their files are tiny compared with traditional video.
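To make the "pure code, tiny file" point concrete, here is a minimal Python sketch that builds a hand-written animated SVG and wraps it in a web page. This is not output from Gemini 3.1 Pro: the shape, colors, timing, and the helper name `build_page` are all illustrative assumptions. It relies only on the standard SVG `<animate>` element and Python's built-in XML parser.

```python
import xml.etree.ElementTree as ET

# A hand-written animated SVG: a circle sliding back and forth via <animate>.
# Shape, colors, and timing are illustrative choices, not model output.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 60">
  <rect width="200" height="60" fill="#0b1020"/>
  <circle cx="20" cy="30" r="10" fill="#4f9dff">
    <animate attributeName="cx" values="20;180;20" dur="3s" repeatCount="indefinite"/>
  </circle>
</svg>
"""

# Parsing confirms the markup is well-formed XML before it is embedded anywhere.
root = ET.fromstring(SVG)

# The whole animation is just this text, so its size is the byte length.
size_bytes = len(SVG.encode("utf-8"))


def build_page(svg):
    """Inline the SVG in a bare HTML page (hypothetical helper)."""
    # The root <svg> sets only a viewBox (no fixed width/height), so the
    # browser scales the vector drawing to fit the surrounding container.
    return f'<!doctype html>\n<title>SVG demo</title>\n<div style="width:100%">{svg}</div>\n'


print(f"SVG payload: {size_bytes} bytes")
```

Saving the string returned by `build_page(SVG)` as an `.html` file and opening it in a browser should show the circle moving; the entire animation is well under a kilobyte of text.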


Integrating Complex Systems

Powerful reasoning capabilities also allow 3.1 Pro to completely break down barriers between complex APIs and human-friendly design.

For example, it can build a real-time aerospace data dashboard from scratch, connecting to an open telemetry data stream to display the International Space Station's orbit in real time.
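As a rough idea of what the data side of such a dashboard involves, here is a minimal Python sketch that turns a batch of position fixes into a ground track. It is an assumption-laden illustration, not the code the model produced: the payload shape (`timestamp`, `lat`, `lon`), the sample values in `RAW`, and the function names are all hypothetical, and no real telemetry service is called. The distance between fixes uses the standard haversine formula.

```python
import json
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def ground_track(samples):
    """Sort fixes by time; return (timestamp, lat, lon, km since previous fix)."""
    track, prev = [], None
    for s in sorted(samples, key=lambda s: s["timestamp"]):
        step = 0.0 if prev is None else haversine_km(prev["lat"], prev["lon"], s["lat"], s["lon"])
        track.append((s["timestamp"], s["lat"], s["lon"], step))
        prev = s
    return track


# Hypothetical payload: field names and values are made up for illustration only.
RAW = '[{"timestamp": 10, "lat": 0.5, "lon": 10.4}, {"timestamp": 0, "lat": 0.0, "lon": 10.0}]'
TRACK = ground_track(json.loads(RAW))

for ts, lat, lon, step in TRACK:
    print(f"t={ts:>3}s lat={lat:+.2f} lon={lon:+.2f} step={step:.1f} km")
```

A real dashboard would replace `RAW` with responses polled from whichever telemetry source it uses and hand `TRACK` to a map or chart component; those pieces are deliberately left out here.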


Interactive Design

3.1 Pro can even write extremely complex 3D starling murmuration effects in pure code, creating an entire immersive experience for you.

In this system, you can "conduct" the flock in real time through gesture tracking while hearing generative music that evolves with the flock's movements.

This is absolutely a powerful tool for researchers and designers developing multimodal interactive interface prototypes.


Creative Programming

More interestingly, 3.1 Pro can turn classic literary themes into polished, working code.

For example, when asked to design a modern-style personal homepage for "Wuthering Heights," the model not only precisely captured the oppressive and profound atmosphere of the original work but also generated a minimalist and modern interface, perfectly grasping the soul of the protagonist.


Stunning First Tests Across the Web, Dominating SVG

Google UX Engineer Michael Chang put it straight to the test, using 3.1 Pro to simulate complex urban planning and instantly generate a bird's-eye layout for a brand-new city.


From a one-sentence prompt, 3.1 Pro produced an 11-second SVG animation in just 3 minutes.

In another SVG test, the "seal balancing a ball" it generated is also visually stunning.

AI expert Simon Willison tested it as well: within 5 minutes, 3.1 Pro generated a clean pelican SVG with clearly outlined legs.

In 3D spatial reasoning, 3.1 Pro is also the new SOTA.

The 3D pixel-style Pokémon generated by 3.1 Pro is far superior to the one from 3 Pro.

Additionally, 3.1 Pro can generate a polished interactive animation showing the entire process of a seed sprouting and growing into a big tree.

Evolution Has No End, Only a Stronger Next Chapter

Starting today, the Gemini 3.1 Pro preview is officially released, and this is just the beginning.

Google stated that from November last year to today, authentic user feedback has accelerated every iteration.

Gemini 3.1 Pro's late-night raid is another reshaping of the AI industry landscape.

With this nearly "muscle-flexing" iteration speed, Google DeepMind is telling the world:

In the deep waters leading to AGI, only players with tightly coupled hardware compute and algorithmic depth can secure tickets to the second half.

References:

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

https://x.com/Google/status/2024519455389192204?s=20

https://deepmind.google/models/model-cards/gemini-3-1-pro/
