Reported by New Zhiyuan
Editor: Aeneas KingHZ
[New Zhiyuan Digest] Just yesterday, ARC-AGI-3 thoroughly embarrassed the world's top large language models. Yet today, a relatively unknown company has dropped a bombshell: their AI achieved a score of 36.08% on day one! How did this dark horse tear open the iron curtain of the world's most difficult AI exam? Is this a genuine breakthrough, or is there more to the story?
A Stunning Reversal!
Just yesterday, ARC-AGI-3, the most difficult test for AI, made its debut, and global large models were decimated overnight.
Opus 4.6, the strongest top-tier model, managed only a pathetic 0.2%. Meanwhile, humans surged far ahead, achieving perfect scores.
This shocked onlookers: from Jensen Huang to the inventor of the AGI concept himself, many believed we had already reached AGI. Are we truly that far away?
Unexpectedly, within just 24 hours, ARC-AGI-3 was cracked!
Moments ago, a company named Symbolica announced:
"Using the Agentica framework, we achieved a score of 36.08% on the ARC-AGI-3 test on the first day, completely crushing the CoT model baseline."
Out of 182 levels, they have successfully cleared 113. Of the 25 available games, they completed 7.
The world's most difficult exam has been breached!
Symbolica's Day-One Surprise: Surging to 36%
Just as people were sighing over Opus 4.6's pitiful 0.2% score, even beginning to doubt whether "AGI is merely a fantasy woven by big tech," salvation arrived in the form of a surprise.
Why was Symbolica's Agentica framework able to deliver such a stunning 36.08% score on the very first day of ARC-AGI-3's release?
Agentica is Symbolica's agent framework, on top of which the company built a specialized agent system for ARC-AGI-3.
Consider that in the face of ARC-AGI-3's near-sadistic scoring formula, (Human Steps / AI Steps)^2, the leading large models were still spinning their wheels in the fog. A score of 36.08% is nothing short of a dimensional strike.
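To make the squared penalty concrete, here is a minimal sketch of the per-level score as the article quotes it. The cap at 1.0 (so that beating the human step count cannot exceed a perfect score) is an assumption on my part; the official formula may differ in detail.

```python
def level_score(human_steps: int, ai_steps: int) -> float:
    """Per-level efficiency score as quoted in the article:
    (human steps / AI steps), squared. The cap at 1.0 is an
    assumption, not confirmed by the official rules."""
    if ai_steps == 0:
        return 0.0
    return min(1.0, human_steps / ai_steps) ** 2

# An AI that needs 3x the human's steps keeps only ~11% of the score:
print(level_score(100, 300))  # ≈ 0.111
```

The squaring is what makes the benchmark brutal: wasting steps does not reduce the score linearly, it collapses it quadratically, which is why aimless CoT exploration scores near zero.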
To understand why Symbolica won, one must first understand how Opus 4.6 and GPT-5.4 lost.
The biggest difference between ARC-AGI-3 and its predecessors is that it is not "static image captioning" but an interactive black-box game.
When a pure LLM-based agent enters the game, its fatal flaw is attempting to use association instead of logic, and pattern matching instead of experimentation.
When facing unknown environments, large models rely on their vast pre-trained knowledge bases to "fill in the blanks." Seeing red squares and blue lines, they might associate them with "Sokoban" or "water level balancing," and then frantically output Chain-of-Thought (CoT) based on this incorrect hypothesis.
If the hypothesis is wrong, the model does not stop to reflect; instead, it runs further down the wrong path until its step count is exhausted and its score drops to zero.
ARC-AGI-3 specifically targets these AI weaknesses, measuring three key capabilities in environments that are 100% solvable by humans:
Efficiency of skill acquisition over time
Long-range planning capabilities under sparse feedback
Multi-step, experience-driven adaptability
Symbolica's Agentica framework, however, has taken a completely different technical path!
Agentica natively supports multi-agent architectures and features design-level parallelism. It automatically decomposes complex tasks into sub-problems and delegates work to sub-agents to be completed in parallel.
This means agents can maintain high efficiency and complete tasks faster out of the box!
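The decompose-and-fan-out pattern described above can be sketched with Python's asyncio. All names here are hypothetical illustrations of the idea, not Agentica's actual API.

```python
import asyncio

async def sub_agent(name: str, subtask: str) -> str:
    """Hypothetical sub-agent: works on one subtask and returns
    a short text summary rather than raw data."""
    await asyncio.sleep(0)  # stand-in for real work
    return f"{name}: finished '{subtask}'"

async def orchestrator(task: str) -> list[str]:
    # Decompose the task into subtasks, then run sub-agents in parallel.
    subtasks = [f"{task} / part {i}" for i in range(3)]
    return await asyncio.gather(
        *(sub_agent(f"agent-{i}", s) for i, s in enumerate(subtasks))
    )

summaries = asyncio.run(orchestrator("explore level 1"))
print(summaries)
```

The point of "design-level parallelism" is exactly this shape: the framework, not the prompt, guarantees that independent sub-problems run concurrently.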
Agentica is a type-safe AI framework that allows LLM agents to integrate seamlessly with code, including functions, classes, active objects, and even entire SDKs.
Previously, leveraging its strength on long-range reasoning tasks, Symbolica achieved SOTA results on ARC-AGI-2, with the Agentica SDK playing a crucial role.
Core Secret: Arcgentica RLM Harness
From the GitHub page, specifically the IDEA.md file, we discovered Agentica's secret weapon: the ARC-AGI-3 agent framework (agent harness).
GitHub Address: https://github.com/symbolica-ai/ARC-AGI-3-Agents
"Agent harness" has been an absolute buzzword recently, constantly mentioned in Anthropic's official blog and in discussions among industry leaders.
If 2025 is the starting point of the golden age of agents, then 2026 will focus on Agent Frameworks (Agent Harnesses).
An Agent Framework is infrastructure built around AI models to manage long-running tasks, but it is not an agent itself.
This time, Agentica understood the game mechanics from scratch and solved multiple level puzzles without any game-specific prompts.
What makes the Arcgentica RLM framework, built on the Agentica SDK, so special?
First, it is game-agnostic.
ARC-AGI-3 is difficult because it strips away all natural language prompts. Humans can pass because we possess physical intuition.
To address this, Agentica adopted the most extreme "Game-agnostic" strategy.
The agent does not know what colors represent, what actions do, or what the winning conditions are; it infers everything solely by interacting with the game and observing changes.
This blank state is precisely what enabled its success.
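A game-agnostic agent of this kind can only learn what an action does by diffing the screen before and after it acts. A minimal sketch of that differential observation, with all details assumed for illustration:

```python
def frame_diff(before, after):
    """Compare two grid frames and report which cells changed.
    A game-agnostic agent infers action effects only from diffs
    like this, never from prior knowledge of the game."""
    changes = []
    for y, (row_b, row_a) in enumerate(zip(before, after)):
        for x, (b, a) in enumerate(zip(row_b, row_a)):
            if b != a:
                changes.append(((x, y), b, a))
    return changes

before = [[0, 0], [1, 0]]
after  = [[0, 0], [0, 1]]
print(frame_diff(before, after))  # [((0, 1), 1, 0), ((1, 1), 0, 1)]
```

From a diff like "cell (0, 1) cleared and cell (1, 1) filled," the agent can hypothesize "this action moves the object right" without ever being told what the colors mean.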
Second is the "Orchestrator + Specialized Sub-Agent" model.
The top-level orchestrator never operates the game directly; it delegates tasks to sub-agents, accumulates knowledge, and decides on the next course of action.
Specialized sub-agents include: explorers, theorists, testers, and solvers.
If an agent starts looking at the grid, its context gets filled with pixel data, causing it to lose strategic thinking ability. Sub-agents report in the form of short text summaries rather than raw data.
This clever division-of-labor design avoids the severe defect seen in models like Opus 4.6, where the same "brain" must simultaneously look at pixels, remember rules, and command actions.
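The role split can be sketched as follows. The role names (explorer, theorist) come from the article; the function bodies and registry are hypothetical, meant only to show that the orchestrator reasons over short summaries rather than raw grids.

```python
# Hypothetical role registry: each sub-agent returns a short summary
# string, so the orchestrator never sees raw pixel data.
def explorer(state):
    return "explorer: action RIGHT moved the blue cell by one column"

def theorist(facts):
    return "theorist: hypothesis - blue cell must reach the gold cell"

ROLES = {"explorer": explorer, "theorist": theorist}

def orchestrate(state):
    report = ROLES["explorer"](state)
    plan = ROLES["theorist"]([report])
    # The orchestrator's context holds two short lines of text,
    # not a 64x64 grid of pixel values.
    return [report, plan]

for line in orchestrate(state={"grid": [[0] * 64] * 64}):
    print(line)
```

Keeping pixel data out of the orchestrator's context is the whole trick: its budget of attention stays on strategy, not perception.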
Third is its "Shared Memory" mechanism.
During the game, all agents share a memories database. Sub-agents record confirmed facts (scene layout, mechanics, winning conditions) and hypotheses (explicitly marked) during their work.
New agents query memory before starting, allowing them to inherit collective knowledge.
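A minimal sketch of such a shared memory store, separating confirmed facts from explicitly marked hypotheses. The class design is assumed; only the facts-versus-hypotheses distinction comes from the article.

```python
class SharedMemory:
    """Sketch of a shared 'memories' database: facts are confirmed
    observations; hypotheses are explicitly marked as unverified."""
    def __init__(self):
        self.facts: list[str] = []
        self.hypotheses: list[str] = []

    def record(self, text: str, confirmed: bool):
        (self.facts if confirmed else self.hypotheses).append(text)

    def briefing(self) -> str:
        """What a newly spawned agent queries before acting."""
        return "\n".join(
            [f"FACT: {f}" for f in self.facts]
            + [f"HYPOTHESIS: {h}" for h in self.hypotheses]
        )

mem = SharedMemory()
mem.record("level has a 64x64 grid", confirmed=True)
mem.record("red cells may be walls", confirmed=False)
print(mem.briefing())
```

Marking hypotheses explicitly matters: a new agent inheriting "red cells may be walls" knows it should test that claim rather than build a plan on it.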
Fourth is the "Level Switching" mechanism.
Level Switching: Once a level is solved, the next level loads directly within the same operation, and the returned screen is already the new level.
The state=WIN is triggered only when all levels are cleared; the completion of individual levels is judged by observing the increase in levels_completed.
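The detection logic described above can be sketched in a few lines. The field names (`levels_completed`, `state`) are taken from the article's description, but the exact response schema is an assumption.

```python
def detect_level_transition(prev_completed: int, frame: dict) -> bool:
    """Per the article: individual level completion is detected by
    watching 'levels_completed' increase, while state == 'WIN' fires
    only after ALL levels are cleared. Field names are assumed."""
    return frame.get("levels_completed", 0) > prev_completed

frame = {"state": "NOT_FINISHED", "levels_completed": 3}
print(detect_level_transition(2, frame))  # True: a level was just cleared
```

This matters because the returned screen after a clear is already the next level; without this counter check, the agent would misread the new layout as an unexplained mutation of the old one.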
Fifth, Agentica employs strict action budget management; every token must be spent wisely.
The total number of actions across all levels is capped (at roughly 800). The scheduler allocates action quotas to sub-agents via make_bounded_submit_action(limit), and the system requires agents to avoid repeating actions unless truly stuck.
Moreover, it prioritizes targeted attempts over brute-force exhaustive exploration.
Additionally, there are regulations requiring sub-agents to allocate tools on demand and schedulers to weigh reuse versus restart.
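The budget mechanism can be sketched as a wrapper around the raw action tool. The name make_bounded_submit_action(limit) appears in the article; the internals below are my assumed reconstruction, not Symbolica's code.

```python
def make_bounded_submit_action(limit: int, submit_action):
    """Wrap a raw submit_action tool with a hard quota, mirroring
    the make_bounded_submit_action(limit) the article mentions
    (internals here are an assumed sketch)."""
    remaining = limit

    def bounded_submit(action):
        nonlocal remaining
        if remaining <= 0:
            raise RuntimeError("action budget exhausted for this sub-agent")
        remaining -= 1
        return submit_action(action)

    return bounded_submit

log = []
submit = make_bounded_submit_action(2, submit_action=log.append)
submit("UP")
submit("DOWN")
# A third call would raise: the scheduler must then reallocate budget.
print(log)  # ['UP', 'DOWN']
```

Handing each sub-agent a pre-bounded tool, rather than trusting it to count, is what makes the ~800-action global cap enforceable by construction.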
It is worth noting that the official positioning of ARC-AGI-3 emphasizes the need for "exploration, perception → planning → action, memory, goal acquisition, and alignment capabilities."
Agentica's division of labor and control strategy is almost an "engineering decomposition" of these capabilities:
Exploration: Executed by sub-agent explorers under an action budget, attempting to extract "mechanism clues" via differential observation.
Planning/Rule Inference: Conducted by sub-agent theorists under the constraint of "no submit_action allowed" to deduce rules, reducing meaningless action consumption.
Memory: The explicit memories database makes cross-level strategy reuse more direct, lowering the action and token costs of "re-learning."
Long-range Adaptation: Level transitions are detected by levels_completed, and the orchestrator decides whether to continue the current strategy or re-enter the exploration loop.
Clearly, this mechanism aligns perfectly with the ARC-AGI-3 scoring structure (higher weight for later levels, squared penalty for efficiency)—it encourages the system to spend actions on experiments with the "highest information gain" and migrate strategies to higher-weight levels as soon as possible.
Is the 36.08% High Score Inflated?
However, while a 36% score is undoubtedly dazzling, until verified by the ARC Prize officials, Symbolica's "upset victory" remains shrouded in several layers of fog.
Symbolica also admits that this result has not yet received official certification from the ARC-AGI-3 organizing committee.
There is a very key phrase in the material: "unverified competition score."
Whether Symbolica's current result is based on an environment they built themselves or strictly replicates the official evaluation process remains a question mark.
Furthermore, there are some unusual details in the published score breakdown.
For instance, Symbolica pointed out that the human baseline data obtained via the ARC-AGI-3 API indicates game cn04 has a total of 6 levels, which does not match the level count the API itself reports for that game.
If official data suffers from version confusion, the validity of the scores becomes questionable.
Additionally, the score breakdown chart shows that games like LP85 and AR25 scored extremely high (80%-97%), while games like SP80 and BP35 scored extremely low (0.2%-0.7%).
Is this severe polarization caused by overfitting?
After all, if it were true general intelligence, performance should be relatively balanced across all games.
The Heart of the Matter: The Ultimate Test for AGI
Yesterday, upon its release, ARC-AGI-3 garnered immense attention and received endorsement from multiple AI luminaries including OpenAI, Google, and xAI.
Yesterday, when ARC-AGI-3 was officially released, Sam Altman even showed up in person to support it.
This new benchmark is widely recognized as the "Polaris" leading the way to AGI.
For a long time, the metrics of the AI industry have been locked within the framework of static benchmarks.
However, with the emergence of AI agents like OpenClaw representing "brute-force evolution," the industry urgently needs a scalpel to cut open the black box of "proactive intelligence": such as bottomless curiosity for exploration, millisecond-level perceptual decision-making, complex path planning, and near-intuitive goal alignment.
Competition Link: https://www.kaggle.com/competitions/arc-prize-2026-arc-agi-3/data
The exam posed by ARC-AGI-3 is pressing AI: In the face of completely unfamiliar rules, do you possess the instinct for abstraction and reasoning that humans have?
See the following link for the ARC AGI 3 Technical Report:
https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
Here, every game requires the agent to explore, understand, and solve. A perfect score (100%) means the AI agent can clear all games as efficiently as a human.
Currently, the best score is 0.25, which is equivalent to 25% of the human baseline.
The more significant meaning of ARC-AGI-3 is not just the release of a new AI test, nor a satisfying underdog story of a grassroots player defeating the AI giants; rather, it heralds a new mode of agency: agentic thinking.
Coincidentally, almost simultaneously with the release of ARC-AGI-3, Junyang Lin published a summary of the past two years, pointing out the same trend:
"Autonomous thinking (agentic thinking) will become the mainstream mode of thought."
"... Even when facing extremely difficult mathematical or programming tasks, a truly advanced (AI) system should have the right to search, simulate, execute, check, verify, and correct."
Essentially, agentic thinking is reasoning through action; it focuses on whether the model can sustain progress during its interaction with the environment.
He pointed out that the core issue of AI reasoning capability has shifted from "can the model think long enough" to "can the model think in a way that sustains effective action."
The underlying philosophy of ARC-AGI-3 and Junyang Lin's insights undoubtedly align.
This coincidence is likely the next direction for the industry.
References:
https://x.com/JustinLin610/status/2037116325210829168