Tackling ARC and Sudoku with 10M Parameters? Bengio's Team Bets on Multi-Trajectory Reasoning

In the era of large models, 10M parameters looks almost negligible.

But GRAM, proposed by Yoshua Bengio's team in collaboration with researchers from KAIST, Mila, and NYU, has achieved several noteworthy results with models of this scale.

On Sudoku-Extreme, accuracy reaches 97.0%. On ARC-AGI, a benchmark for few-shot pattern recognition and abstract visual reasoning, it achieves 52.0% on ARC-AGI-1 and 11.1% on ARC-AGI-2.

The paper also lists some large model results as a reference for task difficulty: DeepSeek-R1, Claude 3.7 16k, and o3-mini-high all scored 0.0% on Sudoku-Extreme.

However, the authors explicitly emphasize that these results are not controlled baselines under equal training and inference settings, and cannot be directly interpreted as a small model fairly defeating a large model.

GRAM surpasses recurrent models like HRM and TRM on Sudoku-Extreme and ARC-AGI-1/2; large model scores are only for task difficulty reference.

The key change behind these results is not recursive reasoning itself, but that GRAM turns deterministic recursive updates into a probabilistic, multi-trajectory computation.

Traditional recurrent architectures repeatedly update a hidden state through a shared transition function, extending internal computation without increasing parameter count.

Models such as HRM and TRM have shown the potential of this route, but most of them remain deterministic: the same input and initialization always produce the same latent trajectory, and there is no mechanism for exploring other candidate trajectories.

To address this, the team proposes GRAM, or Generative Recursive Reasoning Model, which turns a single deterministic recurrent trajectory into a probabilistic latent-variable process and allows multiple latent reasoning trajectories to be sampled in parallel at inference time.

Deterministic Recursion vs. GRAM Multi-Trajectory Comparison.

Paper Title: Generative Recursive Reasoning

Paper Link: http://arxiv.org/abs/2605.19376

Project Homepage: https://ahn-ml.github.io/gram-website/

How Recursive Updates Become Multi-Trajectory Sampling

At the core of GRAM is a redesigned hidden-state update. The model splits the hidden state into a high-level and a low-level component, z=(h,l), which handle computation at different time scales.

The low-level state l is responsible for fine-grained intermediate computation. Within one latent transition, it performs K consecutive deterministic updates while the high-level state is held fixed:

Mathematical formula for low-level state update

The high-level state carries the more abstract reasoning state and is updated once per transition. The model first generates a deterministic candidate state from the low-level computation:

Mathematical formula for candidate state generation

Then, Gaussian noise dependent on the current state is injected into the candidate state:

Mathematical formula for noise injection

The mean guides the direction of reasoning, and the variance controls how widely the model explores. The paper points out that randomness is added only to the high-level state h; the authors also tried injecting noise into the low-level state but observed no performance improvement.
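As a rough illustration of the transition described above, the following Python sketch runs K deterministic low-level updates with h held fixed, builds a deterministic candidate for h, and then adds Gaussian noise whose mean and standard deviation depend on the current state. The functions `f_low`, `g_high`, and `guidance`, their coefficients, and the way the noise parameters are computed are hypothetical stand-ins chosen for this example, not the paper's learned networks.

```python
import math
import random

def f_low(l, h, x):
    # Hypothetical low-level update: a fixed nonlinear mix of l, h and the input x.
    return [math.tanh(0.5 * li + 0.3 * hi + 0.2 * xi) for li, hi, xi in zip(l, h, x)]

def g_high(h, l):
    # Hypothetical deterministic candidate for the high-level state.
    return [math.tanh(0.6 * hi + 0.4 * li) for hi, li in zip(h, l)]

def guidance(h, l):
    # Hypothetical state-dependent Gaussian parameters (mean, std) per dimension.
    mean = [0.1 * (li - hi) for hi, li in zip(h, l)]
    std = [0.05 + 0.05 * abs(li) for li in l]
    return mean, std

def transition(h, l, x, K, rng):
    """One latent transition: K deterministic low-level updates,
    then one stochastic high-level update (noise is added to h only)."""
    for _ in range(K):
        l = f_low(l, h, x)  # h stays fixed during these K updates
    h_cand = g_high(h, l)
    mean, std = guidance(h, l)
    h_new = [c + m + s * rng.gauss(0.0, 1.0) for c, m, s in zip(h_cand, mean, std)]
    return h_new, l

def rollout(x, steps, K, seed):
    """Run several transitions from a zero initial state with a seeded generator."""
    rng = random.Random(seed)
    h = [0.0] * len(x)
    l = [0.0] * len(x)
    for _ in range(steps):
        h, l = transition(h, l, x, K, rng)
    return h

# Same input, three different seeds: three different latent trajectories.
x = [1.0, -0.5, 0.25]
trajectories = [rollout(x, steps=4, K=3, seed=s) for s in range(3)]
```

Because the noise is drawn from a seeded generator, each seed yields a reproducible trajectory, while different seeds give different final states for the same input. This is the property that a deterministic recursion lacks.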

During training, GRAM uses deep supervision with truncated gradient propagation and optimizes a truncated surrogate objective. Appendix experiments show that on Sudoku-Extreme and N-Queens, the full Evidence Lower Bound (ELBO) and the truncated surrogate objective follow broadly the same trend. The paper nonetheless states explicitly that the surrogate is a biased, memory-saving approximation, not an exact ELBO.

GRAM core architecture diagram, showing the single-step stochastic latent space transition process.

Ablation Study: Stochastic Guidance Is Indispensable

The ablation experiments show that stochasticity and the guidance signal have to work together.

The ablation study demonstrates the impact of deep supervision, hierarchical recursion, and stochastic guidance.

On N-Queens, the configurations that use only deep supervision and hierarchical recursion (HRM/TRM) reached 80.70% and 72.90%, respectively. Once stochastic guidance was introduced, the +DS+SG configuration (deep supervision plus stochastic guidance) reached 100.00%, and the full GRAM model reached 99.69%. The full GRAM model also reached 93.96% on Sudoku-Extreme, giving it the stronger overall performance across the two tasks.

A breakdown of the mechanism gives more direct evidence. If the guidance signal is removed (the mean is zeroed out and only random noise is kept), N-Queens accuracy drops to 50.27%; if stochasticity is removed entirely (the variance is zeroed out and only the guidance mean is kept), accuracy collapses to 0.0%. This indicates that GRAM's gains come not from random decoding or random initialization but from stochastic guidance learned under variational training, which turns random trajectories into a learnable, selectable reasoning resource.
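To make the two ablation switches concrete, here is a toy scalar sketch in Python. The update rule, the guidance term, and the constants are all hypothetical; the sketch does not reproduce the reported accuracies and only shows what each switch removes from the sampling process.

```python
import random

def sample_final_states(n, steps, use_mean, use_noise, seed=0):
    """Sample n scalar trajectories and return their final states."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n):
        h = 0.0
        for _ in range(steps):
            candidate = 0.9 * h                          # hypothetical deterministic candidate
            mean = 0.1 * (1.0 - h) if use_mean else 0.0  # hypothetical guidance pulling toward 1.0
            std = 0.2 if use_noise else 0.0              # zero std removes all stochasticity
            h = candidate + mean + std * rng.gauss(0.0, 1.0)
        finals.append(h)
    return finals

full = sample_final_states(5, 8, use_mean=True, use_noise=True)
no_guidance = sample_final_states(5, 8, use_mean=False, use_noise=True)
no_noise = sample_final_states(5, 8, use_mean=True, use_noise=False)
```

With the variance zeroed out, all five samples collapse onto a single trajectory, so there is nothing left for a selection step to choose between. With the mean zeroed out, the samples still differ, but nothing steers them.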

Comparison of latent space trajectories between TRM and GRAM.

Inference-Time Scaling and Multi-Solution Tasks

Beyond recursive depth, GRAM introduces a width dimension for inference-time computation scaling. Through a latent process reward model that predicts the likelihood of a candidate trajectory eventually producing the correct answer, the model can select the output with the highest predicted value among multiple sampled candidates, or use majority voting.
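The two selection rules mentioned above can be sketched in a few lines of Python. The candidate outputs and scores below are made-up values, and `best_of_n` and `majority_vote` are illustrative names; in the paper the scores would come from the latent process reward model, which is not reproduced here.

```python
from collections import Counter

def best_of_n(candidates, scores):
    """Return the candidate with the highest predicted score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and of equal length")
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]

def majority_vote(candidates):
    """Return the most frequent candidate; ties go to the one seen first."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return Counter(candidates).most_common(1)[0][0]

# Four sampled outputs (as hashable strings) with hypothetical reward-model scores.
candidates = ["1234", "1243", "1234", "4321"]
scores = [0.62, 0.91, 0.58, 0.10]

picked_by_score = best_of_n(candidates, scores)  # "1243"
picked_by_vote = majority_vote(candidates)       # "1234"
```

The two rules can disagree, as they do here: score-based selection trusts the reward model, while majority voting only needs the samples themselves.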

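In the inference-time scaling tests, GRAM needs only N=20 trajectories sampled in parallel, each run for 16 iterations, to reach 97.0% accuracy on Sudoku. That is 20 × 16 = 320 trajectory-iterations in total, spent in parallel across width rather than sequentially in depth. This result surpasses TRM's 90.5% accuracy after 320 iterations.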

Inference-time scaling and multi-solution task accuracy trends.

Multi-solution tasks demonstrate the value of this design more clearly. On N-Queens, GRAM achieves 99.7% accuracy and covers 90.3% of the distinct valid solutions. On graph coloring, GRAM reduces the number of conflicting edges to 2.7 with 8 nodes and 3.3 with 10 nodes, compared with 19.0 and 61.3, respectively, for the autoregressive generative model.

In additional experiments on ARC-AGI-1, the authors also examined how data augmentation and parallel sampling interact. Without external data augmentation, GRAM's performance improves as the number of samples increases; when data augmentation is strong, the marginal benefit of more samples tends to saturate. This suggests that data augmentation and inference-time sampling play complementary roles whose effects are not simply additive.
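The metrics in this paragraph are straightforward to compute. The Python sketch below counts conflicting edges for a graph coloring and measures what fraction of the known valid N-Queens solutions appear among a set of samples. The function names and the exact definition of coverage are assumptions made for illustration; the paper's evaluation code may differ.

```python
def count_conflicts(edges, coloring):
    """Number of edges whose two endpoints share a color."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])

def is_valid_queens(cols):
    """cols[r] is the queen's column in row r; True if no two queens attack."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False  # must be a permutation: exactly one queen per column
    return all(abs(cols[i] - cols[j]) != j - i
               for i in range(n) for j in range(i + 1, n))

def solution_coverage(samples, all_solutions):
    """Fraction of known valid solutions that appear at least once in samples."""
    if not all_solutions:
        raise ValueError("all_solutions must be non-empty")
    found = {tuple(s) for s in samples if is_valid_queens(s)}
    return len(found & set(all_solutions)) / len(all_solutions)

# 4-Queens has exactly two solutions.
four_queens = [(1, 3, 0, 2), (2, 0, 3, 1)]
samples = [(1, 3, 0, 2), (1, 3, 0, 2), (0, 1, 2, 3)]  # one solution twice, one invalid board
coverage = solution_coverage(samples, four_queens)     # 0.5

# A small graph where one edge, (2, 0), joins two nodes of the same color.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
conflicts = count_conflicts(edges, coloring=[0, 1, 0, 1])  # 1
```

Accuracy and coverage measure different things: a sampler that returns the same valid board every time is fully accurate but covers only one solution.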

Interaction between data augmentation and inference-time sampling.

From Conditional Reasoning to Unconditional Generation

GRAM is itself a latent-variable generative model. When the input is replaced with an empty conditional input, or fixed to a given condition, the same recursive process also defines an unconditional generative model. In unconditional Sudoku generation, the model starts from an empty board and generates a complete grid, whose validity is assessed against the standard Sudoku rules. GRAM, with 10.9M parameters and 16 supervision steps, achieves a validity rate of 99.05%. For comparison, the discrete diffusion model D3PM, with 55.1M parameters and 1000 denoising steps, achieves a validity rate of at most 91.33%. The generation phase involves no explicit constraint checker or search procedure: the model does not rely on external search to correct its results, but gradually forms a rule-compliant board during recursive generation.
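For reference, a validity check against the standard Sudoku rules can be written as below. This is a minimal sketch of how a validity rate over generated grids could be computed, used only for evaluation after generation; `is_valid_sudoku` and `validity_rate` are illustrative names, not the paper's evaluation code, and the example grids are constructed by hand rather than sampled from any model.

```python
def is_valid_sudoku(grid):
    """True if a filled 9x9 grid has digits 1-9 exactly once in every
    row, column, and 3x3 box."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(9)} for c in range(9)]
    boxes = [{grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return all(group == digits for group in rows + cols + boxes)

def validity_rate(grids):
    """Fraction of grids that satisfy the Sudoku rules."""
    if not grids:
        raise ValueError("grids must be non-empty")
    return sum(is_valid_sudoku(g) for g in grids) / len(grids)

# A valid grid built from a standard shifting pattern.
valid = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Copy it and duplicate a digit in the first row to break the rules.
broken = [row[:] for row in valid]
broken[0][0] = broken[0][1]

rate = validity_rate([valid, broken])  # 0.5
```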

Unconditional Sudoku generation example.

In binary MNIST image generation, GRAM starts from an initial generation state under an empty conditional input and progressively refines the image structure through recursive hidden-state updates. As the number of recursive steps increases from 8 to 256, the FID (Fréchet Inception Distance, lower is better) drops from 84.08 to 73.34, and the IS (Inception Score) improves as well.

MNIST image unconditional generation process.

Summary

The most noteworthy aspect of this paper is that it turns recursive reasoning from a single deterministic trajectory into a probabilistic process that can sample multiple candidate trajectories. At least on structured reasoning and multi-solution constraint tasks, this design yields better exploration and better constraint satisfaction. Width-based parallel sampling also means that inference-time computation no longer depends solely on the number of recursive steps. It should be stressed that GRAM has so far been validated mainly on controlled tasks such as Sudoku, ARC-AGI, N-Queens, graph coloring, and binarized MNIST. The paper also acknowledges that the sequential nature of deep-supervision training limits training efficiency, a constraint that cannot be sidestepped if GRAM is to be scaled up to larger foundation models.

AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.