Alexia Jolicoeur-Martineau
Less is More: Recursive Reasoning with Tiny Networks
https://arxiv.org/pdf/2510.04871
Abstract
The Hierarchical Reasoning Model (HRM) is an innovative method that uses two small neural networks that recurse at different frequencies. This biologically inspired approach outperforms large language models (LLMs) on challenging tasks such as Sudoku, mazes, and ARC-AGI, while using only a small model (27 million parameters) and a small amount of training data (about 1000 examples). HRM shows great promise for solving difficult problems with small networks, but its principles are not yet fully understood and may not be optimal. We propose the Tiny Recursion Model (TRM), a more concise method for recursive reasoning. Using only a single small network with just 2 layers, TRM generalizes significantly better than HRM. With only 7 million parameters, TRM achieves a test accuracy of 45% on ARC-AGI-1 and 8% on ARC-AGI-2, outperforming most large language models (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) while using less than 0.01% of their parameters.
1 Introduction
Despite the power of Large Language Models (LLMs), they may still face challenges in solving difficult reasoning problems. Since they generate answers autoregressively, there is a high risk of error, as a single wrong token can lead to an invalid answer. To improve their reliability, LLMs rely on Chain-of-Thought (CoT) (Wei et al., 2022) and Test-Time Computation (TTC) (Snell et al., 2024). CoT aims to mimic human reasoning by having the LLM sample a step-by-step reasoning trajectory before giving an answer. This can improve accuracy, but CoT is costly, requires high-quality reasoning data (which may not be available), and can be fragile because the generated reasoning may be incorrect. To further improve reliability, Test-Time Computation can be used, which reports the most common answer or the answer with the highest reward from K answers (Snell et al., 2024).
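The "most common answer from K answers" form of test-time computation can be sketched in a few lines; the function name and inputs here are ours, purely for illustration, not tied to any specific LLM stack:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common of K sampled answers.

    A minimal sketch of majority-vote test-time computation:
    sample K candidate answers to the same question, then
    report the mode as the final answer.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# e.g. five sampled answers to the same question
print(majority_vote(["42", "42", "17", "42", "17"]))  # -> 42
```

The reward-based variant replaces the frequency count with a learned scoring model over the K candidates, but the selection logic is the same.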
In this work, we show that the benefits of recursive reasoning can be greatly enhanced, with far more than incremental improvements. We propose the Tiny Recursion Model (TRM), an improved and simplified method that uses a much smaller network with only 2 layers, achieving significantly higher generalization than HRM on a variety of problems. With this method, we increased the accuracy on the Sudoku-Extreme test set from 55% to 87%, on the Maze-Hard test set from 75% to 85%, on ARC-AGI-1 from 40% to 45%, and on ARC-AGI-2 from 5% to 8%.
2. Background

The algorithmic description of HRM is detailed in Algorithm 2. We discuss the details of this algorithm below.
2.1. Structure and Objective

HRM is set in supervised learning: given an input, produce an output. Both the input and output are assumed to have shape [B, L] (when the shapes differ, padding tokens can be added), where B is the batch size and L is the context length.
2.2. Recursion at Two Different Frequencies
2.3. Fixed-Point Recursion with One-Step Gradient Approximation
2.7 HRM Summary

HRM utilizes recursion and deep supervision across two networks operating at different frequencies (high and low) to learn to improve its answer over multiple supervised steps (and uses ACT to reduce the processing time per data sample). This allows the model to mimic a very deep network without backpropagating through all layers. The method achieves significantly higher performance on difficult reasoning tasks that are challenging for conventional supervised models. However, the method is quite complex, relying heavily on uncertain biological arguments and a fixed-point theorem that is not guaranteed to apply. In the next section, we will discuss these issues and the potential improvement goals for HRM.
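The two-frequency recursion at the heart of HRM can be sketched as follows. This is an illustrative skeleton, not HRM's actual code: `f_L` and `f_H` stand in for the low-level (fast) and high-level (slow) networks, and the loop counts are placeholders:

```python
def hrm_forward(x, z_L, z_H, f_L, f_H, N=2, T=3):
    """Illustrative sketch of HRM-style recursion at two frequencies.

    f_L (low-level, fast) updates z_L at every step; f_H (high-level,
    slow) updates z_H only once per T low-level steps, for N * T
    low-level function evaluations in total.
    """
    for _ in range(N):          # slow, high-level cycles
        for _ in range(T):      # fast, low-level steps
            z_L = f_L(z_L, z_H, x)
        z_H = f_H(z_H, z_L)
    return z_L, z_H
```

In HRM the states `z_L` and `z_H` are token-wise latent tensors and `f_L`, `f_H` are transformer blocks; the structure above is what "recursion at two different frequencies" means operationally.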
3. Improvement Goals for the Hierarchical Reasoning Model

In this section, we identify the key improvement goals for HRM, which are addressed by our proposed method, the Tiny Recursion Model (TRM).
3.1. Implicit Function Theorem and One-Step Gradient Approximation
HRM only backpropagates through the last 2 out of 6 recursions. The authors justify this by applying the Implicit Function Theorem and a one-step approximation, which states that when a recursive function converges to a fixed point, backpropagation can be performed at that equilibrium point with a single step.
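In our notation, the justification can be stated compactly: if the recursion converges to a fixed point, implicit differentiation gives the exact gradient, and the one-step approximation drops the matrix-inverse term.

```latex
% At a fixed point z^* = f_\theta(z^*, x), the Implicit Function
% Theorem gives the exact gradient of z^* with respect to \theta:
\frac{\partial z^*}{\partial \theta}
  = \left( I - \left.\frac{\partial f_\theta}{\partial z}\right|_{z^*} \right)^{-1}
    \left.\frac{\partial f_\theta}{\partial \theta}\right|_{z^*}
% The one-step gradient approximation replaces the inverse by I,
% i.e. it backpropagates through a single application of f:
\frac{\partial z^*}{\partial \theta}
  \approx \left.\frac{\partial f_\theta}{\partial \theta}\right|_{z^*}
```

Both steps require the recursion to actually sit at (or very near) the fixed point $z^*$, which is precisely the assumption questioned below.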
Therefore, although there is some basis for applying the Implicit Function Theorem and one-step gradient approximation to HRM, as residuals tend to decrease over time, the model is likely not at a fixed point when the theorem is applied in practice.
In the next section, we will show that we can bypass the need for the Implicit Function Theorem and one-step gradient approximation, thus completely avoiding this issue.
3.2. Adaptive Computation Time (ACT) Doubles Forward Passes
HRM uses Adaptive Computation Time (ACT) during training to optimize the time spent per data sample. Without ACT, each data sample would require the full, fixed number of deep supervision steps, even when the model has already reached a correct answer.
However, ACT comes at a cost. This cost is not shown directly in the HRM paper but is evident in their official code. The Q-learning objective relies on a halt ("stop") loss and a continue loss, and the continue loss requires an additional forward pass of HRM (including all 6 function evaluations). So while ACT spends less time per sample, each optimization step requires 2 forward passes. The exact formulation is shown in Algorithm 2.
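Why the continue loss doubles the cost can be seen in a stripped-down sketch. All names here are illustrative stand-ins (not HRM's API): `forward` runs one full set of recursions, and `q_head` produces halt/continue Q-values:

```python
def act_step(forward, q_head, state):
    """Illustrative sketch of ACT's doubled training cost.

    The halt loss needs Q-values at the current state, but the
    continue loss is a bootstrapped Q-learning target: it needs
    Q-values at the *next* state, which requires a second full
    forward pass that inference never performs.
    """
    state1 = forward(state)               # forward pass 1: this step
    q_halt, q_cont = q_head(state1)

    next_state = forward(state1)          # forward pass 2: only to
    q_halt_next, q_cont_next = q_head(next_state)  # build the target
    target_cont = max(q_halt_next, q_cont_next)

    return q_halt, q_cont, target_cont
```

Inference only ever runs the first pass and halts when the halt Q-value wins, so the extra pass is pure training overhead.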
In the next section, we will show how to avoid the need for two forward passes in ACT.
3.3. Hierarchical Interpretation Based on Complex Biological Arguments
The authors of HRM justify the design of two latent variables and two networks operating at different levels based on biological arguments, but these arguments are far removed from artificial neural networks. They even attempt to link HRM to actual brain experiments on mice. While this is interesting, this interpretation makes it extremely difficult to understand why HRM is designed the way it is. Given the lack of ablation study tables in their paper and the heavy reliance on biological arguments and a fixed-point theorem (which does not fully apply), it is hard to determine which parts of HRM are responsible for what and why. Additionally, it is unclear why they use two latent features instead of other feature combinations.
In the next section, we will show that the recursive process can be greatly simplified and understood in a much simpler way that requires no biological arguments, fixed-point theorem, hierarchical interpretation, or second network. This also explains why two features is the optimal number.
4. Tiny Recursion Model
In this section, we introduce the Tiny Recursion Model (TRM). In contrast to HRM, TRM requires no complex mathematical theorems, hierarchical structures, or biological arguments. It generalizes better while requiring only a single tiny network (instead of two medium-sized networks), and its ACT requires only a single forward pass (instead of 2). Our method is described in Algorithm 3 and illustrated in Figure 1. We also conducted ablation experiments on the Sudoku-Extreme dataset (a difficult Sudoku dataset with only 1K training samples but 423K test samples), as shown in Table 1. The key components of TRM are explained below.
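The recursion structure of TRM can be sketched as follows. This is a hedged skeleton of the idea, not the paper's exact code: one tiny network `net` plays both roles, refining the latent reasoning feature `z` from the input, current answer, and latent, then refining the answer `y` from the answer and latent:

```python
def trm_forward(net, x, y, z, n=6, T=3):
    """Illustrative sketch of TRM's single-network recursion.

    A single tiny network refines the latent z for n inner steps
    (conditioning on input x, answer y, and z), then updates the
    answer y from (y, z). Loop counts are placeholders; the paper
    also only backpropagates through the final outer cycle.
    """
    for _ in range(T):
        for _ in range(n):
            z = net(x, y, z)   # refine the latent reasoning feature
        y = net(None, y, z)    # refine the answer (no input x here)
    return y, z
```

Passing `None` for `x` in the answer update is just this sketch's way of marking that the answer refinement does not condition on the input; the real model distinguishes the two calls architecturally.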
4.1. No Need for a Fixed-Point Theorem
Although this interpretation is intuitive, we still wanted to verify whether using more or fewer features would be helpful. The results are shown in Table 2.
Single Feature: Similarly, we tested the idea of using only one feature, that is, passing a single recurrent state through the recursion rather than keeping the current answer and the latent reasoning feature separate.
Therefore, we explored using more or fewer latent variables on the Sudoku-Extreme task, but found that using only two (the current answer and a single latent reasoning feature) generalized best.
4.4. Less is More
We attempted to scale the model by increasing the number of layers to improve capacity. Surprisingly, we found that increasing the number of layers reduced generalization due to overfitting. Conversely, shrinking the network while proportionally increasing the number of recursions (keeping the emulated depth roughly constant) improved generalization.
Smaller networks performing better is quite surprising, but 2 layers seem to be the optimal choice. Bai & Melas-Kyriazi (2024) also observed optimal performance with 2-layer networks in the context of depth-balanced diffusion models; however, their 2-layer networks merely matched larger ones, whereas we observe that they perform better. This may seem unusual, because for modern neural networks generalization tends to scale with model size. However, when data is too scarce and the model too large, overfitting is penalized (Kaplan et al., 2020). Using tiny networks with deep recursion and deep supervision therefore seems to let us avoid many overfitting problems.
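The "less is more" trade-off is easiest to see as depth arithmetic. The values below are illustrative placeholders, not the paper's exact hyperparameters: a 2-layer network recursed many times, under many supervision steps, emulates a network hundreds of layers deep while storing only a 2-layer parameter set:

```python
# Rough effective-depth arithmetic for recursive reasoning
# (all values are illustrative assumptions).
layers = 2      # layers in the tiny network
n = 6           # latent-update recursions per cycle
T = 3           # cycles per deep-supervision step
n_sup = 16      # deep-supervision steps

depth_per_step = T * (n + 1) * layers   # +1 for the answer update
print(depth_per_step)                   # emulated layers per step
print(depth_per_step * n_sup)           # emulated layers overall
```

Under these assumed values, each supervision step emulates 42 layers and the full procedure emulates 672, with the parameter count of a 2-layer network.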
4.5. Attention-Free Architecture for Fixed Small Context Length Tasks
4.7. Exponential Moving Average
On small datasets like Sudoku-Extreme and Maze-Hard, HRM tends to overfit quickly and then diverge. To reduce this problem and improve stability, we adopted Exponential Moving Average (EMA) of weights, a common technique in GANs and diffusion models for improving stability (Brock et al., 2018; Song & Ermon, 2020). We found that it prevents sharp crashes and leads to higher generalization (from 79.9% to 87.4%; see Table 1).
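EMA of weights maintains a slowly moving shadow copy of the parameters that is used for evaluation; a minimal sketch (treating parameters as a flat list of floats for clarity):

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: the shadow copy moves a small fraction
    (1 - decay) toward the current training weights, smoothing
    out sharp jumps in the optimization trajectory."""
    return [decay * e + (1 - decay) * w
            for e, w in zip(ema_weights, weights)]

# e.g. the shadow drifts slowly toward new weights
ema = ema_update([1.0], [0.0], decay=0.9)
print(ema)  # [0.9]
```

In practice the update runs once per optimizer step over all parameter tensors, and evaluation uses the shadow weights rather than the raw training weights.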
4.8. Optimizing the Number of Recursions
In the next section, we will show the main results comparing HRM, TRM, and LLMs across multiple datasets.
5. Results
Following Wang et al. (2025), we tested our method on the following datasets: Sudoku-Extreme (Wang et al., 2025), Maze-Hard (Wang et al., 2025), ARC-AGI-1 (Chollet, 2019), and ARC-AGI-2 (Chollet et al., 2025). The results are shown in Tables 4 and 5. The hyperparameters are detailed in Section 6. The datasets are discussed below.
Sudoku-Extreme contains extremely difficult 9x9 Sudoku puzzles (Dillion, 2025; Palm et al., 2018; Park, 2018), with only 1K training samples (to test learning from little data) and 423K test samples.
Maze-Hard contains 30x30 mazes procedurally generated by Lehnert et al. (2024) with shortest path length exceeding 110; the training and test sets each contain 1000 mazes.
ARC-AGI-1 and ARC-AGI-2 are geometric puzzle benchmarks (each with an associated prize competition). Each puzzle is designed to be easy for humans but difficult for current AI models. Each task contains 2-3 input-output demonstration pairs and 1-2 test inputs to solve. The final score is the accuracy of producing the correct output grid on all test inputs within two attempts. The maximum grid size is 30x30. ARC-AGI-1 contains 800 tasks, while ARC-AGI-2 contains 1120 tasks. We also used 160 tasks from the closely related ConceptARC dataset (Moskvichev et al., 2023) to augment our data. We report results for ARC-AGI-1 and ARC-AGI-2 on the public evaluation sets.
Although these datasets are small, extensive data augmentation is used to improve generalization. Sudoku-Extreme uses 1000 shuffling augmentations per data sample (without violating Sudoku rules). Maze-Hard uses 8 dihedral transformations per data sample. ARC-AGI uses 1000 data augmentations per data sample (color permutation, dihedral group transformations, and translation transformations). Dihedral group transformations include random 90-degree rotations, horizontal/vertical flips, and reflections.
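The dihedral-group part of the augmentation (the 8 symmetries of a square: 4 rotations, each with an optional flip) is simple to implement; a self-contained sketch on grids represented as lists of lists:

```python
def rot90(grid):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def dihedral_transforms(grid):
    """All 8 dihedral transforms of a grid (4 rotations, each with
    and without a horizontal flip), as used to augment maze and
    ARC-AGI samples."""
    out = []
    g = grid
    for _ in range(4):
        out.append(g)
        out.append([row[::-1] for row in g])  # horizontal flip
        g = rot90(g)
    return out

grids = dihedral_transforms([[1, 2], [3, 4]])
print(len(grids))  # 8 augmented views per sample
```

Color permutations (for ARC-AGI) and digit-relabeling shuffles (for Sudoku) compose with these transforms, which is how a 1K-sample dataset becomes roughly a million effective training examples.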
From the results, it can be seen that TRM without self-attention achieved the best generalization on Sudoku-Extreme (87.4% test accuracy), while TRM with self-attention generalized better on the other tasks (possibly due to its inductive bias and the tendency of MLPs to overfit on large 30x30 grids). TRM with self-attention achieved 85.3% accuracy on Maze-Hard, 44.6% on ARC-AGI-1, and 7.8% on ARC-AGI-2, with 7M parameters. This is significantly higher than HRM, which achieved 74.5%, 40.3%, and 5.0% with 4 times as many parameters (27M).
6. Conclusion
We propose the Tiny Recursion Model (TRM), a simple recursive reasoning method that achieves strong generalization on difficult tasks by recursing on latent reasoning features and using a single tiny network to gradually improve the final answer. Compared to the Hierarchical Reasoning Model (HRM), TRM does not require a fixed-point theorem, complex biological arguments, or hierarchical structures. It significantly reduces the number of parameters by halving the number of layers and replacing two networks with a single tiny network. It also simplifies the stopping process, eliminating the need for an additional forward pass. Overall, TRM is much simpler than HRM while achieving better generalization.
Although our method yields better generalization on 4 benchmarks, none of our choices is guaranteed to be optimal on all datasets. For example, we found that replacing self-attention with an MLP works extremely well on Sudoku-Extreme (test accuracy improved by 10%) but poorly on other datasets. Different problem settings may require different architectures or parameter counts, and scaling laws are needed to optimize the parameterization of these networks. While we have simplified and improved deep recursion, why recursion helps more than larger, deeper networks remains to be explained; we suspect it is related to overfitting, but we have no theory to support this. Not all of our ideas succeeded; we briefly discuss some that did not in Section 6.
Currently, recursive reasoning models like HRM and TRM are supervised learning methods, not generative models. This means that given an input question, they can only provide a single deterministic answer. However, in many cases, a question may have multiple answers. Therefore, extending TRM to generative tasks would be a meaningful research direction.