LLM agents are increasingly used as executors that "get one complex thing done," from Claude Code to Codex, where multi-step debugging and repeated decision-making are the norm. Yet the paper reports a counter-intuitive finding: give the same agent a task with "identical reasoning complexity, just more steps," and its training can collapse outright. The paper isolates the number of task steps as the variable under study and reaches a systematic empirical conclusion: horizon length itself is the fundamental bottleneck in long-horizon agent training.
[Figure 1: Overview of the paper's contributions] The paper studies long-horizon LLM agent training from a horizon perspective, pointing out that horizon length is the fundamental bottleneck and demonstrating that horizon reduction can stabilize RL while improving the model's generalization to longer tasks.
What Exactly Makes Long Tasks So Hard
Existing research on long-horizon agents mostly follows two paths: either doing context engineering and workflow orchestration at the system level, or performing SFT and RL at the model level. But the paper points out that these works are largely incremental extensions of the single-turn paradigm, ignoring the impact of horizon length as an independent variable on training dynamics.
The paper first breaks down the fuzzy word "horizon" into three formal definitions: (1) Goal Distance d(s₀, g), which is the minimum number of atomic actions required to reach the goal under an optimal policy; (2) Interaction Budget H_max, the maximum interaction steps allowed by the environment; and (3) Effective Horizon h_π(s₀, g), the actual number of steps policy π takes to complete the task.
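To make the three definitions concrete, here is a minimal sketch on a toy ring of 8 states. The environment, the function names, and the always-forward policy are illustrative assumptions, not the paper's setup. Goal distance is computed by breadth-first search, and the effective horizon is whatever the policy actually needs, capped by the interaction budget.

```python
from collections import deque

def goal_distance(start, is_goal, neighbors):
    """d(s0, g): minimum number of atomic actions to reach the goal (BFS)."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        state, dist = frontier.popleft()
        if is_goal(state):
            return dist
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # goal unreachable

def effective_horizon(policy, step, start, is_goal, h_max):
    """h_pi(s0, g): steps the policy actually takes, or None if H_max runs out."""
    state, steps = start, 0
    while not is_goal(state):
        if steps == h_max:
            return None
        state = step(state, policy(state))
        steps += 1
    return steps

# Hypothetical toy environment: 8 states on a ring, goal at state 6.
RING = 8

def ring_neighbors(s):
    return [(s + 1) % RING, (s - 1) % RING]

def ring_step(s, action):
    return (s + action) % RING

def at_goal(s):
    return s == 6

def always_forward(s):
    return +1  # a suboptimal policy: the short way round is backwards

d = goal_distance(0, at_goal, ring_neighbors)
h = effective_horizon(always_forward, ring_step, 0, at_goal, h_max=10)
print(d, h)  # prints "2 6"
```

The same task has d = 2 but h_π = 6 under this policy, and with a budget of H_max = 4 the policy would not finish at all.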
Why do tasks become harder as the number of steps increases? The paper gives two reasons. First, the complexity of the state-action mapping grows non-linearly with the horizon: early decisions tightly constrain the states reachable later, and the probability of staying on the optimal trajectory decays exponentially. Second, credit assignment under sparse rewards becomes highly ambiguous: when an entire trajectory fails, every intermediate step (including those that were locally correct) receives a negative advantage, which amplifies gradient noise.
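A rough back-of-the-envelope sketch of both effects follows. It assumes a constant, independent per-step probability of picking the optimal action and a single 0/1 terminal reward with a fixed baseline. Both are simplifications introduced here for illustration, not the paper's model.

```python
def on_path_probability(per_step_prob, goal_distance):
    """Chance of following the optimal trajectory for all d steps,
    assuming each step is independently correct with the same probability."""
    return per_step_prob ** goal_distance

def sparse_reward_advantages(num_steps, succeeded, baseline=0.5):
    """With one terminal reward, every step inherits the same advantage,
    so locally correct steps in a failed trajectory are penalized too."""
    advantage = (1.0 if succeeded else 0.0) - baseline
    return [advantage] * num_steps

# Even at 95% per-step accuracy, the on-path probability shrinks quickly.
for d in (11, 21, 31, 41):
    print(d, round(on_path_probability(0.95, d), 3))

# A failed 5-step trajectory: all five steps get the same negative advantage.
print(sparse_reward_advantages(5, succeeded=False))
```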
Stripping Horizon Away from "Problem Difficulty"
In long-horizon tasks, the number of steps is usually tightly coupled with reasoning difficulty: a Sudoku puzzle with more empty cells not only requires more steps but also demands more advanced solving techniques. The paper's contribution is to decouple step count from difficulty and study it in isolation.
The specific approach is to convert tasks into a short-horizon version using a "one-step proxy" (for example, having the model generate an entire Sudoku solution at once), retain only those instances the model can solve in this short form, and then group them into seven tiers, L1–L7, based on goal distance. In the resulting dataset, the "problem-solving capability requirements" across different tiers are aligned, leaving the number of steps as the primary difference.
[Table 1: Dataset tier statistics] Tasks are grouped into L1–L7 by d(s₀, g): L1–L2 (11–15, 16–20) and L3–L4 (21–25, 26–30) are used for training, with 640 training samples and 100 test samples per tier; L5–L7 (31–35, 36–40, 41–45) are only used for horizon generalization evaluation, with the first two tiers having 100 samples each and L7 having 50.
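The filter-then-tier procedure can be sketched as follows. The tier boundaries (11–15 up to 41–45) come from Table 1, while `distance_of` and `solved_by_one_step_proxy` are hypothetical callables standing in for the paper's goal-distance computation and one-step proxy check.

```python
def assign_tier(goal_distance, first=11, width=5, num_tiers=7):
    """Map a goal distance to a tier label L1..L7, or None if out of range."""
    if goal_distance < first:
        return None
    index = (goal_distance - first) // width
    return f"L{index + 1}" if index < num_tiers else None

def build_tiers(instances, distance_of, solved_by_one_step_proxy):
    """Keep only instances the model solves in one-step form, then group by tier."""
    tiers = {}
    for inst in instances:
        if not solved_by_one_step_proxy(inst):
            continue  # capability requirement not met: drop the instance
        tier = assign_tier(distance_of(inst))
        if tier is not None:
            tiers.setdefault(tier, []).append(inst)
    return tiers

# Toy usage: instances are just their goal distances, and the stand-in
# proxy check accepts odd distances only.
toy = build_tiers(range(9, 48), lambda d: d, lambda d: d % 2 == 1)
print(sorted(toy))
```

Because every retained instance already passes the one-step check, what differs between tiers in this sketch is the distance bucket rather than whether the model can solve the puzzle.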
The paper uses text-based puzzles as its evaluation environment, primarily Sudoku, with Rush Hour as a second environment to cross-check the findings. Sudoku difficulty is classified using the HoDoKu tool, retaining only puzzles solvable with "basic techniques" so that differences between tiers stem from the horizon rather than from solving difficulty.
Short Horizons Are Rock Solid, Long Horizons Collapse
The experimental base model is Qwen3-1.7B. The paper first collects SFT trajectories using large models such as GPT-5-mini (Sudoku trajectories are also distilled by GPT-5-mini into more compact chain-of-thought (CoT) traces). Then, 4 epochs of RL are run on top of the SFT model with a sampling temperature of 0.8. The RL algorithm is plain REINFORCE, assisted by Masked IS (based on the geometric mean ratio) and Truncated IS (based on the sequence-level ratio) to handle off-policy drift between training and inference. Rewards are split into a trajectory-level discounted return and step-level format/validity penalties, which are batch-normalized separately and then combined with a weight of α=0.2.
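One way the two reward streams might be combined is sketched below. The summary does not say which term α multiplies or how normalization is implemented, so the choice here (α scales the step-level term, population standard deviation, a small epsilon) is an assumption for illustration only.

```python
import statistics

def batch_normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization over one batch."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def combined_advantages(discounted_returns, step_penalties, alpha=0.2):
    """Normalize each stream separately, then mix them.
    ASSUMPTION: alpha weights the step-level format/validity term."""
    norm_returns = batch_normalize(discounted_returns)
    norm_penalties = batch_normalize(step_penalties)
    return [r + alpha * p for r, p in zip(norm_returns, norm_penalties)]

# Toy batch of four steps: two from a successful trajectory, two from a
# failed one, with a single invalid-move penalty on the second step.
advantages = combined_advantages([1.0, 1.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0])
print([round(a, 3) for a in advantages])
```

Normalizing the streams separately keeps a rare but large penalty from swamping the return signal before the two are mixed.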
[Figure 2: Training dynamics under different goal distances] On short goal distances (L1–L2), RL training converges stably; as goal distance increases to L3–L4, training shows severe instability and often ends with a performance crash, accompanied by a sharp rise in the "max-length response ratio."
The paper offers a mechanistic interpretation: a gradient update from a negative advantage is essentially a divergent signal. It suppresses the probability of the sampled token but redistributes that probability mass across the thousands of other tokens in the vocabulary. In an LLM with |V| ≈ 10⁵, this amounts to indiscriminately boosting many irrelevant tokens, which amplifies optimization variance. The paper identifies this as one of the root causes of the observed collapse in long-horizon tasks.
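The effect is easy to see in a simplified model where the policy is a bare softmax over logits and REINFORCE updates the logits directly. This is an assumption made here for illustration, since a real LLM updates shared weights rather than logits. The gradient of log p(y) with respect to the logits is onehot(y) − p, so a negative advantage lowers the sampled token's logit and raises every other logit (in this simplified model, in proportion to each token's current probability).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_logit_step(logits, sampled, advantage, lr=1.0):
    """One REINFORCE ascent step on the logits:
    z <- z + lr * advantage * (onehot(sampled) - softmax(z))."""
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if i == sampled else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [2.0, 1.0, 0.0, 0.0]
updated = reinforce_logit_step(logits, sampled=0, advantage=-1.0)

# The sampled token is pushed down; every other token is pushed up,
# whether or not it is a sensible alternative.
print([round(after - before, 3) for before, after in zip(logits, updated)])
```

A positive advantage, by contrast, concentrates mass on one token, which is why the negative case is the noisier of the two.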
Horizon Reduction: A Simple but Effective Principle
The paper's core proposition is direct: Instead of having an agent learn an unlearnable long dependency, it's better to simply compress the effective horizon.
The first approach is Macro Actions, which allow the policy to output multiple atomic actions in a single step. In Sudoku, this means allowing one step to fill multiple cells; in Rush Hour, it allows operations like move(id, direction, N) that move a car multiple spaces at once. This structurally shortens h_π(s₀, g) for the same task. The second approach is Subgoal Decomposition, which breaks the global goal g into (g₁, g₂, …, gₖ) and computes returns independently for each segment. The paper validated this on Sudoku using "subgrid completion" as a verifiable subgoal.
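As a small illustration of how macro actions shorten the effective horizon, the sketch below merges consecutive identical atomic moves into `move(id, direction, N)`-style tuples. The tuple encoding and the helper names are assumptions for this example, not the paper's interface.

```python
from itertools import groupby

def compress_to_macros(atomic_moves):
    """Merge runs of identical (car_id, direction) moves into (car_id, direction, N)."""
    return [
        (car_id, direction, len(list(run)))
        for (car_id, direction), run in groupby(atomic_moves)
    ]

def expand_macros(macro_moves):
    """Inverse mapping: unroll each macro into N single-space moves."""
    return [
        (car_id, direction)
        for car_id, direction, n in macro_moves
        for _ in range(n)
    ]

atomic_plan = [("A", "right")] * 3 + [("B", "down")] * 2 + [("A", "right")]
macro_plan = compress_to_macros(atomic_plan)

print(macro_plan)                         # [('A', 'right', 3), ('B', 'down', 2), ('A', 'right', 1)]
print(len(atomic_plan), len(macro_plan))  # 6 3
```

The board ends up in the same state either way, but the policy makes 3 decisions instead of 6, so there are fewer points at which a trajectory can go wrong and fewer steps to assign credit over.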
[Figure 3: Effect of horizon reduction on Sudoku and Rush Hour] In both environments, using macro actions leads to more stable and effective RL. Particularly in settings with longer goal distances, the atomic action policy collapses while the macro action policy maintains steady growth.
A natural question arises: does the benefit of macro actions come from a "stronger starting policy" or from the "shortened effective horizon" itself? The paper conducts a clean ablation—using the same macro-action policy, but constraining the environment to execute only one atomic action per step. Thus, the policy representation remains unchanged while the horizon is artificially lengthened.
[Figure 4: RL stability depends on the effective horizon] Retaining the macro-action policy representation but forcing it into a single-step execution mode causes performance to rise first and then crash. The true horizon-reduced setup, though it climbs more slowly, converges stably to high performance. This indicates that the effective horizon, rather than the policy representation, is the primary factor in determining training stability.
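The ablation's logic can be sketched by counting decision steps. The function below is a hypothetical stand-in for the environment wrapper: the policy always proposes several cells per step, and a flag controls whether the environment executes all of them or only the first. It assumes every proposed fill is correct, which is a simplification made here.

```python
def decision_steps(num_empty_cells, cells_per_proposal, execute_all):
    """Number of policy steps needed to fill the board.
    execute_all=True  -> true macro-action execution (horizon reduced)
    execute_all=False -> ablation: only one atomic action is executed per step"""
    remaining, steps = num_empty_cells, 0
    while remaining > 0:
        proposed = min(cells_per_proposal, remaining)
        remaining -= proposed if execute_all else 1
        steps += 1
    return steps

print(decision_steps(30, 5, execute_all=True))   # 6
print(decision_steps(30, 5, execute_all=False))  # 30
```

The policy's output format is identical in both runs, so any difference in training behavior can be attributed to the number of steps rather than to the representation.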
For subgoal decomposition, the paper tests in the L3–L4 range, where the sparse-reward baseline has already failed, by computing Gₜ independently for each completed subgrid segment. The results are starkly contrasting: sparse reward shows almost no progress, while the subgoal-guided policy rises stably and achieves strong performance.
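Computing returns independently per segment can be sketched as a backward pass that resets at every subgoal boundary. The reward values, the discount factor, and the `segment_ends` encoding below are assumptions for illustration, since the summary gives only the idea and not the exact formulation.

```python
def segment_returns(rewards, segment_ends, gamma=0.5):
    """Discounted return G_t computed within each subgoal segment.
    segment_ends holds the step indices at which a subgoal is completed;
    returns from later segments do not flow back across those boundaries."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if t in segment_ends:
            running = 0.0  # cut the dependency on everything after this subgoal
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# The first subgrid is completed at step 2, then the trajectory fails.
rewards = [0.0, 0.0, 1.0, 0.0, 0.0]
print(segment_returns(rewards, segment_ends={2}))  # [0.25, 0.5, 1.0, 0.0, 0.0]
```

The steps that completed the first subgrid keep their positive return even though the overall trajectory failed, which is the localized credit assignment that a single sparse terminal reward cannot provide.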
Holds True Across Environments, Model Sizes, and Optimizers
Could this be a puzzle-specific problem? Is 1.7B parameters too small? Is REINFORCE to blame? The paper performs three sets of verification.
[Figure 7: Robustness across environments, model scales, and optimizers] On WebShop, horizon reduction improves both training stability and average success rate. On Sudoku L3–L4 with a 4B model, the atomic-action policy still collapses, while horizon reduction leads to stable improvement. Switching to a group-normalized optimizer in the style of GRPO (Group Relative Policy Optimization) produces the same "rise then fall" pattern, which horizon reduction again resolves.
In other words, the horizon bottleneck persists across environments, model scales, and optimizers, and horizon reduction mitigates it in all three settings.
An Unexpected Bonus: Horizon Generalization
The paper also finds an interesting phenomenon: policies trained on a limited range of goal distances can generalize to longer horizons never seen during training. On Sudoku, models trained on L3–L4 still achieve a notable success rate on the longer L5–L7 tasks, and their performance gap over the baseline widens as the goal distance increases. The paper calls this horizon generalization.
[Figure 8: Horizon generalization] On both Sudoku and Rush Hour, policies trained with a limited goal distance can effectively generalize to unseen, longer horizons. Meanwhile, the macro-action policy trained via horizon reduction has higher per-step accuracy and fewer decision points, making it more error-resistant on long horizons.
The practical implication for training is a lower-cost curriculum: first establish stable competence on a short horizon, then bootstrap to longer tasks. On Rush Hour, training directly on 10 ≤ d ≤ 12 yields almost no gain, whereas a curriculum of "first 4 ≤ d ≤ 9, then 10 ≤ d ≤ 12" significantly outperforms direct training.
Insights for Long-Horizon Agent Design
The paper extends its observations to broader agent design paradigms. The effectiveness of code-based agents lies in how they use programs with loops and conditionals to compress a long series of tool calls into a single step of execution, implicitly performing horizon reduction. GUI agents that use high-level API calls instead of numerous low-level clicks are essentially doing the same thing. Subgoal decomposition aligns with the ideas behind hierarchical RL, compressing a long-horizon problem into a sequence of short-horizon subproblems to localize credit assignment.
Horizon-aware design of the environment and action space should take priority over complex RL algorithms and domain-specific methods. The paper's conclusion is clear: managing the effective horizon is a prerequisite for scalable long-horizon agent learning, not an optional add-on.
Original Title: On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
Original Link: https://arxiv.org/abs/2605.02572