Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies in LLM Reinforcement Learning

Reinforcement Learning (RL) has become a core post-training tool for enhancing the reasoning abilities of large language models (LLMs). In RL post-training systems, the rollout—a sampled trajectory from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions—determines the data on which the optimizer learns. However, rollout design is often underappreciated and treated as an implementation detail. This survey provides an optimizer-agnostic perspective on rollout strategies for RL-based reasoning LLM post-training. We formalize a unified symbolic rollout pipeline and introduce the GFCR (Generate-Filter-Control-Replay) lifecycle taxonomy. This organizes the rollout pipeline into four modular stages: Generate proposes candidate trajectories; Filter constructs intermediate signals via verifiers, judges, or critics; Control allocates computational resources and makes continue/branch/stop decisions under a budget; and Replay preserves and reuses artifacts across rollouts without updating weights. We supplement this with three criteria—Reliability, Coverage & Informativeness, and Cost Sensitivity—to describe the trade-offs that rollout designs must balance. Through case studies in mathematics, code/SQL, multimodal reasoning, tool-use agents, and agentic skill benchmarks, we validate the framework's effectiveness.

1 Introduction

As the role of reinforcement learning in LLM post-training grows, the importance of rollout strategies becomes increasingly prominent. A rollout is a sampled trajectory from a prompt to termination. In a text-only setting, it manifests as a complete output containing intermediate reasoning and a final answer. In tool or environment interaction settings, it includes action-observation loops and external feedback. Rollout design often dominates training costs and learning signal quality, yet existing literature focuses mostly on optimization algorithms and reward modeling, while the specifics of rollout strategy are often underestimated or hidden.

The paper's four figures are as follows:

  • Figure 1
    Provides a comprehensive overview of the rollout lifecycle and the GFCR decomposition. It shows that the rollout pipeline can be understood as modular choices regarding how trajectories are proposed, how intermediate signals are constructed, how computation is allocated, and which artifacts are preserved and reused across rollouts.
  • Figure 2
    Presents GFCR as a complete, end-to-end rollout system. Given a prompt and a computational budget, Generate samples a batch of rollouts; Filter builds intermediate signals and training supervision for each rollout; Control decides on continuation/pruning/resampling and what to store based on cost and signals; Replay retrieves/stores artifacts to condition future generation. The objective is to maximize expected utility under computational constraints.
  • Figure 3
    Shows the rollout criteria taxonomy, organized into three intersecting dimensions: Reliability (achieving trustworthy signals through verifiers or robust judges), Coverage & Informativeness (diverse candidates and disagreement/uncertainty), and Cost Sensitivity (achieving value under a budget through value-to-cost ratios and early stopping).
  • Figure 4
    Depicts the design space of the Generate module. Rollout proposal mechanisms are organized along three axes: Topology & Interaction (linear, group, tree/graph, multi-turn/tool rollouts); Guidance & Scaffolding (examples/rules, planning, reflection, adaptive guidance, tool augmentation); and Sampling & Exploration Configuration (decoding parameters, uncertainty-aware allocation, partial rollouts and resumption, sampling-only inference augmentation).

The contributions of this survey include: the first systematic organization of rollout strategies; the proposal of the GFCR and criteria taxonomy; a synthesis of diverse rollout methods; validation through multi-domain cases; and the provision of a diagnostic index and open challenges.

Figure 1: Rollout lifecycle and GFCR overview

2 Related Work

We contrast this survey with existing ones. Prior surveys are mainly organized around feedback modeling, reward learning, and optimization objectives, treating rollout strategies implicitly. For example, surveys on RLHF and preference learning emphasize feedback collection and modeling. Surveys on RL-enhanced LLMs summarize RLHF, RLAIF, and the direct preference family. Technical surveys focus on RL algorithms and training mechanisms. Pipeline-level surveys classify where RL appears in data generation, pre-training, post-training, and test-time inference. Inference- and agent-centric surveys focus on multi-step reasoning, search, and environmental interaction. In contrast, this survey takes the rollout strategy as the unit of analysis, providing a modular vocabulary to compare how different systems combine topology, sampling, scoring granularity, budget allocation, and experience reuse.

3 Foundations: Rollouts, Criteria, and the GFCR Framework

This section establishes the foundations. We first introduce the functional decomposition of GFCR: the four modules of Generate, Filter, Control, and Replay. We then define the global notation: a rollout τ = (x, u_1:T, o_1:T), where x is the prompt, u_t is the model action, and o_t is the environment observation. A training system typically samples a single rollout or a group of K rollouts. The Filter signal is denoted as ϕ, and the training signal S is derived from Score(ϕ). The computational cost c(τ) and budget B constrain the overall optimization.

GFCR modules are often interleaved: Filter signals trigger Control decisions (e.g., pruning, early stopping), Replay artifacts seed future Generation, and Control strategies determine which artifacts enter Replay.
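
The interleaving described above can be illustrated with a minimal sketch of one pass over a single prompt. It is not the paper's implementation: the `Rollout` class, `gfcr_step`, and the `generate` and `filter_fn` callables are hypothetical names, cost is assumed to be a single number per rollout, and the storage rule (keep rollouts with a positive signal) is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    text: str
    cost: float          # c(tau), e.g. number of generated tokens
    signal: float = 0.0  # phi, filled in by the Filter stage


def gfcr_step(prompt, generate, filter_fn, budget, buffer, k=4):
    """One Generate -> Filter -> Control -> Replay pass for a single prompt."""
    # Replay (retrieve): earlier artifacts for this prompt condition generation.
    context = [r.text for r in buffer if r.prompt == prompt]
    kept, spent = [], 0.0
    for i in range(k):
        rollout = generate(prompt, context, i)       # Generate
        if spent + rollout.cost > budget:            # Control: budget B exhausted
            break
        spent += rollout.cost
        rollout.signal = filter_fn(rollout)          # Filter
        kept.append(rollout)
    # Control decides what enters Replay (assumed rule: positive signal only).
    buffer.extend(r for r in kept if r.signal > 0)
    return kept, spent
```

Calling `gfcr_step` repeatedly with the same `buffer` shows the loop closing: artifacts stored in one pass become the `context` of the next.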

The criteria taxonomy describes three desiderata: Reliability (verifiable results, robust judge scores), Coverage & Informativeness (diverse candidates, disagreement signals), and Cost Sensitivity (value normalization, early stopping). GFCR is the functional decomposition, while the criteria describe the rationale behind choices and how they are evaluated.

Figure 2: GFCR end-to-end rollout system
Figure 3: Rollout criteria taxonomy

4 Generate: How Trajectories Are Proposed

The Generate module specifies how candidate rollouts are proposed. The output is a candidate set T(x) = {τ^(i)}, influenced by topology (Topo), guidance (z), and sampling configuration (κ_G). Topology is divided into linear, group, tree/graph, and interactive. Linear rollouts sample a single trajectory. Group rollouts sample K parallel candidates, supporting intra-group comparison and variance reduction (e.g., GRPO). Tree/graph rollouts branch at intermediate prefixes, amortizing computation through shared prefixes and allocating budgets via pruning. Multi-turn/tool rollouts run in an action-observation loop.

Guidance and scaffolding include in-context learning (ICL) seeding, plan conditioning, reflection sub-rollouts, adaptive guidance strength, and tool augmentation. Sampling strategies include decoding parameters (temperature, top-p), uncertainty-aware sampling (allocating computation based on reward variance or semantic entropy), and sampling-only inference augmentation.

Representative methods include GRPO, DAPO, TreeRPO, TreeRL, RAGEN, and others.

Figure 4: Design space of the Generate module

5 Filter: From Rollouts to Learning Signals

The Filter module maps candidate rollouts into intermediate signals and optimizer-oriented supervision. Formally: ϕ_i = F(τ^(i); T(x)), including structural validity gating (parsing/compilation/executability), correctness verification (unit tests, exact match), process quality scoring (step-level PRM), comparative evaluation (pairwise/listwise judging), learning value signals (uncertainty, entropy), and training signal construction (weights, advantages, labels).

Structural validity gating filters out format-mismatched rollouts, reducing false negatives. Correctness verification is used for code (unit tests) and math (exact match). Process scoring provides partial credit at the step level. Comparative evaluation achieves relative preferences through judges. Learning value signals are used for weighting or guiding sampling.
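
As a rough illustration of the first two Filter stages on a math task, the sketch below gates on structure and then checks correctness by exact match. The `\boxed{...}` answer convention, the normalization rules, and the function names are assumptions made for this example, not details taken from the paper.

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Structural gate: return the payload of \\boxed{...}, or None if absent."""
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1) if match else None


def normalize(answer: str) -> str:
    """Toy normalization: drop spaces and a trailing period."""
    return answer.strip().replace(" ", "").rstrip(".")


def filter_signal(text: str, reference: str) -> dict:
    """Return a small phi record: format validity first, then correctness."""
    answer = extract_answer(text)
    if answer is None:
        # A format failure is reported separately from a wrong answer.
        return {"valid": False, "correct": False}
    return {"valid": True, "correct": normalize(answer) == normalize(reference)}
```

Keeping `valid` and `correct` as separate fields lets later stages treat a malformed rollout differently from a well-formed but wrong one.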

Representative methods include xVerify, RLTF, CodeRL, the PRM by Lightman et al., and GRPO's intra-group normalization.

6 Control: Compute Allocation, Decision Rules, and On/Off-Policy Knobs

The Control module answers: under a limited budget, which samples are worth continuing to roll out, which prefixes should be stopped early, which branches should be expanded or pruned, and how much fresh on-policy versus historical off-policy data should be mixed during training. It transforms the intermediate signals from Filter, the computational cost of each trajectory, and the global budget constraint into a series of decisions, thus directly shaping the distribution of rollout batches seen by the optimizer. In other words, Generate determines 'what can be produced,' Filter determines 'what signals are available,' and Control decides 'where to spend the computational resources.'

Formally, Control can be understood as a sequential decision-making process under budget constraints. For a prompt x, the system maintains a set of partially unfolding trajectory prefixes and, at each step, decides whether to continue, prune, resample, or store based on cost c(τ), budget B, Filter signal ϕ, and training supervision S. The paper formulates its objective as maximizing learning utility U(T) under a per-prompt or global budget constraint. Here, utility can be usable sample size, signal strength, correctness improvement, or other proxies for training value.
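
One way to write this objective down, using the symbols defined in Section 3, is sketched below. The control policy symbol \pi_{\mathrm{ctrl}} is an assumed name introduced here; U, \mathcal{T}(x), c(\tau), and B follow the text, and the per-prompt form of the constraint is shown.

```latex
\max_{\pi_{\mathrm{ctrl}}} \; \mathbb{E}_{x \sim \mathcal{D}}
  \Big[\, U\big(\mathcal{T}(x)\big) \,\Big]
\quad \text{s.t.} \quad
\sum_{\tau \in \mathcal{T}(x)} c(\tau) \;\le\; B
```

For a global budget, the sum would instead range over all prompts in the batch.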

6.1 Prompt and Task Selection

The first category of control occurs before rollout: choosing which prompts are worth generating for. Traditional approaches often sample uniformly from the training distribution, but many prompts contribute little learning signal. For instance, when all samples in a rollout group have identical rewards, the intra-group advantage in GRPO-like methods collapses to zero and produces virtually no gradient. GRESO attempts to predict such zero-variance prompts and skip them while preserving exploration. VCRL treats intra-group reward variance as a proxy for sample difficulty, positing that prompts that are too easy or too hard often have low variance, while medium-difficulty prompts yield more useful learning signals.
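
The zero-variance collapse is easy to see numerically. The sketch below uses mean-and-standard-deviation normalization within a group, in the style of GRPO; the use of the population standard deviation and the `eps` constant are assumptions, and `is_zero_variance` is a hypothetical helper rather than GRESO's predictor.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every rollout in the group received the same reward."""
    return pstdev(rewards) == 0.0
```

With rewards `[1, 1, 1, 1]` every advantage is zero, so the group contributes no gradient; with `[1, 0, 0, 0]` the single success gets a positive advantage and the failures a negative one.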

Another line uses uncertainty modeling for task selection. VADE estimates each prompt's accuracy using a Beta posterior and biases sampling towards high-information prompts via Thompson sampling. SEED-GRPO does not directly select prompts but modulates the policy update magnitude based on the semantic entropy across multiple answers, applying more conservative updates to high-uncertainty samples. SEC models curriculum selection as a non-stationary multi-armed bandit, learning at the category level which difficulty or task types yield higher learning gains. Together, they reflect a trend: rollout allocation is no longer a fixed sampling process but an adaptive resource management problem.
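
The Beta-posterior idea can be sketched as follows. This is an illustration of Thompson sampling over prompts, not VADE's actual algorithm: the uniform Beta(1, 1) prior, the informativeness score p(1 - p), and the names `PromptStats` and `select_prompt` are all assumptions.

```python
import random


class PromptStats:
    """Beta(1 + successes, 1 + failures) posterior over a prompt's pass rate."""

    def __init__(self, successes=0, failures=0):
        self.successes = successes
        self.failures = failures

    def update(self, correct):
        if correct:
            self.successes += 1
        else:
            self.failures += 1

    def sample_accuracy(self, rng):
        # Thompson sampling: draw one plausible pass rate from the posterior.
        return rng.betavariate(1 + self.successes, 1 + self.failures)


def select_prompt(stats, rng):
    """Pick the prompt whose sampled pass rate p maximizes p * (1 - p)."""
    def informativeness(name):
        p = stats[name].sample_accuracy(rng)
        return p * (1.0 - p)
    return max(stats, key=informativeness)
```

Because the score is computed from a posterior draw rather than the posterior mean, rarely tried prompts still get selected occasionally, which is how exploration is preserved in this sketch.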

6.2 Budgeting and Scheduling

The second category of control decides how much rollout width, depth, and token budget to allocate per prompt. Early GRPO-style training often uses a fixed number of K candidates, but a fixed width wastes computation on easy problems and may under-explore difficult ones. The paper summarizes variance-aware, difficulty-aware, and uncertainty-aware scheduling methods: sample less for low-information cases, and increase the number of candidates, search depth, or token budget for contentious or high-uncertainty samples.
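
A minimal sketch of variance-aware width allocation is shown below, assuming binary rewards so that a prompt with estimated pass rate p has reward standard deviation sqrt(p(1 - p)). The function name, the `k_min` floor, and the proportional rule are assumptions for illustration and do not reproduce any specific method from the survey.

```python
import math


def allocate_rollouts(pass_rates, total, k_min=2):
    """Split `total` rollouts across prompts in proportion to reward std."""
    n = len(pass_rates)
    spare = total - k_min * n
    if spare < 0:
        raise ValueError("total is too small for k_min rollouts per prompt")
    weights = [math.sqrt(p * (1.0 - p)) for p in pass_rates]
    norm = sum(weights)
    if norm == 0:
        # Every prompt looks solved or hopeless: fall back to an even split.
        return [k_min + spare // n] * n
    # Flooring may leave a few rollouts of the budget unused.
    return [k_min + int(spare * w / norm) for w in weights]
```

For estimated pass rates of 0.0, 0.5, and 1.0 and a total of 16 rollouts, the two uninformative prompts keep the floor of 2 and the contested prompt receives the remaining 12.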

Such scheduling also affects system throughput. Long reasoning rollouts exhibit a pronounced heavy tail, where a few very long samples can slow down synchronous training. The control layer therefore needs to consider rollout count, maximum length, candidate group size, tree search width, and batch load balancing within a single budget framework. The core trade-off is: a fixed budget brings stable implementation, while an adaptive budget improves computational efficiency but may also introduce selection bias and reproducibility difficulties.

6.3 Rollout Configuration Control

The third category of control targets the morphology of individual trajectories, including maximum length, 'deep thinking' mode, temperature, top-p, brevity rewards, and the ratio of positive to negative samples. ShorterBetter uses the shortest correct answer to define the Sample Optimal Length (SOL), aiming to learn an instance-adaptive optimal Chain-of-Thought (CoT) length. DECS points out the misalignment between trajectory-level rewards and token-level optimization, thus introducing decoupled token-level rewards and curriculum batch scheduling to reduce redundant tokens without suppressing necessary exploration.
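
The Sample Optimal Length idea can be sketched as follows. Only the definition of SOL as the length of the shortest correct response comes from the text; the reward shape (correctness minus a penalty proportional to the distance from SOL), the `beta` value, and the behavior when no response is correct are assumptions for illustration, not ShorterBetter's published formula.

```python
def sample_optimal_length(lengths, correct):
    """Length of the shortest correct response in the group, or None."""
    ok = [n for n, c in zip(lengths, correct) if c]
    return min(ok) if ok else None


def length_aware_rewards(lengths, correct, beta=0.001):
    """Correctness reward minus an assumed penalty for deviating from SOL."""
    sol = sample_optimal_length(lengths, correct)
    rewards = []
    for n, c in zip(lengths, correct):
        # Assumption: with no correct response there is no length target.
        penalty = beta * abs(n - sol) if sol is not None else 0.0
        rewards.append(float(c) - penalty)
    return rewards
```

Because SOL is recomputed per group, the length target adapts to each instance instead of being a global cap.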

Whether long reasoning is itself needed can also be controlled. AdaptThink observes that on simple problems, a direct answering mode might outperform long reasoning, and thus trains a model to choose between thinking and no-thinking modes based on problem difficulty. Large Hybrid-Reasoning Models use cold-start fine-tuning plus online RL to learn mixed thinking decisions. CoRL focuses on the performance-cost trade-off when invoking external LLM reasoning. GFPO and 'Train Long, Think Short' further illustrate that spending a bit more sampling and filtering cost during training can potentially yield shorter and more efficient inference at test time.

6.4 Early Exit, Branching, On/Off-Policy, and System Throughput

Control also covers early exit of partial rollouts, tree search pruning, and multi-agent branching control. If a local checker has judged a prefix to be either a success or a high-confidence failure, the system can stop further generation. If certain branches in a tree have poor prospects, they can be pruned and their budget transferred to more promising branches. Methods like TreeRPO use tree sampling to estimate the expected rewards of different reasoning steps, constructing denser step-level training signals.
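
The early-exit and pruning decisions can be sketched as a single control step over a set of prefixes. The function name and the `score` and `is_terminal` callables are hypothetical stand-ins for a value estimate and a local checker; this is a generic beam-style rule, not the procedure of TreeRPO or any other cited method.

```python
def control_step(prefixes, score, is_terminal, width):
    """Split prefixes into finished ones and the top-`width` ones to expand."""
    # Early exit: prefixes the checker has already resolved are not extended.
    finished = [p for p in prefixes if is_terminal(p)]
    active = [p for p in prefixes if not is_terminal(p)]
    # Pruning: keep only the most promising unresolved prefixes.
    active.sort(key=score, reverse=True)
    return finished, active[:width]
```

The budget saved on the dropped prefixes can then be spent on expanding the survivors.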

Finally, Control also decides how on-policy and off-policy data are mixed. On-policy rollouts align with the current policy but are expensive. Historical replay can improve sample efficiency but introduces the risk of policy drift. RePO adds a replay buffer to GRPO, and ReMix allows on-policy RFT methods like PPO/GRPO to utilize off-policy data. AR3PO mitigates runaway importance ratios by recomputing token probabilities of old responses under the current policy. At the system level, methods like ReSpec, DAS, TLT, EARL, and Seer incorporate speculative decoding, long-tail load balancing, dynamic parallelism, and reuse of similar samples into the control problem to enhance rollout throughput.
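
To make the importance-ratio issue concrete, the sketch below computes per-token ratios between the current policy and the policy that generated a stored response, then truncates them to a fixed range. This is a generic truncated-importance-weight illustration, not the exact rule of RePO, ReMix, or AR3PO; the function name and the `clip` value are assumptions.

```python
import math


def clipped_ratios(logp_current, logp_behavior, clip=0.2):
    """Per-token ratios pi_current / pi_behavior, truncated to [1-clip, 1+clip]."""
    ratios = [math.exp(c - b) for c, b in zip(logp_current, logp_behavior)]
    return [min(max(r, 1.0 - clip), 1.0 + clip) for r in ratios]
```

A token whose log-probability has risen from -3.0 to -1.0 has a raw ratio of about 7.4; truncation caps its weight at 1.2, which bounds the influence of stale samples at the price of some bias.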

7 Replay: Retention, Reuse, and Self-Evolution

The Replay module focuses on 'what is worth keeping, how to reuse it, and when to discard it' after a rollout ends. It is not a simple data cache, but organizes previously generated trajectories, verification signals, sub-steps, failure samples, correct anchors, and tool interaction records into retrievable artifacts, benefiting future Generate, Filter, and Control stages. The paper formalizes Replay using storage rules (R_store) and retrieval rules (R_retrieve): the former determines which trajectories or signals enter the buffer, and the latter retrieves relevant artifacts for a new prompt based on similarity, correctness, diversity, cost, and freshness.
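
A minimal sketch of a buffer with explicit storage and retrieval rules is given below. The `Artifact` fields, the keep-only-correct storage rule, and retrieval by exact prompt match with an age limit are simplifying assumptions; the paper's rules also consider similarity, diversity, and cost, which are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    prompt: str
    response: str
    correct: bool
    step: int  # training step at which the artifact was produced


class ReplayBuffer:
    def __init__(self, max_age):
        self.max_age = max_age
        self.items = []

    def store(self, artifact):
        """Storage rule (assumed): keep only verified-correct artifacts."""
        if artifact.correct:
            self.items.append(artifact)
        return artifact.correct

    def retrieve(self, prompt, now, k=1):
        """Retrieval rule (assumed): same prompt, fresh enough, newest first."""
        fresh = [a for a in self.items
                 if a.prompt == prompt and now - a.step <= self.max_age]
        return sorted(fresh, key=lambda a: a.step, reverse=True)[:k]
```

Stale artifacts are filtered at retrieval time in this sketch; a production buffer would more likely evict them to bound memory.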

7.1 Response Resampling and Retention

The most direct form of replay treats complete responses as reusable units. It serves two purposes: one is to reuse past high-value samples, improving data efficiency; the other is to stabilize the advantage signal in group normalization objectives. For example, when all rollouts for a current prompt are incorrect, the reward variance for GRPO is zero, causing the gradient to vanish. DAPO seeks non-homogeneous batches—neither all wrong nor all right—through dynamic sampling, which increases inference cost. AR3PO retains early correct responses and injects cached correct samples when the current batch is all incorrect, giving flawed rollouts a negative advantage instead of a zero gradient.
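
The effect of injecting a cached correct response can be shown with a few lines. This is a simplified illustration of the idea described for AR3PO, not its implementation: the function names are hypothetical, rewards are assumed binary, and advantages are only mean-centered.

```python
def inject_anchor(rewards, responses, cached_correct):
    """If every rollout failed and a cached correct response exists, append it."""
    if rewards and cached_correct is not None and max(rewards) == 0:
        return rewards + [1.0], responses + [cached_correct]
    return rewards, responses


def centered(rewards):
    """Mean-centered rewards, standing in for a group-relative advantage."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]
```

For three failed rollouts, `centered([0.0, 0.0, 0.0])` is all zeros; after injection the group becomes `[0.0, 0.0, 0.0, 1.0]` and each failed rollout receives an advantage of -0.25 instead of zero.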

Replay buffers also support off-policy reuse. Methods like RePO, ReMix, and ExGRPO leverage historical responses to improve sample efficiency, but must handle policy drift and importance weighting. If the old policy π_{θ^-} that generated the stored samples differs too much from the current policy π_θ, reused samples can introduce bias. Such methods therefore rely on probability recomputation, KL constraint enforcement, ranking by correctness, entropy, or learning progress, or refresh and eviction mechanisms.

7.2 Recomposition and Segment Reuse

The second replay granularity is not the entire trajectory but verifiable segments. Long reasoning, code repair, tool invocation, and agentic tasks often involve multiple sub-problems or sub-trajectories, some of whose prefixes, patches, tests, and tool results are reusable. Slicing, verifying, storing, and subsequently recombining these segments for new tasks can amortize shared computation and refine the learning signal from terminal correct/incorrect down to local steps.

This idea is particularly suited to code, SQL, mathematical proofs, and multi-step agents. In code tasks, verified patches, unit tests, error logs, and repair snippets can be cached. In math tasks, correct intermediate lemmas or short correct snippets can serve as scaffolding for future problem-solving. In tool agents, successful web navigation sub-procedures or API call sequences can be converted into retrievable skills. Compared to full rollout replay, segment replay is more flexible but also more dependent on boundary segmentation, segment correctness verification, and contextual compatibility assessment.

7.3 Self-Evolving Curricula and Intrinsic Feedback

The third form of replay goes a step further: rollouts are not just training data but actively expand the training distribution. Methods like STaR, Self-Rewarding, Self-Play RL, AGILE/Auto-RL, Agent0, and LANCE embody this self-evolving mindset. A model can generate new tasks, reflect on existing flaws, construct harder samples, annotate data with preferences, or even have a curriculum agent and an execution agent mutually improve each other: the curriculum agent proposes more difficult, tool-dependent problems, and the execution agent learns to solve them through RL.

The potential of such methods lies in reducing reliance on human annotation and continuously expanding capability boundaries; the risks lie in bias accumulation, quality drift, and untraceability. If self-generated tasks increasingly deviate from real needs, or if the reward model and policy drift together, replay will solidify erroneous preferences into the training. The paper therefore emphasizes that replay requires recording provenance, policy versions, verifier results, timestamps, and refresh status to ensure that reused samples are both valuable and auditable.
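
The audit fields listed above map naturally onto a small record type. The sketch below is one possible layout: the class name, field names, and the staleness rule in `is_reusable` are assumptions, chosen only to show how provenance metadata can gate reuse.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReplayRecord:
    sample_id: str
    source: str            # provenance, e.g. "self-generated" or "human"
    policy_version: int    # version of the policy that produced the sample
    verifier_passed: bool
    created_at: float      # Unix timestamp
    refreshed: bool = False


def is_reusable(record, current_version, max_lag):
    """Assumed rule: verified, and produced by a sufficiently recent policy."""
    return (record.verifier_passed
            and current_version - record.policy_version <= max_lag)


def audit_line(record):
    """Serialize a record as one JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON line per reused sample makes it possible to trace afterwards which policy version and verifier result stood behind each replayed item.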

8 Domains and Case Studies

The paper views benchmarks as rollout interfaces: a task instance x comes from a distribution D, and the model produces a trajectory τ = (x, u_1:T, o_1:T) within the interface, where u_t is the model action and o_t is the environment observation. The core difference between domains is not just task content, but what feedback the interface returns, whether the verifier is reliable, whether the trajectory is multi-turn, whether segments can be reused, and how the budget should be allocated among depth, width, and replay.

8.1 Verifiable Language Interfaces

Mathematics, code, and SQL are the most typical verifiable language interfaces. Math tasks usually involve text-only rollouts, with a final answer verified after normalization using exact match or rule-based methods. Systems like DeepSeekMath, DeepSeek-R1, and SEED-GRPO demonstrate how an RLVR-style objective can be combined with math data and sampling strategies. Methods like TreeRL, TreeRPO, and VCRL further show that tree/group rollouts, variance-aware curricula, and uncertainty sampling significantly affect training stability and cost.

Code and SQL are execution-grounded interfaces. The model outputs a program, patch, or query, and the verifier is provided by compiling, running, unit tests, or database execution. Cases like CodeRL, RLTF, LiveCodeBench, BIRD, and Arctic-Text2SQL-R1 show that execution feedback naturally forms a multi-stage rollout of 'generate-execute-observe failure-repair.' Here, Filter is very concrete: is it compilable? Do tests pass? Is the execution result equivalent? Replay is also natural: one can cache patches that pass tests, error logs, partial queries, and verified segments.

8.2 Multimodal Reasoning Interfaces

Multimodal tasks expand the input to images, video, spatial scenes, or audio-visual clips. Unlike math/code, multimodal reasoning often lacks a universally strong verifier, so systems need to design tasks that can be rule-verified or rely on structured answer extraction, label checking, synthetic data, and specialized evaluation protocols to obtain relatively stable rewards. Works like R1-VL, MMR1, SpaceR, SPACEVISTA, InternSpatial, SPAR, and VSI-Bench embody this direction.

For GFCR, multimodal interfaces make Generate more complex: a rollout may contain visual observations, textual reasoning, and spatial relationship judgments. Filter needs to convert free-text answers into checkable structures. Control must decide if more visual evidence, more sampling, or longer reasoning is needed. Replay can reuse verified visual-language reasoning templates, spatial relationship segments, or synthetic sample generation strategies.

8.3 Agentic Interactive Benchmarks

The key difference between agentic interactive tasks and text-only tasks is that o_1:T is not empty: each action by the model changes the environment and receives an observation. Software engineering benchmarks (e.g., SWE-Bench, SWE-agent, SWE-Gym, Agent-RLVR) require the model to locate issues in a codebase, edit files, run tests, and iterate based on feedback. Web agent benchmarks (BrowserGym, AgentDojo, ARLAS) require the model to click, type, browse, and handle webpage states, and may even face security risks like indirect prompt injection. Conversational simulators (RLVER, SAGE) treat the user's state and emotional trajectory as a verifiable reward source.

Rollouts in these benchmarks are typically long, have very sparse rewards, and incur expensive environmental feedback. The role of Control is therefore magnified: when to stop, when to backtrack, when to branch, and whether to continue invoking tools all determine the cost and success rate. Replay shifts from remembering answers to remembering processes: successful tool invocation sequences, web navigation flows, code editing strategies, and failure diagnoses can all become reusable experiences for future tasks.

8.4 Agentic Skills Benchmarks

Agentic skills benchmarks further examine whether a model can induce reusable skills from trajectories and transfer them to new tasks. In environments like WebArena, Mind2Web, and BrowserGym, Agent Workflow Memory abstracts sub-procedures into retrievable natural language workflows. Agent Skill Induction represents skills as re-executable Python functions. SkillWeaver allows agents to automatically discover and refine reusable APIs. Works like ReUseIt focus on skill reuse across different tasks and models.

In this type of interface, the four GFCR modules resemble a long-term learning loop: Generate produces candidate actions and skill invocations, Filter verifies whether skills succeed, Control decides whether to store them in the skill library or continue exploring, and Replay retrieves old skills for new tasks. The paper emphasizes that this type of scenario pushes rollout strategy from a simple sampling technique for one-shot post-training towards a design for a continuous self-improvement system: the key is not just a single task's success, but whether a maintainable, traceable, and transferable experience repository can be formed.

9 Failure Modes and Open Problems

Common rollout pathologies include: the zero-reward pattern (all rollouts fail), reward hacking, length inflation, signal noise, computational waste, and replay staleness. The GFCR framework provides a diagnostic index, mapping each pathology to specific modules and mitigation levers. Open challenges include: verifier/judge calibration, principled computational accounting, safe self-evolution with provenance tracking, and improved reporting standards to enhance reproducibility.

10 Conclusion

This survey systematically organizes rollout strategies in LLM reinforcement learning post-training through the GFCR framework. We decompose the rollout pipeline into the four modules of Generate, Filter, Control, and Replay, supplemented by the criteria of Reliability, Coverage & Informativeness, and Cost Sensitivity. Through case studies in mathematics, code, multimodal reasoning, and agentic domains, we demonstrate the framework's unified descriptive power. We provide a diagnostic index and highlight open challenges, aiming to promote the design of more reproducible, efficient, and trustworthy rollout pipelines.
