The Bitter Lesson! ROLL Team Shares: Practical Experience in Agentic RL Training

Homepage: http://qingkeai.online/


Authors: Yancheng He, Weixun Wang, and Xiaoyang Li | Project Leader: Weixun Wang

English Title: The Bitter Lesson Behind Building Agentic RL in Terminal Environments

image

The Story of Two RLers

Alex is a second-year PhD student who has been working on RLVR (Reinforcement Learning with Verifiable Rewards) for the past few months. The process of training LLMs to solve math problems and write code has been almost immediately effective—the model generates answers, receives rewards, and then improves. Clean, simple.

"RLVR is essentially just a single-step bandit problem," Alex often jokes with lab mates.

One day, his advisor suddenly suggested he explore agentic tasks: web navigation, tool calling, multi-step reasoning in real environments.

"The future is Agent," the advisor said meaningfully.

Alex dove in confidently: "How hard can it be? I understand PPO, I've read the GRPO paper, and I've deployed RLVR pipelines."

Two weeks later, Alex stared blankly at a training curve showing no progress.

"What's wrong?" asked Morgan, a senior student quietly working on agent systems.

"Everything's wrong!" Alex complained. "My model behavior is always weird, the environment keeps having issues, and nothing is being learned. Credit assignment is impossible. Training is incredibly slow. And then there's the trajectory storage problem—just storing KV cache is consuming all GPU memory."

Morgan nodded: "Welcome to agentic RL. It's not a bandit anymore."

"But I've read all the long-horizon RL papers..."

"Papers give you algorithms. They don't tell you what to do when the environment crashes halfway through a trajectory; how to batch rollouts when each episode has different lengths; or how to efficiently replay 50-step long trajectories during training."

Alex leaned back helplessly in his chair: "So what should I do?"

Morgan smiled: "Fortunately, we've recorded everything—infrastructure, tricks, failure cases. A bit messy, but all real."

That day, Alex finally realized:

RLVR trains a model that "knows how to answer." Agentic RL trains a model that "knows how to act"—acting across time, across states, across uncertainty.

And this changes everything.

image


RLVR has brought significant improvements in mathematics, coding, and general reasoning tasks. But behind its success lies a structural simplification: Traditional RLVR is more like an in-context bandit problem—the model generates a complete answer once, receives a reward, and then updates parameters. There is no multi-step interactive decision-making or environmental state transition in the process.

Agentic RL is closer to a multi-step interactive MDP setting: the model needs to take actions, observe environmental feedback, and optimize long-horizon trajectories under sparse and delayed reward signals.

This means the model no longer just "gives an answer," but must continuously make decisions and correct behavior in a constantly changing environment, and take responsibility for the final result. This also expands application scenarios from closed, verifiable tasks to more complex real-world tasks such as travel planning and complex data analysis.

This transformation also places higher demands on infrastructure and algorithm design: including end-to-end asynchronous training pipelines, more stable long-horizon credit assignment mechanisms, deep integration with real environments, and engineering infrastructure capable of supporting continuous scaling. This article records our exploration experience in this direction.

We will first introduce how we built the training environment, then share how we filter RL training instances, and finally discuss a series of practical experiences we accumulated during training Agentic RL.

Readers more interested in the algorithm section can skip directly to the training part.

Why does this matter? Agentic RL is not just about algorithms: it requires co-designing environments, infrastructure, and algorithms.

Environment Manager: From 0 to 1

To train agents using reinforcement learning in terminal environments, we first built an environment manager in ROLL and clearly defined the interaction boundaries between three core components: ROLL (training framework), iFlow CLI (Agent framework), and ROCK (sandbox manager).

In practice, we supported two complementary modes:

  • Roll-Managed Mode: ROLL is responsible for context management and trajectory construction; mainly interacts with iFlow CLI through tool calling interfaces.

  • CLI-Native Mode: Context, sessions, and history information are fully maintained by iFlow CLI; ROLL only serves as the caller and is not responsible for trajectory splicing.

At different training stages, we switch between these two modes based on current priority goals.

Roll-Managed Mode

In this mode, the terminal environment runs in a lightweight, step-granularity manner, while ROLL is responsible for the entire rollout loop, trajectory construction, and context management.

Main components include:

  • TrajEnvManagerTB: Drives the complete rollout process (reset → decision → execution → termination) and saves trajectory data needed for training.

  • TerminalBenchEnv: Loads terminal task data, submits execution requests to the sandbox, collects execution results, and calculates rewards based on test results.

  • SandboxManager: Manages the lifecycle of sandbox sessions (create sessions, execute commands, upload files, etc.).

  • IFlowCLITool: Parses tool calling formats and return results, and constructs executable commands conforming to iFlow CLI protocol.
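To make the division of labor concrete, here is a minimal sketch of the Roll-Managed rollout loop. The class and method names (`reset`, `execute`, `evaluate`, the `Trajectory` schema) are illustrative simplifications, not the actual ROLL interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One decision step: the model's action plus the environment's feedback."""
    action: str
    observation: str

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    reward: float = 0.0

def rollout(env, policy, max_turns: int = 50) -> Trajectory:
    """Drive one reset -> decide -> execute -> terminate loop, Roll-Managed style:
    ROLL owns the loop, the context, and the trajectory construction."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_turns):
        action = policy(obs)                 # LLM proposes a tool call / command
        obs, done = env.execute(action)      # sandbox runs it, returns feedback
        traj.steps.append(Step(action, obs))
        if done:
            break
    traj.reward = env.evaluate()             # tests run only at the very end
    return traj
```

In the real system, `TrajEnvManagerTB` plays the role of this loop, `TerminalBenchEnv` the role of `env`, and the policy call goes through the inference engine.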

The main advantage of this mode is high flexibility on the training side: context can be flexibly organized according to training needs, introducing richer prompt templates and interaction mechanisms to improve robustness (important).

However, the disadvantage is the need to maintain additional context processing logic within ROLL, which inevitably creates some gap with the behavior of the real iFlow CLI Agent.

CLI-Native Mode

image

In many Agentic RL training pipelines, the prompt design and context management during training often differ from the real Agent framework in production environments, which usually leads to decreased model capabilities after deployment. To better align with Agent-side optimization, we also developed CLI-Native Mode.

In CLI-Native mode, we are essentially "training the model directly on iFlow CLI."

  • During the RL process, ROLL directly calls the iFlow CLI API to fetch the latest context, rather than manually splicing prompts or re-implementing the Agent logic.

  • iFlow CLI is responsible for managing all context, sessions, and history information, ensuring the input distribution the model sees during training is consistent with real usage scenarios (including dynamic context, tool lists, system prompts, internal states, etc.), and returns updated context to ROLL.

  • iFlow CLI and ROLL communicate through a lightweight ModelProxy Service, which provides queue-based asynchronous messaging for exchanging LLM requests and responses, supporting high concurrency and non-blocking execution.
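The ModelProxy Service can be pictured as a pair of queues sitting between the CLI and the trainer. The sketch below is a single-process simplification with assumed names (`submit`, `serve`, `result`); the real service supports high-concurrency, non-blocking messaging:

```python
import itertools
import queue
import threading

class ModelProxy:
    """Minimal queue-based proxy: the CLI side enqueues LLM requests,
    a trainer-side worker dequeues them, runs generation, and publishes
    the response. All names here are illustrative, not the real API."""

    def __init__(self):
        self.requests = queue.Queue()   # CLI -> trainer
        self.responses = {}             # req_id -> generated text
        self._done = {}                 # req_id -> completion event
        self._ids = itertools.count()

    def submit(self, prompt: str) -> int:
        """CLI side: enqueue a request and return a handle (non-blocking)."""
        req_id = next(self._ids)
        self._done[req_id] = threading.Event()
        self.requests.put((req_id, prompt))
        return req_id

    def serve(self, generate):
        """Trainer side: pull one request, run the model, publish the answer."""
        req_id, prompt = self.requests.get()
        self.responses[req_id] = generate(prompt)
        self._done[req_id].set()

    def result(self, req_id: int, timeout: float = 5.0) -> str:
        """CLI side: wait for and fetch the response for a handle."""
        self._done[req_id].wait(timeout)
        return self.responses[req_id]
```

Because the CLI never calls the model directly, the trainer is free to batch, reorder, or throttle requests behind the queue.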

This mode ensures training, evaluation, and deployment are completely consistent, minimizing behavioral inconsistency issues, but has relatively lower flexibility in training-side context customization.

In practice, we use both modes at different stages, and they complement each other.

Some Implementation Details

Asynchronous Training Pipeline

Agentic RL has obvious long-tail latency characteristics: most rollouts can complete quickly, but a few rollouts take longer due to longer generated text or slow environment interactions.

In synchronous, batch-style rollout pipelines, these long-running tasks can easily become straggler bottlenecks, leading to decreased GPU utilization and increased end-to-end latency.

image

image

To address this issue, we built a fully asynchronous training pipeline in ROLL. Specifically including:

  • Environment-level asynchronous rollout: Decoupling LLM generation, environment interaction, and reward calculation in rollout to be independent and non-blocking, achieving finer-grained execution scheduling.

  • Redundant parallel environments: Avoiding fail-slow or fail-stop environments becoming system bottlenecks by increasing the number of environment groups and the group size.

  • Asynchronous training mechanism: Decoupling rollout and training stages on different devices to proceed in parallel.

  • Train-rollout multiplexing mechanism: Dynamically partitioning GPU resources through time-division multiplexing, allowing devices to flexibly switch between inference and training.

This design keeps the system robust when facing various long-tail phenomena and maintains stable throughput under high latency fluctuations.
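The first two ideas, non-blocking rollout plus redundant environments, can be illustrated with `asyncio`: launch more environments than the batch needs and keep the first finishers, so stragglers never block the batch. This is a toy sketch under assumed names, not the ROLL scheduler:

```python
import asyncio
import random

async def one_rollout(env_id: int) -> dict:
    """Stand-in for generation + environment interaction for one episode;
    the sleep models the highly variable LLM + sandbox latency."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"env_id": env_id, "reward": 1.0}

async def redundant_rollouts(needed: int, redundancy: float = 1.5) -> list:
    """Launch `needed * redundancy` environments and keep only the first
    `needed` completions, so long-tail rollouts never stall the batch."""
    launched = int(needed * redundancy)
    tasks = [asyncio.create_task(one_rollout(i)) for i in range(launched)]
    results = []
    for fut in asyncio.as_completed(tasks):
        results.append(await fut)
        if len(results) == needed:
            break
    # Cancel and drain the long tail so no task is left dangling.
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results

batch = asyncio.run(redundant_rollouts(needed=4))
```

The same idea generalizes to environment groups: over-provision, take the fastest finishers, discard or recycle the rest.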

If you're interested in the underlying design and more implementation details, you can refer to our ROLL Flash paper and ROLLART paper.

image

Keeping the Environment "Clean"

In terminal RL training, the initial environment state directly determines what the Agent can observe and utilize. Even tiny residual traces—such as temporary files, cached links, incomplete installations, or leaked test scripts—can affect learning signals.

In early experiments, we found two related issues:

  • Environment initialization and Agent installation processes often leave intermediate artifacts (such as temporary files, cache packages, partial installation results), which may indirectly hint to the model.

  • In a few synthetic environments, although test files were directory-isolated and permission-controlled, models could still indirectly access them through certain paths or commands.

Especially in the second case, the model very quickly learns to "take shortcuts": instead of seriously reasoning about the task, it directly reads or even modifies the test scripts. The figure below shows the distribution of the most common commands in early training. Test-script calls (marked in red) rise significantly, indicating that the model increasingly relies on this shortcut; eventually many rollouts degrade into directly executing the test files.

image

To prevent such leaks and contamination, we perform strict environment cleanup:

  • Proactively clean intermediate files generated during environment initialization or Agent installation before rollout.

  • Test files are only uploaded during the final evaluation phase, strictly isolated from the training phase.

In short, we ensure the environment stays clean and strictly isolate all test-related files, letting the Agent truly learn to solve problems in the sandbox rather than exploiting residual clues or test script vulnerabilities.
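A minimal sketch of the pre-rollout cleanup step. The artifact patterns below are illustrative examples, not our actual cleanup list:

```python
import os
import shutil

# Illustrative artifact patterns; the real cleanup list is environment-specific.
ARTIFACT_PATTERNS = ("__pycache__", ".cache", "install.log", ".tmp")

def clean_sandbox(root: str) -> list:
    """Remove initialization artifacts before rollout so the agent sees no
    residual hints. Test files are handled separately: they are only uploaded
    at evaluation time and never exist in the training-phase filesystem."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames:
            if any(p in name for p in ARTIFACT_PATTERNS):
                shutil.rmtree(os.path.join(dirpath, name), ignore_errors=True)
                removed.append(name)
        for name in filenames:
            if any(p in name for p in ARTIFACT_PATTERNS):
                os.remove(os.path.join(dirpath, name))
                removed.append(name)
    return removed
```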

RL Training Instances

The quality of RL training instances is crucial for agentic RL. At the same time, not all "high-quality" instances are suitable for RL training. In our training pipeline, RL instances mainly come from two sources:

  • Large-scale synthetic instances: Sampled by difficulty and tags, and further annotated and filtered by multiple external providers.

  • Expert-written Instances: Usually more difficult and more carefully constructed.

For a detailed introduction to the synthesis process, please refer to our technical report: Let It Flow. We won't focus on that here.

Below we summarize several key issues in practice.

False Positive Problem

In the early stages, we found that many synthetic instances had false positive problems: automatically generated test cases were either incomplete or had issues themselves.

Of course, this is a common problem in automated unit test generation, but it's particularly fatal in agentic RL because models can find various ways to "cut corners." In our early synthetic data, the false positive rate reached about 40%.

A typical example:

Task Description

Configure a git server so that I can run on my computer
git clone user@server:/git/server
echo "hello world" > hello.html
git add index.html
git commit -m "add index"
git push origin webserver
And have this data then be pushed to a webserver running on port 8080 so if I run
curl 
then I see the output "hello world"

But the test script only checks one thing: that curl returns "hello world".

Therefore, the Agent can pass the test without actually building the entire git → push → deploy pipeline—for example, by directly writing hello.html to the web root directory. The final output appears correct, but the underlying system behavior doesn't meet expectations.

image

To solve this problem, we introduced a complete LLM-as-judge verification module in the data synthesis process. Multiple LLMs collaboratively review each "instruction-test" pair to identify instances with high false-positive risk. For these high-risk samples, we strengthen test cases or adjust task descriptions. Only instances that pass verification enter the RL training pool.

Ground-Truth and No-Op Validation

Before adding instances to the RL training pool, we perform two basic checks:

  • Ground-truth validation: If the golden solution cannot pass all tests, discard the instance.

  • No-op validation: If tests can be passed without executing any valid operations, discard the instance.

These two checks can effectively avoid introducing instances that produce misleading training signals.
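The two checks reduce to a few lines. Here `run_tests` is an assumed callable that runs an instance's test suite against a candidate solution, with `None` meaning "take no action in the environment":

```python
def validate_instance(instance: dict, run_tests) -> bool:
    """Admit an instance into the RL training pool only if:
    1. Ground-truth check: the golden solution passes all tests.
    2. No-op check: doing nothing does NOT pass the tests.
    `run_tests(solution)` returns True iff the test suite passes."""
    if not run_tests(instance["golden_solution"]):
        return False   # tests reject the reference answer: broken tests
    if run_tests(None):
        return False   # tests pass with no work done: no learning signal
    return True
```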

Environment Diversity and Robustness

Leveraging the flexibility of Roll-Managed Mode, we intentionally introduce diversity in the initial environment, for example:

  • Different versions of software packages;

  • Different mirror sources;

  • Different environment configuration details.

The goal is to prevent the Agent from overfitting to a single "idealized" environment configuration, enabling it to handle more diverse environments.

Beyond the randomization mentioned above, we sometimes further intentionally perturb or even partially break the environment—for example, removing a pre-installed dependency or switching to an unavailable mirror source. This forces the model to learn to check, diagnose, and recover rather than assuming all environment conditions are ready.

In practice, these operations are essentially a form of environment augmentation: they help the Agent handle uncertainty, prompting it to proactively check environment state before acting, and improving its ability to adapt to different environment configurations.
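A sketch of how such environment augmentation can be sampled per rollout. All configuration values and the fault list are illustrative stand-ins, not our actual settings:

```python
import random

# Illustrative augmentation space; real values are task- and image-specific.
AUGMENTATIONS = {
    "python_version": ["3.9", "3.10", "3.11"],
    "pip_mirror": ["default", "mirror-a", "mirror-b"],
    "locale": ["en_US.UTF-8", "C.UTF-8"],
}
# Deliberate breakages that force the agent to diagnose and recover.
BREAKAGES = ["remove_preinstalled_dep", "unreachable_mirror"]

def sample_env_config(rng: random.Random, break_prob: float = 0.2) -> dict:
    """Randomize the initial environment, occasionally injecting a fault so
    the agent learns to check state before acting rather than assume it."""
    cfg = {key: rng.choice(values) for key, values in AUGMENTATIONS.items()}
    cfg["injected_fault"] = (
        rng.choice(BREAKAGES) if rng.random() < break_prob else None
    )
    return cfg
```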

Environment Augmentation (Generated by Nano Banana)

How to Ensure Stability of Agentic RL in Terminal Environments

Training agentic RL in real terminal environments is very different from training on static datasets. Training instability comes not only from policy optimization itself, but also from instance quality, environment noise, framework constraints, and long-horizon credit assignment.

Below we share some techniques and experiences we consider key in practice.

Mask and Filter

Terminal environments inevitably have:

  • Transient network failures;

  • Sandbox startup failures;

  • Occasional tool calling timeouts and other issues.

If these abnormal signals are directly included in policy updates, they introduce noise into the optimization process. Therefore, we adopted a fairly general mask & filter strategy, following a simple principle:

Any sample that is harmful to training or cannot provide an effective learning signal can be masked or filtered out.

In high-noise, strongly environment-dependent agentic RL training, this is often the basic guarantee for maintaining training stability. Based on this idea, we explicitly classify failures into two categories:

Unrecoverable or large-scale errors (such as environment startup failure, sandbox unavailable): These samples are completely masked and replaced with placeholder samples to ensure batch size.

def handle_rollout_with_mask(rollout, failure_type):
    """
    rollout: one trajectory (episode-level)
    failure_type: describes what went wrong during rollout
    """
    # Unrecoverable or large-scale failures
    # e.g. env init failed, sandbox unavailable, reward computation broken
    if failure_type in {
        "env_init_failed",
        "sandbox_unavailable",
        "env_reset_failed",
        "reward_calculation_failed",
    }:
        # Create a placeholder rollout to keep batch shape stable
        placeholder = create_placeholder_rollout()
        # Mask all tokens so this sample contributes zero gradient
        placeholder.response_mask[:] = 0
        placeholder.advantages[:] = 0
        placeholder.rewards[:] = 0
        placeholder.meta["masked"] = True
        return placeholder
    # Normal rollout: keep as-is
    return rollout

Occasional and recoverable errors (such as tool timeout, network slowdown): These samples are filtered out and discarded under a globally controlled filter ratio (e.g., ≤50%) to avoid excessive retries.

class GroupFilterTB:
    def __init__(self, config: AgenticConfig, env_manager_config: EnvManagerConfig, mode: str):
        self.config = config
        self.env_manager_config = env_manager_config
        self.mode = mode
        self.global_filter_stats = {"total": 0, "filtered": 0}

    def filter(self, group_id: int, episode_id: int, group: list[DataProto]):
        """
        Decide whether to filter out an entire group of rollouts.
        """
        self.global_filter_stats["total"] += 1
        # Step 1: Check whether this group contains any rollout
        # that explicitly requests to be dropped
        # (e.g., due to tool timeout, transient execution error)
        should_drop = False
        for data in group:
            if data.meta_info.get("drop_flag", False):
                should_drop = True
                break
        # If no rollout indicates a drop condition, keep the group
        if not should_drop:
            return False
        # Step 2: Compute the current global filter ratio
        # This guards against pathological cases where
        # too many groups are dropped and training stalls
        current_global_filter_ratio = (
            self.global_filter_stats["filtered"] / self.global_filter_stats["total"]
            if self.global_filter_stats["total"] > 0 else 0.0
        )
        # If we already filtered too much globally, stop filtering
        if current_global_filter_ratio >= 0.5:
            return False
        # Also prevent the *next* filter from exceeding the limit
        if (self.global_filter_stats["filtered"] + 1) / self.global_filter_stats["total"] > 0.5:
            return False
        # Step 3: Drop this group and update global stats
        self.global_filter_stats["filtered"] += 1
        return True

Additionally, we introduce other types of mask operations during training as needed, such as max-turn mask, to further constrain the impact of abnormal trajectories on optimization.
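The max-turn mask mentioned above can be sketched as follows, with an assumed simplified rollout schema:

```python
def max_turn_mask(rollout: dict, max_turns: int) -> dict:
    """If an episode ended only because it exhausted the turn budget (no real
    terminal state was reached), mask its tokens so the truncated return does
    not bias the update. `rollout` here is a simplified dict stand-in for the
    actual trajectory object."""
    hit_budget = (
        len(rollout["turns"]) >= max_turns
        and not rollout.get("terminated", False)
    )
    if hit_budget:
        rollout["response_mask"] = [0] * len(rollout["response_mask"])
        rollout["masked"] = True
    return rollout
```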

As shown below, without mask & filter, training fluctuates greatly and accuracy is unstable; after adopting this strategy, the training curve is smoother and converges to significantly better performance.

image

Conservative Start: Learn from Positive Trajectories First

In the early stages, RL is often limited by data quality, not the optimization algorithm itself (when data quality is poor, even the best optimization methods struggle to work).

Our observation is: When data is not yet fully reliable, training only with positive trajectories is significantly more stable.

Of course, there is also a consensus: when data is reliable enough, using both positive and negative trajectories can bring better generalization.

We directly compared two training strategies. On large-scale synthetic data, updating with both positive and negative trajectories often crashes frequently, while training with only positive trajectories remains stable under various settings.

image

When switching to small-scale, high-quality, expert-verified data, the trend changes: both methods can train stably, but after adding negative trajectories, performance improvement on downstream test sets is more significant.

image

Based on this, we adopted a simple curriculum-style strategy:

  • Early stage: only use positive trajectory updates to build a stable policy manifold using large-scale instance data.

  • Later stage, when having small-scale but high-quality instances (usually expert-built and multi-round verified), then start considering positive and negative trajectory training simultaneously.

This curriculum approach both avoids early divergence and preserves later performance improvement space.
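The curriculum itself reduces to a small selection rule. Each trajectory below is assumed to carry a precomputed scalar advantage:

```python
def select_for_update(trajectories: list, stage: str) -> list:
    """Curriculum over trajectory signs: early on (large-scale synthetic data),
    update only from positive trajectories; later (small-scale expert-verified
    data), use both signs. The dict schema is an assumed simplification."""
    if stage == "early":
        return [t for t in trajectories if t["advantage"] > 0]
    return trajectories
```

Note that the kept trajectories still go through the standard RL update (masking, clipping, normalization), which is what distinguishes this from plain RFT.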

Difference from RFT: At first glance, updating only with positive samples may seem similar to Reinforcement Fine-Tuning (RFT), but there are clear differences in form and training dynamics:

  • The former's loss function is still the standard RL objective, not a purely behavior-cloning objective, so it usually generalizes better.

  • The former's policy update still follows the standard RL process, including masking, clipping, normalization, and other stabilization mechanisms, so it can naturally integrate sample filtering, fine-grained rewards, and train-inference inconsistency control strategies, all of which are particularly important in noisy terminal environments.

It should be emphasized that positive-only RL is not a replacement for RFT, but a more conservative RL training method.

Chunked MDP

In multi-round agentic tasks:

  • Most tokens don't change environment state;

  • A trajectory may contain multiple decision points;

  • In most cases, each interaction step corresponds to a specific decision or state transition.

Therefore we rethought the optimal optimization unit for agentic RL.

Core Idea

We propose modeling multi-round agentic interactions at the interaction chunk level. An interaction chunk refers to a continuous segment from one environment interaction to the next, usually ending with a tool call, forming a complete functional unit.

Rather than optimizing individual tokens or entire trajectories, we treat each chunk as a semantic "action unit."

On this basis, we proposed Interaction-Perceptive Agentic Policy Optimization (IPA), which includes:

  • Calculating returns and importance sampling at the chunk level rather than token level;

  • When the deviation between inference policy and training policy is too large, masking the entire chunk rather than token-by-token masking, better matching outcome-oriented coarse-grained reward structures;

  • Introducing chunk initialization resampling, and imitation learning + RL hybrid training, expanding the model's effective learning range on difficult tasks.

Overall, IPA re-anchors credit assignment, importance sampling, and learning signals to the unified interaction unit of "interaction chunk."
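A minimal sketch of the chunk-level machinery: segment tokens into interaction chunks ending at tool calls, then compute one importance weight per chunk, masking whole chunks whose train/rollout deviation is too large. The token schema (`logp_train`, `logp_rollout`, `is_tool_call`) and the threshold are assumed simplifications of the real implementation:

```python
import math

def split_into_chunks(tokens: list) -> list:
    """Segment a trajectory into interaction chunks: each chunk runs up to
    and including the next tool call, forming one functional action unit."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok["is_tool_call"]:
            chunks.append(current)
            current = []
    if current:  # trailing tokens after the last tool call
        chunks.append(current)
    return chunks

def chunk_weights(tokens: list, max_log_ratio: float = 1.0) -> list:
    """Chunk-level importance sampling: sum per-token log-ratios within each
    chunk. If the train/rollout deviation of a chunk is too large, mask the
    entire chunk (weight 0) instead of clipping token by token."""
    weights = []
    for chunk in split_into_chunks(tokens):
        log_ratio = sum(t["logp_train"] - t["logp_rollout"] for t in chunk)
        if abs(log_ratio) > max_log_ratio:
            weights.append((chunk, 0.0))            # drop the whole chunk
        else:
            weights.append((chunk, math.exp(log_ratio)))
    return weights
```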

Benefits

This design brings two practical benefits:

  • More stable gradients on long-horizon trajectories;

  • Improved upper bound of model learnability.

From experimental results, IPA consistently shows smoother gradients and stronger performance on difficult long-horizon tasks. The figure below shows direct comparison of token-level optimization vs chunk-level optimization:

image

Complete formulas for Chunked MDP, masking strategies, and chunk initialization resampling methods have been systematically organized in our technical report. Readers interested in complete technical details can refer to our technical report.

Below shows the training curve of the model trained with the final IPA algorithm:

image

Adaptively Applying RL Tricks

Why is Agentic RL harder?

When training agentic models, there are several prominent issues:

Heavy-tailed Distribution and Extreme Negative Returns

A few failed trajectories can be abnormally long (e.g., infinite retries, long loops, repeated tool calls), producing extremely large negative returns. These heavy-tailed samples can easily dominate gradients, causing policy distribution to shift to suboptimal regions and triggering training instability.

Shallow Policy Patterns with Positive Results

The model may not truly understand the task but instead rely on repeated trial-and-error, certain fixed command sequences, or shortcuts. Since the outcome reward only checks final results, such shallow patterns may be reinforced, gradually shrinking the policy space and hardening into rigid templates.

Noisy Failures

Failures are often diverse and their causes unclear: they are not necessarily the model's fault and may stem from environment randomness or system-level interference. Negative samples therefore generally carry lower confidence than positive samples.

From a macro perspective, the problems agentic RL faces are similar to those RLVR faces under outcome reward: credit assignment, unreliable negative samples, training imbalance. But in terminal environments these problems are more severe: longer time horizons, more discrete tool interactions, very few tokens that actually change the environment, more diverse failure modes, and larger variance in negative samples.

In agentic settings, the signal-to-noise ratio of training signals significantly decreases, making credit assignment and sample reliability issues more sensitive.

We usually take different mitigation strategies based on current dominant factors, for example:

  • selective trajectory masking

  • selective token masking

  • trajectory-level reweighting

  • retry-loop penalties

  • other light behavior shaping rewards or penalties, …

The core goal of these strategies is consistent: control which trajectories, which parts of trajectories, and with what weight participate in policy gradient updates.
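As one example of light behavior shaping, a retry-loop penalty can be as simple as charging for immediate command repeats, a common signature of brute-force retrying. The penalty weight below is illustrative, not a tuned value:

```python
def retry_loop_penalty(actions: list, penalty: float = 0.1) -> float:
    """Light behavior-shaping term: charge a small cost for each immediate
    repeat of the same command in a trajectory. Added to the outcome reward,
    this nudges the policy away from blind retry loops without dominating
    the learning signal."""
    repeats = sum(1 for a, b in zip(actions, actions[1:]) if a == b)
    return -penalty * repeats
```

Trajectory-level reweighting and selective masking follow the same pattern: a small, interpretable rule that controls how much a given trajectory (or part of it) contributes to the policy gradient.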

It should be emphasized that there is no universal solution. Under different data conditions, the same strategy may produce opposite effects.

As shown below, under two different data settings we observed completely opposite phenomena: in one case, removing the standard-deviation (std) normalization quickly leads to a training crash; in the other, the same operation actually makes training more stable (mainly due to data distribution differences).

image

This phenomenon is consistent with our previous analysis of various RL tricks in RLVR tasks. Interested readers can refer to our paper for more details: RL Tricks

Crashes are Normal, the Key is How to Resume

Due to various instability factors mentioned above, agentic RL is more prone to crashes during training. Therefore, we first need to establish a simple mindset:

In large-scale terminal RL training, crashes are normal.

Training Example

As shown below (red curve), the training score starts to drop sharply around step ~80. Looking back at earlier stages, we can see that before the crash began, the average advantage had already been in a clear, continuous decline.

Further analysis found:

  • From about step ~50, the maximum response length of failed trajectories rises rapidly;

  • But the number of failed trajectory samples basically remains unchanged.

This indicates the problem doesn't stem from an overall increase in failures, but mainly from the influence of a few extreme failed trajectories.

To mitigate this issue, we first mask failed trajectories whose response length exceeds 20k tokens, eliminating the influence of these extreme negative samples (down-weighting them serves the same goal). From the gray curve, we can see the average advantage starts to recover and the training process becomes stable.

image
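The extreme-trajectory mask described above is conceptually simple; the sketch below assumes a simplified rollout schema:

```python
def mask_extreme_failures(rollouts: list, max_fail_len: int = 20_000) -> list:
    """Mask failed trajectories whose response length exceeds the threshold
    (20k tokens in the run described above), so a few heavy-tailed negative
    returns stop dominating the gradient. The dict schema is an assumed
    stand-in for the actual trajectory object."""
    for r in rollouts:
        if r["reward"] <= 0 and r["response_len"] > max_fail_len:
            r["response_mask"] = [0] * len(r["response_mask"])
            r["masked"] = True
    return rollouts
```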

But after about 40 steps, instability appeared again. This time, the earliest signal was negative sample count gradually increasing. For this phenomenon, we globally reweight negative samples, lowering their overall contribution in policy updates. Results are shown as the orange curve.

image

Through this adjustment, training became stable again. This is just a simple example—throughout the training process, we experienced many similar moments.

When training shows instability, we usually prioritize checking these signals:

  • Are a few extreme trajectories dominating updates? (Typical features: abnormally long failed trajectories, with heavy-tailed negative returns) → Use masking, lower weights of these trajectories, and tighten clipping.

  • Are negative samples dominating overall? → Lower negative sample weights, filter low-confidence failure samples, or use curriculum training strategy.

  • Is the model learning "bad patterns"? → Introduce behavior penalties, more dimensional reward design, etc.

There are also two empirical principles:

  • Prioritize targeted handling of extreme trajectories when they can be located by clear features (e.g., masking extra-long negative samples); if training is still unstable, fall back to global reweighting.

  • RL gradients are usually noisier than supervised learning, so smaller learning rates, combined with stronger constraints, annealing, or adaptive mechanisms, are often more stable.

Fine-grained Behavior Monitoring and Penalties

In Agentic RL, reward hacking is often more hidden. Since agents interact with real environments, they can often pass test cases in "seemingly reasonable" ways.

In practice, we observed some recurring patterns:

  • Modifying established environment: The agent doesn't solve the task itself, but directly modifies the initial environment setup.

  • Tool overuse: Repeatedly calling tools to complete simple or trivial operations, essentially brute-force retrying.

  • Search abuse: Making up for insufficient internal reasoning by massively repeated search engine calls.

  • Unsafe or destructive operations: Executing high-risk commands, such as deleting all files or terminating all processes.

  • Hidden shortcuts: Exploiting vulnerabilities in test scripts or environment default configurations to pass tests without truly solving the task.

These phenomena mainly provide two insights.

First, the quality and robustness of test cases is crucial—weak or inadequately described tests may inadvertently reward wrong behavior.

Second, not all instances are suitable for RL training: some tasks themselves are difficult to accurately evaluate through tests, and are more likely to induce models to find shortcuts or form bad patterns, rather than learning real solutions.

It is worth mentioning that we performed very fine-grained monitoring of model behavior during training. For example, we track signals like:

  • Success rate trends for different tasks;

  • Success/failure rates for different tools;

  • Repeated or looping tool call patterns;

  • Frequency of different tool usage;

  • Frequency of different command usage.

Through these signals, we can quickly discover tasks with abnormal behavior, such as sudden surge of a certain tool call, large number of retry loops, or frequent "kill process" type commands. Once similar patterns are detected, we rollback training, or locate and remove instances causing problems.

Based on our experience, this continuous, fine-grained monitoring and dynamic adjustment is also the key to ensuring long-term stable and effective agentic RL training (especially in preventing hidden reward hacking behaviors).
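A sketch of the command-frequency surge detector behind this kind of monitoring. The threshold factor and the schema are assumptions for illustration:

```python
from collections import Counter

def command_frequencies(trajectories: list) -> Counter:
    """Aggregate command usage across a window of trajectories.
    Each trajectory is assumed to expose a `commands` list."""
    return Counter(cmd for traj in trajectories for cmd in traj["commands"])

def detect_surges(prev: Counter, cur: Counter, factor: float = 3.0) -> list:
    """Flag commands whose frequency jumped by more than `factor` relative to
    the previous window. Surges of test-script calls or kill-process commands
    are early signals of reward hacking worth investigating."""
    return sorted(
        cmd for cmd, n in cur.items()
        if n > factor * max(prev.get(cmd, 0), 1)
    )
```

When a surge is flagged, we inspect the corresponding trajectories and, if a hacking pattern is confirmed, roll back training or remove the offending instances, as described above.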

Observability of Environment Services

Visual observability of the sandbox environment services is also very important.

The figure below shows sandbox concurrent monitoring during one of our large-scale training sessions. Different colors represent different environment groups. At peak stages, the system maintains thousands of concurrent sessions simultaneously. Spikes in the chart often correspond to instantaneous load surges or recycling delays of certain environment groups.

In fact, we have encountered training anomalies caused by environment concurrency fluctuations at multiple stages. These problems are hard to locate without system-level service observation.

image

Our training could not have happened without the long-term support of the ROCK engineering team. The sandbox management system ROCK that we used has been open-sourced; we welcome readers to check it out.

Summary

Agentic RL is full of details because it is essentially a highly coupled system: data, environment, rewards, scheduling, optimization... any small issue can be amplified into a crash dozens of steps later.

So in the early stages a lot of investigation is unavoidable: watching curves, combing through logs, visualizing trajectories, locating anomalies, and rolling back experiments (our daily routine).

But once these key links are sorted out and visual monitoring is in place, subsequent training goes much more smoothly: curves are basically stable, abnormal patterns can be caught in advance, and crashes can almost always be traced.

Those seemingly tedious checks in the early stage are actually laying the foundation for later large-scale stable training.

And once this foundation is solid, many things will unfold naturally.

Looking Forward

From a modeling perspective, terminal environments are actually closer to Partially Observable Markov Decision Processes (POMDPs).

Agents usually cannot directly observe the complete environment state: the full file-system structure, installed software versions, previously modified configurations, and past failed attempts are all hard to surface in full.

Many problems we encountered in training can essentially be traced to two old problems: partial observability and long-term credit assignment.

These problems are not new, but are further amplified in agentic scenarios. We believe agentic RL has several directions worth exploring:

Exploring More Complex Long-horizon Tasks and Effective Agentic Patterns

On one hand, we do need more complex, more realistic long-horizon tasks (most tasks in terminal-bench are not truly long-horizon).

On the other hand, we believe that some agent capabilities are, at least at the current stage, hard to make emerge spontaneously through RL alone. The internet contains plenty of text recording how humans think, but rarely records the complete execution process of how humans finish complex tasks: because these tasks often span platforms, devices, and time periods, complete trajectories are hard to collect.

Proactively mining and reinforcing effective agentic behavior patterns is therefore worth serious thought.

More Realistic Agent-Environment-Human Closed-loop Optimization

In real applications, agents do not face a static environment with fixed tool interfaces, but a continuously evolving system of agent, environment, and human. Users may supplement information, modify requirements, or correct errors at any time, or even change the environment itself directly.

In such dynamic scenarios, agents cannot just blindly execute instructions. They need to learn to actively seek information, confirm promptly when uncertain, and update their judgments after receiving feedback. Incorporating these "question, feedback, belief update" processes into training and evaluation builds an optimization framework closer to human-in-the-loop, online RL.

Stronger Infrastructure and More Open Environments

Agentic RL is a real test of engineering capability: it requires high-concurrency, low-blocking, scalable environment execution; sufficiently stable, highly asynchronous training frameworks to reduce time costs; and, beyond that, infrastructure able to support continued model scaling.

Meanwhile, many current terminal environments still rely heavily on manual configuration: fixed mirror sources, permission-boundary controls, pre-installed software, and execution spaces limited to a single machine or Docker container.

These constraints invisibly limit the model's exploration space. If we want agents with stronger generalization, we likely need higher-level, more open, more evolvable environment systems, combined with reward designs that adapt dynamically to the environment rather than static reward rules.

Finer-grained Credit Assignment and Reward Modeling

Compared to RLVR, agentic RL has many more intermediate signals to exploit, such as tool execution success or failure, subtask completion status, and environment-state consistency checks. But we do not believe that relying on complex hand-designed reward rules (such as a fixed -0.5 penalty for tool failure) is a sustainable solution.
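One lightweight alternative to baking such signals into the reward is to log them per step as diagnostics. This sketch assumes a hypothetical `StepSignal` record per tool call (the names and fields are ours for illustration, not part of any framework mentioned above):

```python
from dataclasses import dataclass

@dataclass
class StepSignal:
    tool: str           # name of the tool invoked at this step
    success: bool       # did the tool call execute without error?
    subtask_done: bool  # did this step complete a checkable subtask?

def summarize(trajectory):
    """Aggregate intermediate signals over one trajectory.
    These feed monitoring dashboards, not the reward function."""
    total = len(trajectory)
    return {
        "tool_success_rate": (
            sum(s.success for s in trajectory) / total if total else 0.0
        ),
        "subtasks_completed": sum(s.subtask_done for s in trajectory),
    }
```

Keeping these quantities as observability signals rather than reward terms sidesteps the brittleness of hand-tuned penalties, while leaving the door open to principled credit-assignment methods later.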

Other Interesting Findings

Parallel Function Calling

The figure below compares the trajectories of qwen3-coder-plus, glm-4.6, claude-sonnet-4.5, and ROME on the same batch of tasks, showing the frequency and distribution of parallel function calling within a single assistant step.

We observed that claude-sonnet-4.5 exhibits significantly higher parallelism, both in how often it issues parallel tool calls and in how many tools it calls simultaneously in a single step.

image

Further analysis found that claude-sonnet-4.5 is often better at identifying what information is truly needed before executing concrete operations. It usually does not jump straight into execution; instead, it first identifies the key uncertainties in the environment and gathers information through multiple parallel "check-type" calls in the same step.

For example, when given the task "Install Anaconda for me", claude-sonnet-4.5 performs multiple checks in a single step, including but not limited to:

  • Using pwd, ls, cat, grep and other commands to check directory structure and configuration status;

  • Using python -V and pip list to identify existing Python environment and dependencies;

  • Using read_file and search to find feasible installation methods and related constraints (such as network connectivity, mirror availability).

In our experience, this parallel calling is concentrated mainly on check-type tools, rather than on execution or editing operations that directly modify the environment.

This pattern offers a valuable insight for agent design and training: explicitly encouraging a preliminary, parallel information-gathering phase before any state change may be beneficial.
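The parallelism statistics behind the comparison above can be computed from trajectories alone. A minimal sketch, assuming each trajectory is represented as a list of assistant steps where each step is the list of tool calls issued in that step (this representation and the function name are our illustrative assumptions):

```python
def parallelism_stats(steps):
    """Per-trajectory parallel-call statistics.
    `steps` is a list where each element is the list of tool calls
    the assistant issued within a single step.
    """
    multi = [s for s in steps if len(s) > 1]  # steps with parallel calls
    return {
        # Fraction of assistant steps that issued more than one call.
        "parallel_step_frac": len(multi) / len(steps) if steps else 0.0,
        # Widest fan-out observed in any single step.
        "max_calls_per_step": max((len(s) for s in steps), default=0),
    }
```

Averaging these two numbers across a batch of tasks is enough to reproduce the kind of model-to-model comparison shown in the figure.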

Common Failure Modes

In analyzing trajectories, we found the two most common failure modes in terminal-style agentic tasks: unproductive loops and timeouts.

Agents often repeat the same strategy even in the face of clear failure signals, without switching approaches or re-examining their assumptions, forming long, unproductive interaction chains.

Timeouts are another important failure source: tracing back to the model, it often lacks a reliable sense of how long a long-running command will take, is easily misled by default timeout mechanisms, and consequently misjudges the situation or retries repeatedly.
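One mitigation, sketched below under our own assumptions (this is not the mechanism of any framework named in this post), is to surface explicit timing information to the agent instead of an opaque failure when a command times out:

```python
import subprocess
import time

def run_with_timeout(cmd, timeout_s=60):
    """Run a shell command; report elapsed time and status explicitly,
    so the agent's context carries timing info rather than a bare error."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        stdout = proc.stdout
    except subprocess.TimeoutExpired:
        status, stdout = f"timeout after {timeout_s}s", ""
    elapsed = time.monotonic() - start
    return {"status": status, "elapsed_s": round(elapsed, 2), "stdout": stdout}
```

Feeding back "timeout after 60s" together with the elapsed time gives the model something to reason about (is the command slow, or hung?), rather than inviting a blind retry of the same strategy.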

Besides these two dominant patterns, we also observed other issues, such as hallucinations, inappropriate tool selection, and violations of task constraints.

The figure below shows error type distribution of multiple models on Terminal Bench:

image

Summary

The core challenges we encountered—long-horizon credit assignment, partial observability, noisy failures, and fragile environments—are not new problems in reinforcement learning.

The real difficulty lies in how to build a stable and reliable whole system (training framework, sandbox environment, agent framework, etc.).

Of course, agentic RL is still in its early stages. Many of the techniques mentioned in this article may not be best practices or final answers; they are mainly lessons drawn from our own experiments.

Looking forward, as environments become more open and tasks become more complex, we believe true progress will come from closer co-design and integration of optimization objectives, environments, and training frameworks.

We hope this blog and previous technical reports can help researchers and engineers training agentic models in real environments, perhaps helping them avoid some detours we once took.

English Version: https://www.notion.so/The-Bitter-Lesson-Behind-Building-Agentic-RL-in-Terminal-Environments-2eaddd45837f80c9ad2ed6a15ef3c1a1?pvs=21
🚀ROLL TEAM: https://wwxfromtju.github.io/roll_team.html
📄 Technical Report: https://arxiv.org/pdf/2512.24873
🧠 Model: https://huggingface.co/FutureLivingLab/iFlow-ROME
🧩 Framework:
RL Training Framework: https://github.com/alibaba/ROLL
Sandbox Environment Management: https://github.com/alibaba/ROCK
Agent Framework: https://github.com/iflow-ai/iflow-cli
📊 Benchmarks: https://github.com/alibaba/terminal-bench-pro

Reference Links

[1] ROLL Flash: https://arxiv.org/pdf/2510.11345

[2] ROLLART: https://www.arxiv.org/pdf/2512.22560

[3] Let It Flow: https://arxiv.org/pdf/2512.24873

[4] RL Tricks: https://arxiv.org/abs/2508.08221

[5] ROCK: https://github.com/alibaba/ROCK
