Reinforcement learning (RL) has enabled Large Language Model (LLM) agents to interact with environments and solve multi-turn, long-horizon tasks. However, agents trained via standard RL often struggle with tasks that require active exploration, failing to adapt effectively from trial-and-error experience. The paper proposes LAMER (LLM Agent with Meta-RL), a general meta-reinforcement learning framework that enables LLM agents to actively explore and learn from environmental feedback at test time.
Core Problem: The Balance of Exploration and Exploitation
Unlike humans, who can systematically explore and adapt quickly to new environments, LLM agents cannot explore robustly without significant intervention. Existing work either guides LLM exploration offline or induces exploration strategies from offline search trajectories. These methods, however, either focus on single-turn, non-agentic reasoning problems or rely on offline data, limiting them to imitation rather than active exploration.
The paper formulates the exploration-exploitation balance of multi-turn tasks as a cross-episode reinforcement learning problem. Since multi-turn tasks typically emit a sparse success signal only after an episode ends, the paper adopts a multi-episode mechanism that treats the episode as the unit of exploration and exploitation. Training across multiple similar but distinct environments yields a meta-reinforcement learning setup: the agent is forced to discover general strategies that remain effective in unseen, potentially harder environments.
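One standard way to write this cross-episode objective is as an expectation over tasks of the total return summed across episodes; the notation below is ours, and the paper's exact formulation may differ.

```latex
% Cross-episode meta-RL objective (notation assumed, not from the paper):
% sample a task M, roll out K episodes with a policy that conditions each
% episode on all earlier episodes, and maximize the summed return.
J(\theta) = \mathbb{E}_{M \sim p(M)}\,
  \mathbb{E}_{\tau_1,\dots,\tau_K \sim \pi_\theta(\cdot \mid \tau_{1:k-1},\, M)}
  \left[ \sum_{k=1}^{K} R(\tau_k) \right]
```

Because the sum runs over all $K$ episodes, exploration in early episodes that raises the return of later episodes is rewarded directly, rather than being penalized as it would be under a single-episode objective.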
LAMER Framework Design
LAMER includes two key design principles:
(1) Cross-Episode Training Framework: Unlike standard single-episode RL, LAMER is designed around a multi-episode structure, training the agent to solve problems through trial and error. In early episodes, the agent is encouraged to collect diverse experiences and environmental feedback, then uses this information to adjust strategies in subsequent episodes. By maximizing long-term rewards across episodes, the agent internalizes a learning algorithm that explicitly incentivizes exploration to improve downstream exploitation.
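To make the cross-episode structure concrete, here is a minimal, self-contained sketch in Python. The toy environment, policy, and episode budget are illustrative assumptions, not details from the paper.

```python
# Toy cross-episode trial: the agent must discover which of NUM_ACTIONS
# actions is rewarded (HIDDEN), with reward revealed only on success.
NUM_ACTIONS = 5
HIDDEN = 3  # hypothetical rewarding action for this environment instance

def env_step(action):
    reward = 1.0 if action == HIDDEN else 0.0
    return reward, reward > 0  # the episode ends on success

def toy_policy(history):
    """Exploit any action already observed to earn reward; otherwise explore."""
    for _, action, reward in history:
        if reward > 0:
            return action          # exploitation in later episodes
    tried = {a for _, a, _ in history}
    for a in range(NUM_ACTIONS):
        if a not in tried:
            return a               # systematic exploration in early episodes
    return 0

def run_trial(policy, num_episodes=3, max_turns=10):
    """Roll out several episodes; later episodes see all earlier experience."""
    history, returns, turns_used = [], [], []
    for ep in range(num_episodes):
        total, turns = 0.0, 0
        for _ in range(max_turns):
            action = policy(history)
            reward, done = env_step(action)
            history.append((ep, action, reward))
            total += reward
            turns += 1
            if done:
                break
        returns.append(total)
        turns_used.append(turns)
    # Cross-episode training maximizes sum(returns), so exploration that
    # speeds up later episodes is directly rewarded.
    return returns, turns_used
```

Running `run_trial(toy_policy)` succeeds in every episode but needs four turns in the first episode and only one turn in each later episode: early exploration pays off as later exploitation, which is exactly the behavior the cross-episode objective incentivizes.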
(2) Reflection-based Contextual Policy Adaptation: During both training and testing, the agent uses feedback and reflections from previous episodes to set its strategy for the next episode. This effectively implements an RL algorithm in context, which makes the method a natural fit for LLM agents.
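A minimal sketch of how reflections from earlier episodes might be assembled into the next episode's context. The prompt template, field names, and summary format below are hypothetical; the paper's actual template is not specified here.

```python
def build_adaptation_prompt(task, past_episodes):
    """Compose the next episode's context from prior trials and reflections.

    `past_episodes` is a list of dicts with hypothetical keys:
    'actions' (list of action strings), 'success' (bool), 'reflection' (str).
    """
    parts = [f"Task: {task}"]
    for i, ep in enumerate(past_episodes, 1):
        outcome = "success" if ep["success"] else "failure"
        parts.append(f"Episode {i} actions: {' -> '.join(ep['actions'])}")
        parts.append(f"Episode {i} outcome: {outcome}")
        parts.append(f"Episode {i} reflection: {ep['reflection']}")
    parts.append("Using the reflections above, choose a better strategy "
                 "for the next episode.")
    return "\n".join(parts)
```

Conditioning the policy on this growing context is what lets a frozen set of weights behave like a learning algorithm at test time: the "update" happens in the prompt rather than in the parameters.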
Meta-RL produces more diverse samples while achieving higher performance, balancing exploration and exploitation better than standard RL.
Experimental Results
The paper evaluates LAMER on four challenging long-horizon tasks: Sokoban, MineSweeper, Webshop, and ALFWorld. Using Qwen-3 4B, LAMER consistently outperformed prompting and RL baselines in all environments:
(1) On Sokoban, an absolute improvement of 11% over the RL baseline.
(2) On MineSweeper, an absolute improvement of 14% over the RL baseline.
(3) On Webshop, an absolute improvement of 19% over the RL baseline.
Visualization of the Exploration-Exploitation Trade-off
The paper contrasts RL and Meta-RL training in the MineSweeper environment: Meta-RL training achieves higher success rates while retaining more of the base model's sample diversity, realizing a better exploration-exploitation trade-off. Aggregating the empirical probability distributions over many sampled trajectories shows that Meta-RL-trained models generate more diverse, more exploratory trajectories.
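One simple way to quantify sample diversity is the Shannon entropy of the empirical action distribution pooled over sampled trajectories. The paper's exact diversity metric is not given here, so this is an illustrative proxy rather than the authors' measurement.

```python
import math
from collections import Counter

def action_entropy(trajectories):
    """Shannon entropy (in nats) of the empirical action distribution pooled
    over sampled trajectories; higher values indicate more diverse sampling."""
    counts = Counter(a for traj in trajectories for a in traj)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A model that always replays one action sequence scores zero, while a model whose samples spread uniformly over the action set scores the maximum, `log` of the number of distinct actions.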
Generalization Capability
The paper demonstrates that LAMER-trained models generalize better to harder and out-of-distribution tasks. The trained models learn to balance exploration and exploitation, outperforming standard RL in test-time scaling as measured by pass@k.
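pass@k is commonly computed with the unbiased estimator introduced by Chen et al. (2021): draw k samples from n attempts of which c succeeded, and report the probability that at least one of the k is a success. Whether the paper uses exactly this estimator is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), the chance that
    a random size-k subset of n attempts (c of them correct) contains a hit."""
    if n - c < k:
        return 1.0  # every size-k subset must include at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 attempts of which 2 succeeded, pass@1 is 0.5; with no successes it is 0 for any k.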
To the best of the authors' knowledge, this is the first application of a meta-RL framework to LLM agent training. Overall, LAMER represents a step toward autonomous agents that proactively discover information and improve their decision-making in new environments.
Original Article Title: META-RL INDUCES EXPLORATION IN LANGUAGE AGENTS
Article Link: https://arxiv.org/pdf/2512.16848