Hello everyone, I am PaperAgent, not an Agent!
Google recently published two papers that advance the field of Multi-Agent Reinforcement Learning (MARL) from the perspectives of mechanism design and automated discovery.
Overview
The two papers are summarized as follows:
- Paper 1, "Multi-agent cooperation through in-context co-player inference" (published February 19, 2026), addresses how multi-agent cooperation can be achieved through in-context co-player inference.
- Paper 2, "Discovering Multiagent Learning Algorithms with Large Language Models" (published February 24, 2026), explores automated discovery of multi-agent learning algorithms using large language models.
Multi-agent Cooperation
1.1 Background and Challenges
In Multi-Agent Reinforcement Learning (MARL), achieving robust cooperation among self-interested agents is a fundamental challenge. Traditional methods face two major difficulties:
1) Equilibrium selection: general-sum games admit multiple Nash equilibria, and independently optimizing agents often converge to suboptimal outcomes (e.g., mutual defection in social dilemmas).
2) Environmental non-stationarity: from any single agent's perspective, the simultaneous learning of the other agents keeps changing the environment's dynamics.
Existing "co-player learning awareness" methods typically rely on hard-coded assumptions or on a strict separation of timescales between "naive learners" and "meta-learners".
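To make the equilibrium-selection problem concrete, here is the one-shot Prisoner's Dilemma with the conventional textbook payoffs (these numbers are illustrative, not taken from the paper): mutual cooperation pays more than mutual defection, yet defection is each player's best response.

```python
# One-shot Prisoner's Dilemma, conventional payoffs (R=3, S=0, T=5, P=1).
# Row player's payoff, indexed [my_action][opponent_action]; 0 = cooperate, 1 = defect.
PAYOFF = [[3, 0],
          [5, 1]]

def best_response(opponent_action):
    """Action maximizing the row player's payoff against a fixed opponent action."""
    return max((0, 1), key=lambda a: PAYOFF[a][opponent_action])

# Defection dominates: best_response(0) == best_response(1) == 1, so independent
# optimizers land on mutual defection (payoff 1) despite mutual cooperation paying 3.
```

This is exactly the suboptimal equilibrium that independently optimizing agents converge to in social dilemmas.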
1.2 Core Innovation: Contextual Co-player Inference
The core hypothesis of this paper is that training sequence-model agents against a diverse pool of co-players can naturally induce contextual best-response strategies without explicit meta-gradients or timescale separation.
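A minimal sketch of the training scheme this hypothesis implies, assuming a standard episode loop (all function names here are hypothetical placeholders, not the paper's API): sampling a fresh co-player from a diverse pool each episode forces the sequence model to adapt to the current co-player in-context.

```python
import random

def train_in_context_agent(agent, co_player_pool, num_episodes, play_episode, update):
    """Hypothetical loop: diversity enters only through the sampled co-player pool."""
    for _ in range(num_episodes):
        co_player = random.choice(co_player_pool)    # fresh, diverse co-player each episode
        trajectory = play_episode(agent, co_player)  # agent observes co-player behavior in context
        update(agent, trajectory)                    # gradient step on the sequence model
    return agent
```

No meta-gradient or inner/outer loop appears anywhere; any best-response behavior must emerge from conditioning on the in-episode context.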
1.3 Three-step causal chain of collaboration mechanism
The paper validates a complete causal chain from diversity to collaboration through systematic experiments:
Step 1: Diversity induces contextual best-response mechanism
When agents are trained only against a randomized pool of tabular agents, they learn to identify the current opponent within a single game and quickly converge to a best response.
Step 2: Contextual learners are vulnerable to exploitation
The agents from Step 1 are frozen as "Fixed In-Context Learners" (Fixed-ICL), and new agents are trained specifically to exploit them. These exploiters learn to shape the learning dynamics of the Fixed-ICL agents to extract higher payoffs—this is the exploitation strategy.
Step 3: Mutual exploitation drives collaboration
When two exploitation-initialized agents compete against each other, they mutually shape each other's contextual learning dynamics, ultimately converging to cooperative behavior.
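The three steps above can be outlined as a pipeline (a hypothetical sketch; the stage functions are placeholders for the paper's actual training procedures):

```python
def three_step_pipeline(train, freeze, exploit_train, self_play):
    """Hypothetical outline of the paper's three experimental stages."""
    # Step 1: training against a diverse pool induces in-context best response
    icl_agent = train(co_players="diverse tabular pool")
    # Step 2: freeze the learner and train an exploiter against it
    fixed_icl = freeze(icl_agent)
    exploiter = exploit_train(opponent=fixed_icl)
    # Step 3: two exploiter-initialized agents shape each other toward cooperation
    return self_play(exploiter, exploiter)
```

The causal chain is thus diversity → in-context best response → exploitability → mutual shaping → cooperation.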
1.4 Key Conclusions
Findings and implications:
- In-context learning acts as the "fast timescale" naive learner, eliminating the need for explicit meta/inner optimization loops.
- A mixed training pool is critical; without diversity, the in-context best-response mechanism degenerates.
- Exploitation vulnerability as a driver of cooperation reveals a new mechanism for how cooperation emerges in social dilemmas.
Theoretical contribution: The paper proposes the Predictive Policy Improvement (PPI) algorithm and proves that under perfect world model assumptions, predictive equilibrium corresponds to Subjective Embedded Equilibrium.
AlphaEvolve: Automatically Discovering Multi-Agent Learning Algorithms
2.1 Background
Algorithm design in Multi-Agent Reinforcement Learning has long relied on manual iterative optimization. While foundational methods such as CFR (Counterfactual Regret Minimization) and PSRO (Policy-Space Response Oracles) have solid theoretical bases, their most effective variants often depend on human intuition to navigate a vast algorithm design space.
This paper proposes using AlphaEvolve—an evolutionary coding agent powered by large language models—to automatically discover new multi-agent learning algorithms.
2.2 Method Framework: AlphaEvolve
AlphaEvolve combines the code generation capabilities of LLMs with the rigorous selection pressure of evolutionary algorithms:
Loop:
1. Select parent algorithms based on fitness.
2. Use LLM (Gemini 2.5 Pro) to propose semantically meaningful code modifications.
3. Automatically evaluate candidate algorithms on proxy games.
4. Add valid candidates to the population.
2.3 Discovery One: VAD-CFR (Volatility-Adaptive Discounted CFR)
In the CFR domain, AlphaEvolve discovered Volatility-Adaptive Discounted (VAD-)CFR, which includes three non-intuitive mechanisms:
Mechanism 1: Volatility-adaptive discounting: Dynamically adjusts discount parameters based on instantaneous regret magnitude via EWMA. Traditional methods use fixed discount factors.
Mechanism 2: Asymmetric instantaneous boosting: Scales positive instantaneous regret by a factor of 1.1 before accumulation. Traditional methods treat positive and negative regret symmetrically.
Mechanism 3: Hard warm-start + regret magnitude weighting: Delays strategy averaging until iteration 500 and weights by regret magnitude. Standard CFR starts linear averaging from t=1.
Key code structure (simplified, filled out here into runnable form; the schedule behind disc_pos/disc_neg is not given in the summary, so it is left as parameters):

```python
class RegretAccumulator:
    """Volatility-Adaptive Discounting & Asymmetric Boosting."""

    def __init__(self):
        self.ewma = 0.0  # EWMA of instantaneous regret magnitude

    def update_accumulate_regret(self, info_state_node, iteration_number,
                                 cfr_regrets, disc_pos, disc_neg):
        # 1. Compute volatility signal via EWMA of the largest instantaneous regret
        inst_mag = max(abs(r) for r in cfr_regrets.values())
        self.ewma = 0.1 * inst_mag + 0.9 * self.ewma
        volatility = min(1.0, self.ewma / 2.0)  # drives the adaptive discount parameters
        for action, r in cfr_regrets.items():
            # 2. Asymmetric boosting: only positive instantaneous regret is scaled by 1.1
            r_boosted = r * 1.1 if r > 0 else r
            # 3. Sign-dependent discount on the accumulated historical regret
            prev_R = info_state_node.cumulative_regret.get(action, 0.0)
            discount = disc_pos if prev_R >= 0 else disc_neg
            info_state_node.cumulative_regret[action] = discount * prev_R + r_boosted
        return volatility
```
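Mechanism 3 can likewise be sketched as a strategy-averaging weight (the function and exact weighting form are illustrative assumptions; the summary only states the warm-start threshold and regret-magnitude weighting):

```python
def averaging_weight(iteration_number, regret_magnitude, warm_start=500):
    """Weight of the current strategy iterate in the running average."""
    if iteration_number < warm_start:
        return 0.0  # hard warm-start: early, noisy strategies are ignored entirely
    return regret_magnitude  # then each iterate is weighted by its regret magnitude

# Contrast: standard CFR uses linear averaging (weight t) starting from t = 1.
```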
2.4 Discovery Two: SHOR-PSRO (Smoothed Hybrid Optimistic Regret PSRO)
In the PSRO domain, AlphaEvolve discovered Smoothed Hybrid Optimistic Regret (SHOR-)PSRO, whose core innovation is:
Hybrid meta-solver architecture:
- Optimistic Regret Matching (ORM): Provides stability.
- Smoothed Best Pure Strategy (Softmax): Actively biases towards high-reward modes via temperature-controlled softmax.
- Dynamic annealing schedule: Mixing factor λ anneals from 0.3 to 0.05, diversity reward decays from 0.05 to 0.001.
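The hybrid combination can be sketched as a convex mixture (a minimal illustration; the actual ORM update and the temperature schedule are not specified in this summary):

```python
import math

def hybrid_meta_strategy(orm_strategy, payoffs, lam, temperature=1.0):
    """Mix an optimistic-regret-matching strategy with a smoothed best pure strategy."""
    # Temperature-controlled softmax biases probability mass toward high-reward modes
    mx = max(payoffs)
    exps = [math.exp((p - mx) / temperature) for p in payoffs]
    z = sum(exps)
    softmax = [e / z for e in exps]
    # Convex mixture: lam is the (annealed) mixing factor
    return [(1 - lam) * s + lam * b for s, b in zip(orm_strategy, softmax)]
```

Low temperature sharpens the softmax toward the best pure strategy; the annealed λ then controls how strongly that bias overrides the stable ORM component.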
Asymmetric design for training and evaluation:
- Mixing factor λ: Training 0.3 → 0.05 (annealing), Evaluation fixed 0.01.
- Diversity reward: Training 0.05 → 0.001 (decay), Evaluation 0.0.
- Return strategy: Training average strategy, Evaluation last iteration strategy.
- Internal iterations: Training 1000 + 20×(population size-1), Evaluation 8000 + 50×(population size-1).
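The training-time schedules listed above can be written out directly (linear interpolation over training progress is my assumption; the summary gives only the endpoints):

```python
def anneal(start, end, progress):
    """Linear schedule; progress runs from 0.0 to 1.0 over training."""
    return start + (end - start) * progress

def shor_psro_schedules(progress, population_size):
    """Training-time hyperparameters; evaluation instead fixes lambda=0.01, diversity=0."""
    return {
        "lambda": anneal(0.3, 0.05, progress),            # mixing factor
        "diversity_reward": anneal(0.05, 0.001, progress),
        "train_iters": 1000 + 20 * (population_size - 1),
        "eval_iters": 8000 + 50 * (population_size - 1),
    }
```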
2.5 Full game test results
Summary of the Two Papers
Comparison across key dimensions:
- Core problem: Paper 1 asks how cooperation emerges naturally; Paper 2 asks how effective algorithms can be discovered automatically.
- Key insight: Paper 1 shows in-context learning can substitute for explicit meta-learning; Paper 2 shows LLMs can evolve non-intuitive symbolic algorithms.
- Methodology: Paper 1 uses decentralized MARL with diversity-based training; Paper 2 uses an evolutionary algorithm with LLM code generation.
- Validation environments: Paper 1 uses the Iterated Prisoner's Dilemma; Paper 2 uses Kuhn Poker, Leduc Poker, Goofspiel, and Liar's Dice.
- Practical significance: Paper 1 provides a scalable path for foundation-model multi-agent systems; Paper 2 shifts algorithm design from manual tuning toward automated discovery.
Paper links:
Discovering Multiagent Learning Algorithms with Large Language Models
Multi-agent cooperation through in-context co-player inference
Recommended Reading
Hands-on Design of AI Agents: (Orchestration, Memory, Plugins, Workflow, Collaboration)
Sharing Two Latest Papers on Claude Skills, with 3 Core Conclusions
2026, New Trend: World Model × Embodied Intelligence Latest Survey
2026, To Do Agentic AI, You Cannot Avoid These Two Opening Reviews
One paper a day to keep our thinking sharp~ If you've read this far, how about a like 👍, a favorite ❤️, and a share ↗️, plus a star ⭐ so you don't lose us?