Hello everyone, I am PaperAgent, not an Agent!
Google recently published two papers that advance the field of Multi-Agent Reinforcement Learning (MARL) from the perspectives of mechanism design and automated discovery.
Overview
The two papers are summarized as follows:
- Paper 1, "Multi-agent cooperation through in-context co-player inference" (published February 19, 2026), addresses how multi-agent cooperation can be achieved through in-context co-player inference.
- Paper 2, "Discovering Multiagent Learning Algorithms with Large Language Models" (published February 24, 2026), explores automated discovery of multi-agent learning algorithms using large language models.
Multi-agent Cooperation
1.1 Background and Challenges
In Multi-Agent Reinforcement Learning (MARL), achieving robust cooperation among self-interested agents is a fundamental challenge. Traditional methods face two major difficulties:
1) Equilibrium selection: general-sum games admit multiple Nash equilibria, and independently optimizing agents often converge to suboptimal outcomes (e.g., mutual defection in social dilemmas).
2) Environmental non-stationarity: from any single agent's perspective, the simultaneous learning of the other agents keeps changing the environment's dynamics.
Existing "co-player learning awareness" methods typically rely on hard-coded assumptions or on a strict separation of timescales between "naive learners" and "meta-learners".
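To make the equilibrium-selection problem concrete, here is the one-shot Prisoner's Dilemma with the conventional textbook payoffs (these numbers are illustrative, not taken from the paper): mutual cooperation pays more than mutual defection, yet defection is each player's best response.

```python
# One-shot Prisoner's Dilemma, conventional payoffs (R=3, S=0, T=5, P=1).
# Row player's payoff, indexed [my_action][opponent_action]; 0 = cooperate, 1 = defect.
PAYOFF = [[3, 0],
          [5, 1]]

def best_response(opponent_action):
    """Action maximizing the row player's payoff against a fixed opponent action."""
    return max((0, 1), key=lambda a: PAYOFF[a][opponent_action])

# Defection dominates: best_response(0) == best_response(1) == 1, so independent
# optimizers land on mutual defection (payoff 1) despite mutual cooperation paying 3.
```

This is exactly the suboptimal equilibrium that independently optimizing agents converge to in social dilemmas.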
1.2 Core Innovation: Contextual Co-player Inference
The core hypothesis of this paper is that training sequence-model agents against a diverse pool of co-players can naturally induce contextual best-response strategies without explicit meta-gradients or timescale separation.
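A minimal sketch of the training scheme this hypothesis implies, assuming a standard episode loop (all function names here are hypothetical placeholders, not the paper's API): sampling a fresh co-player from a diverse pool each episode forces the sequence model to adapt to the current co-player in-context.

```python
import random

def train_in_context_agent(agent, co_player_pool, num_episodes, play_episode, update):
    """Hypothetical loop: diversity enters only through the sampled co-player pool."""
    for _ in range(num_episodes):
        co_player = random.choice(co_player_pool)    # fresh, diverse co-player each episode
        trajectory = play_episode(agent, co_player)  # agent observes co-player behavior in context
        update(agent, trajectory)                    # gradient step on the sequence model
    return agent
```

No meta-gradient or inner/outer loop appears anywhere; any best-response behavior must emerge from conditioning on the in-episode context.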
1.3 Three-step causal chain of collaboration mechanism
The paper validates a complete causal chain from diversity to collaboration through systematic experiments:
Step 1: Diversity induces contextual best-response mechanism
When agents are trained only against a randomized pool of tabular agents, they learn to identify the current opponent within a single game and quickly converge to a best response.
Step 2: Contextual learners are vulnerable to exploitation
The agents from Step 1 are frozen as "Fixed In-Context Learners" (Fixed-ICL), and new agents are trained specifically to exploit them. These exploiters learn to shape the learning dynamics of the Fixed-ICL agents to extract higher payoffs—this is the exploitation strategy.
Step 3: Mutual exploitation drives collaboration
When two exploitation-initialized agents compete against each other, they mutually shape each other's contextual learning dynamics, ultimately converging to cooperative behavior.
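The three steps above can be outlined as a pipeline (a hypothetical sketch; the stage functions are placeholders for the paper's actual training procedures):

```python
def three_step_pipeline(train, freeze, exploit_train, self_play):
    """Hypothetical outline of the paper's three experimental stages."""
    # Step 1: training against a diverse pool induces in-context best response
    icl_agent = train(co_players="diverse tabular pool")
    # Step 2: freeze the learner and train an exploiter against it
    fixed_icl = freeze(icl_agent)
    exploiter = exploit_train(opponent=fixed_icl)
    # Step 3: two exploiter-initialized agents shape each other toward cooperation
    return self_play(exploiter, exploiter)
```

The causal chain is thus diversity → in-context best response → exploitability → mutual shaping → cooperation.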
1.4 Key Conclusions
Findings and implications:
- In-context learning acts as the "fast timescale" naive learner, eliminating the need for explicit meta/inner optimization loops.
- A mixed training pool is critical; without diversity, the in-context best-response mechanism degenerates.
- Exploitation vulnerability as a driver of cooperation reveals a new mechanism for how cooperation emerges in social dilemmas.
Theoretical contribution: The paper proposes the Predictive Policy Improvement (PPI) algorithm and proves that under perfect world model assumptions, predictive equilibrium corresponds to Subjective Embedded Equilibrium.
AlphaEvolve: Automatically Discovering Multi-Agent Learning Algorithms
2.1 Background
Algorithm design in Multi-Agent Reinforcement Learning has long relied on manual iterative optimization. While foundational methods such as CFR (Counterfactual Regret Minimization) and PSRO (Policy-Space Response Oracles) have solid theoretical bases, their most effective variants often depend on human intuition to navigate a vast algorithm design space.
This paper proposes using AlphaEvolve—an evolutionary coding agent powered by large language models—to automatically discover new multi-agent learning algorithms.
2.2 Method Framework: AlphaEvolve
AlphaEvolve combines the code generation capabilities of LLMs with the rigorous selection pressure of evolutionary algorithms:
Loop:
1. Select parent algorithms based on fitness.
2. Use LLM (Gemini 2.5 Pro) to propose semantically meaningful code modifications.
3. Automatically evaluate candidate algorithms on proxy games.
4. Add valid candidates to the population.
2.3 Discovery One: VAD-CFR (Volatility-Adaptive Discounted CFR)
In the CFR domain, AlphaEvolve discovered Volatility-Adaptive Discounted (VAD-)CFR, which includes three non-intuitive mechanisms:
Mechanism 1: Volatility-adaptive discounting: Dynamically adjusts discount parameters based on instantaneous regret magnitude via EWMA. Traditional methods use fixed discount factors.
Mechanism 2: Asymmetric instantaneous boosting: Scales positive instantaneous regret by a factor of 1.1 before accumulation. Traditional methods treat positive and negative regret symmetrically.
Mechanism 3: Hard warm-start + regret magnitude weighting: Delays strategy averaging until iteration 500 and weights by regret magnitude. Standard CFR starts linear averaging from t=1.
Key code structure (simplified, filled out here into runnable form; the schedule behind disc_pos/disc_neg is not given in the summary, so it is left as parameters):

```python
class RegretAccumulator:
    """Volatility-Adaptive Discounting & Asymmetric Boosting."""

    def __init__(self):
        self.ewma = 0.0  # EWMA of instantaneous regret magnitude

    def update_accumulate_regret(self, info_state_node, iteration_number,
                                 cfr_regrets, disc_pos, disc_neg):
        # 1. Compute volatility signal via EWMA of the largest instantaneous regret
        inst_mag = max(abs(r) for r in cfr_regrets.values())
        self.ewma = 0.1 * inst_mag + 0.9 * self.ewma
        volatility = min(1.0, self.ewma / 2.0)  # drives the adaptive discount parameters
        for action, r in cfr_regrets.items():
            # 2. Asymmetric boosting: only positive instantaneous regret is scaled by 1.1
            r_boosted = r * 1.1 if r > 0 else r
            # 3. Sign-dependent discount on the accumulated historical regret
            prev_R = info_state_node.cumulative_regret.get(action, 0.0)
            discount = disc_pos if prev_R >= 0 else disc_neg
            info_state_node.cumulative_regret[action] = discount * prev_R + r_boosted
        return volatility
```
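Mechanism 3 can likewise be sketched as a strategy-averaging weight (the function and exact weighting form are illustrative assumptions; the summary only states the warm-start threshold and regret-magnitude weighting):

```python
def averaging_weight(iteration_number, regret_magnitude, warm_start=500):
    """Weight of the current strategy iterate in the running average."""
    if iteration_number < warm_start:
        return 0.0  # hard warm-start: early, noisy strategies are ignored entirely
    return regret_magnitude  # then each iterate is weighted by its regret magnitude

# Contrast: standard CFR uses linear averaging (weight t) starting from t = 1.
```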
2.4 Discovery Two: SHOR-PSRO (Smoothed Hybrid Optimistic Regret PSRO)
In the PSRO domain, AlphaEvolve discovered Smoothed Hybrid Optimistic Regret (SHOR-)PSRO, whose core innovation is:
Hybrid meta-solver architecture:
- Optimistic Regret Matching (ORM): Provides stability.
- Smoothed Best Pure Strategy (Softmax): Actively biases towards high-reward modes via temperature-controlled softmax.
- Dynamic annealing schedule: Mixing factor λ anneals from 0.3 to 0.05, diversity reward decays from 0.05 to 0.001.
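The hybrid combination can be sketched as a convex mixture (a minimal illustration; the actual ORM update and the temperature schedule are not specified in this summary):

```python
import math

def hybrid_meta_strategy(orm_strategy, payoffs, lam, temperature=1.0):
    """Mix an optimistic-regret-matching strategy with a smoothed best pure strategy."""
    # Temperature-controlled softmax biases probability mass toward high-reward modes
    mx = max(payoffs)
    exps = [math.exp((p - mx) / temperature) for p in payoffs]
    z = sum(exps)
    softmax = [e / z for e in exps]
    # Convex mixture: lam is the (annealed) mixing factor
    return [(1 - lam) * s + lam * b for s, b in zip(orm_strategy, softmax)]
```

Low temperature sharpens the softmax toward the best pure strategy; the annealed λ then controls how strongly that bias overrides the stable ORM component.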
Asymmetric design for training and evaluation:
- Mixing factor λ: Training 0.3 → 0.05 (annealing), Evaluation fixed 0.01.
- Diversity reward: Training 0.05 → 0.001 (decay), Evaluation 0.0.
- Return strategy: Training average strategy, Evaluation last iteration strategy.
- Internal iterations: Training 1000 + 20×(population size-1), Evaluation 8000 + 50×(population size-1).
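The training-time schedules listed above can be written out directly (linear interpolation over training progress is my assumption; the summary gives only the endpoints):

```python
def anneal(start, end, progress):
    """Linear schedule; progress runs from 0.0 to 1.0 over training."""
    return start + (end - start) * progress

def shor_psro_schedules(progress, population_size):
    """Training-time hyperparameters; evaluation instead fixes lambda=0.01, diversity=0."""
    return {
        "lambda": anneal(0.3, 0.05, progress),            # mixing factor
        "diversity_reward": anneal(0.05, 0.001, progress),
        "train_iters": 1000 + 20 * (population_size - 1),
        "eval_iters": 8000 + 50 * (population_size - 1),
    }
```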
2.5 Full game test results
Summary of the Two Papers
Comparison across key dimensions:
- Core problem: Paper 1 asks how cooperation emerges naturally; Paper 2 asks how effective algorithms can be discovered automatically.
- Key insight: Paper 1 shows in-context learning can substitute for explicit meta-learning; Paper 2 shows LLMs can evolve non-intuitive symbolic algorithms.
- Methodology: Paper 1 uses decentralized MARL with diversity-based training; Paper 2 uses an evolutionary algorithm with LLM code generation.
- Validation environments: Paper 1 uses the Iterated Prisoner's Dilemma; Paper 2 uses Kuhn Poker, Leduc Poker, Goofspiel, and Liar's Dice.
- Practical significance: Paper 1 provides a scalable path for foundation-model multi-agent systems; Paper 2 shifts algorithm design from manual tuning toward automated discovery.
Paper links:
Discovering Multiagent Learning Algorithms with Large Language Models
Multi-agent cooperation through in-context co-player inference
Recommended Reading
Hands-on Design of AI Agents: (Orchestration, Memory, Plugins, Workflow, Collaboration)
Sharing Two Latest Papers on Claude Skills, with 3 Core Conclusions
2026, New Trend: World Model × Embodied Intelligence Latest Survey
2026, To Do Agentic AI, You Cannot Avoid These Two Opening Reviews
One paper a day to keep our thinking sharp~ If you've read this far, how about a like 👍, a favorite ❤️, and a share ↗️, plus a star ⭐ so you don't lose us?