What to do with poor pre-training data? Bengio's team introduces explicit Bayesian inference for gradient-free In-Context RL

Simply lengthening the context does not automatically elicit reinforcement learning capabilities; introducing explicit Bayesian inference is the key to breaking the deadlock.

In the current wave of In-Context RL research, there is a common assumption that as long as the Transformer is made larger and the context window longer, the model will "suddenly grasp" the optimal policy, as in AD (Algorithm Distillation) or DPT (Decision-Pretrained Transformer).

However, experimental results show that existing In-Context RL methods have a significant limitation: they are essentially closer to conditional behavior cloning.

If you feed the model expert data, it imitates very well. But if the context is filled with suboptimal or even random trajectories (the norm in real applications), the model tends to fit these suboptimal behaviors, inheriting the behavior policy's bias, and it struggles to surpass the demonstrator.

Recently, Yoshua Bengio's team at Mila released a new work, SPICE. Rather than brute-forcing model scale, it elegantly combines Deep Ensembles, Bayesian inference, and the Transformer.

The core insight of SPICE is to treat the pre-trained model not merely as an action predictor, but as a tool that provides a "value prior".

At Test-time, this prior is fused with contextual evidence through an explicit Bayesian formula, and decisions are made using the UCB (Upper Confidence Bound) algorithm.

Even with extremely poor pre-training data, SPICE provably enjoys a logarithmic regret bound, and in experiments it significantly outperforms baseline models such as DPT.


Paper Title: In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior

Paper Link: https://arxiv.org/pdf/2601.03015


When In-Context RL encounters "Bad Data"

The current In-Context RL paradigm (such as DPT) usually performs supervised learning on a large number of offline trajectories, with the goal of fitting the action labels in the data via maximum likelihood.

This brings a tricky problem: Behavior-Policy Bias.

If the generation policy of the pre-training data is suboptimal, or carries strong noise, the model trained based on MLE (Maximum Likelihood Estimation) will inherit this bias.

In the inference stage, if the historical data in the Context is also suboptimal, the model can hardly infer the optimal solution through its inductive bias alone, and a plain Attention mechanism can hardly generate exploration behavior out of thin air.

To achieve true reinforcement learning at Test-time (i.e., be able to explore and improve the strategy), we need two key elements, which are exactly what the existing Transformer architecture lacks:

1. Explicit estimation of Q values: not just predicting action probabilities;

2. Uncertainty quantification: knowing where you don't know, thereby driving exploration.


Methodology: Prior, Evidence and Fusion

SPICE stands for Shaping Policies In-Context with Ensemble prior. Its architecture is not complicated; the core lies in how it handles the relationship between pre-trained knowledge and the current context.


Figure 1. Overview of SPICE training and inference. The left side is the training phase, learning the Value Ensemble; the right side is the inference phase, extracting evidence through Kernel and performing Bayesian fusion.

SPICE's workflow can be decomposed into three steps:

1. Training Phase: Learning Value Priors

SPICE still uses a Causal Transformer as the backbone network, but its output head is no longer a simple Policy Head; instead, it attaches N Value Heads (an ensemble).

For a given Query state s, these N heads will output N Q value estimates. We use these estimates to construct a Gaussian distribution as a Prior:

Q(s, a) ~ N(μ₀(s, a), σ₀²(s, a)),  μ₀ = (1/N) Σᵢ Qᵢ(s, a),  σ₀² = (1/N) Σᵢ (Qᵢ(s, a) − μ₀)²

Here, μ₀ is the prior mean, and σ₀² is the prior epistemic uncertainty.
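As a concrete illustration, the prior construction amounts to taking the sample mean and variance over the ensemble heads. A minimal sketch (not the authors' code), where `q_heads` stands in for the outputs of the N Value Heads for one query state s:

```python
import numpy as np

# Minimal sketch of building the Gaussian value prior from a Deep Ensemble.
# q_heads: simulated outputs of N Value Heads, shape (N, num_actions).
rng = np.random.default_rng(0)
N, num_actions = 8, 4
q_heads = rng.normal(loc=1.0, scale=0.3, size=(N, num_actions))

mu_0 = q_heads.mean(axis=0)        # prior mean mu_0(s, a), one per action
sigma2_0 = q_heads.var(axis=0)     # prior epistemic variance sigma_0^2(s, a)
```

Disagreement among the heads (large ensemble variance) directly becomes a large prior uncertainty, which later drives exploration.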

Key details: Weighted representation shaping and Bayesian contraction

To make the Transformer's latent space better suited to value-function estimation, the authors designed a carefully weighted auxiliary Policy Loss:

L_policy = −E[ w · log π_θ(a | s) ]

The weight w here is the product of three factors, corresponding to importance sampling, advantage weighting, and epistemic uncertainty weighting respectively:

w = w_IS · w_Adv · w_Ep

w_IS (Importance Sampling): corrects the behavior-policy bias.

w_Adv (Advantage): gives higher weight to high-advantage samples, focusing the model on "good" actions.

w_Ep (Epistemic): gives higher weight to regions of high uncertainty (large ensemble variance), forcing the model to learn where it is unsure.
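The three factors can be sketched as follows. Note that the exact functional forms here (probability ratio, exponentiated advantage, ensemble standard deviation) are illustrative assumptions matching the descriptions above, not the paper's precise definitions:

```python
import numpy as np

# Hedged sketch of the three-factor weight w; forms are assumptions.
def sample_weight(pi_target, pi_behavior, advantage, ensemble_std, temp=1.0):
    w_is = pi_target / max(pi_behavior, 1e-8)  # importance-sampling correction
    w_adv = float(np.exp(advantage / temp))    # up-weight high-advantage actions
    w_ep = ensemble_std                        # up-weight uncertain regions
    return w_is * w_adv * w_ep
```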

In addition, to ensure that the Value Head's output distribution is well calibrated, the authors introduce a Bayesian contraction loss when training the Value Ensemble. It constrains the predicted values to shrink toward the posterior mean during training, laying the groundwork for the Bayesian update at test time.

2. Inference Phase: Extracting Contextual Evidence

At Test-time, SPICE requires no gradient updates. Faced with a Context (historical interaction trajectory), SPICE regards it as evidence.

Since the states s′ in the Context may differ from the current Query state s, their statistics cannot be used directly.

SPICE uses the Latent Feature z extracted by the Transformer (rather than the raw state s, because z carries the structured information shaped by the Policy Loss) and computes similarity weights through a Kernel function (such as an RBF kernel):

kᵢ = K(z, zᵢ) = exp(−‖z − zᵢ‖² / (2h²))

Using these weights, we can compute the "weighted count" n′(a) and the "weighted average target value" μ′(a) for each action a near the current state:

n′(a) = Σ_{i: aᵢ = a} kᵢ,   μ′(a) = (1 / n′(a)) Σ_{i: aᵢ = a} kᵢ · yᵢ

Here y' can be a single-step Reward (Bandit setting) or n-step TD Target (MDP setting).
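The evidence-extraction step above can be sketched as follows, assuming `z` is the query latent, `z_ctx` the latents of the context transitions, `a_ctx` their actions, `y_ctx` their targets (a reward or n-step TD target), and an RBF kernel with bandwidth `h`:

```python
import numpy as np

# Sketch of extracting contextual evidence via kernel-weighted statistics.
def context_evidence(z, z_ctx, a_ctx, y_ctx, num_actions, h=1.0):
    # RBF similarity between the query latent and each context latent
    k = np.exp(-np.sum((z_ctx - z) ** 2, axis=1) / (2 * h ** 2))
    n_prime = np.zeros(num_actions)   # weighted count n'(a)
    mu_prime = np.zeros(num_actions)  # weighted mean target mu'(a)
    for a in range(num_actions):
        mask = a_ctx == a
        if mask.any():
            n_prime[a] = k[mask].sum()
            mu_prime[a] = (k[mask] * y_ctx[mask]).sum() / n_prime[a]
    return n_prime, mu_prime
```

Transitions whose latents are far from the query contribute almost nothing, so the statistics are effectively local to the current state.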

3. Bayesian Fusion and Decision Making

The key to SPICE's breakthrough is that it assumes Q values follow a Gaussian distribution and uses Normal-Normal Conjugacy to obtain the posterior distribution of Q values in closed form.

The precision of the posterior (Precision, i.e., the reciprocal of variance) is equal to the sum of prior precision and data precision:

τ_post = τ₀ + τ_D,  where τ₀ = 1/σ₀² is the prior precision and τ_D = n′(a)/σ_y² is the data precision (σ_y² is the observation-noise variance)

The posterior mean is a weighted combination of the prior mean and the data mean:

μ_post = λ · μ₀ + (1 − λ) · μ′(a)

where λ = τ₀ / (τ₀ + τ_D).


Figure 2. Detailed architecture of SPICE: showing the complete chain from Latent Feature to Prior Ensemble, and then combining Kernel Evidence to generate Posterior.

After obtaining the posterior distribution Q(a|s), SPICE uses the Posterior-UCB strategy for exploration during online interaction:

a = argmax_a [ μ_post(a) + β · σ_post(a) ]

This formula intuitively explains SPICE's behavior:

If there is no relevant data in the Context (n′ = 0), the posterior falls back to the prior, and the model relies on pre-trained knowledge.

If the Context evidence is sufficient, the posterior variance σ²_post shrinks rapidly and the mean is corrected toward the actually observed values, shedding the pre-training bias.

The β · σ_post term ensures continued exploration of uncertain actions.
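The fusion and decision rule fit in a few lines. A sketch under stated assumptions: `sigma2_y` (observation-noise variance) and `beta` are assumed hyperparameters; the other quantities follow the formulas above:

```python
import numpy as np

# Closed-form Normal-Normal fusion followed by the Posterior-UCB action choice.
def posterior_ucb(mu_0, sigma2_0, n_prime, mu_prime, sigma2_y=1.0, beta=2.0):
    tau_0 = 1.0 / sigma2_0               # prior precision
    tau_d = n_prime / sigma2_y           # data (evidence) precision
    lam = tau_0 / (tau_0 + tau_d)        # lambda = tau_0 / (tau_0 + tau_D)
    mu_post = lam * mu_0 + (1.0 - lam) * mu_prime
    sigma2_post = 1.0 / (tau_0 + tau_d)  # posterior variance
    return int(np.argmax(mu_post + beta * np.sqrt(sigma2_post)))
```

With an empty context (n′ = 0 for every action), lam equals 1 and the rule reduces to UCB on the prior alone, which is exactly the fallback behavior described above.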


Theoretical Guarantee

For researchers focusing on theory, SPICE provides a very rigorous conclusion.

The paper proves that in Bandits and Finite-Horizon MDPs, SPICE's Regret Bound satisfies:

Regret(T) ≤ O(√T) + C(ε₀)

Note the two terms on the right:

The first term is the standard O(√T) regret bound, which means SPICE has the same optimal asymptotic convergence rate as the classic UCB algorithm.

The second term is a constant (warm-start) term, which depends on the quality ε₀ of the pre-training prior.

This means that even if the pre-trained model (the prior) is heavily biased, it only adds a constant amount of regret; it will not cause regret to grow linearly with time, as happens with DPT.

As long as there is interaction during testing, SPICE will eventually converge to the optimal policy.


Experimental Results: Significantly Superior to DPT

To verify adaptability under "bad data", the authors designed deliberately harsh experimental conditions in Bandit and Darkroom (2D navigation) environments.

In particular, in the Darkroom experiment the pre-training labels use the "Weak-last" setting: the labels are not optimal actions but the last-step actions of random-policy trajectories. This is essentially heavily noisy, suboptimal data.

1. Bandit Experiments: Reject Linear Regret


Figure 3. Bandit Performance Evaluation. SPICE achieves the lowest cumulative regret in online settings, while DPT exhibits linearly growing regret.

As the figure shows, in online settings SPICE achieves the lowest cumulative regret among all learning-based methods, closely tracking the performance of the classic UCB algorithm.

In contrast, DPT's final regret is two orders of magnitude higher than SPICE's, indicating that DPT fails to adapt from weak log data, while SPICE genuinely achieves In-Context policy improvement.

2. Robustness: Fearless of Noise


Figure 4. Robustness to reward noise. As the test-time noise sigma increases, DPT's Regret remains high, while SPICE remains stable.

The results above show that as reward noise in the test environment increases, the performance of SPICE, Thompson Sampling, and UCB changes only marginally in absolute terms, maintaining good robustness.

DPT's final regret, on the other hand, remains high and is almost insensitive to the noise level, further confirming its failure to adapt when trained on suboptimal data.

3. MDP Experiments: A Qualitative Leap from Zero to One


Figure 5. Darkroom (MDP) experiment results. In the extreme case of only "Weak-last" labels, DPT's returns are almost zero, while SPICE can quickly learn and obtain high returns.

In sequential decision-making tasks like Darkroom, experimental results show that SPICE can quickly adapt to the environment and obtain high returns, and its regret curve flattens rapidly after a brief warm-up.

In comparison, DPT and AD-BC exhibit near-linear regret growth in this weak supervision setting, with returns almost zero.

This indicates that, when facing "bad data", methods lacking uncertainty quantification struggle to escape the trap of imitating suboptimal behaviors.


Conclusion

The SPICE paper does not blindly stack Transformer parameters; instead, it returns to the essence of reinforcement learning: value estimation and uncertainty quantification.

By introducing Deep Ensemble and Bayesian fusion, it cleverly solved two core difficulties in In-Context RL:

1. How to utilize suboptimal data? Treat it as a prior, not as the truth.

2. How to implement test-time exploration? Use posterior uncertainty to drive UCB, rather than simple imitation.

The pseudocode of Algorithm 1 is also very concise, making SPICE well suited as a baseline for follow-up research.


Figure 6. SPICE algorithm pseudocode. Clearly shows how to combine Transformer encoding with closed-form Bayesian updates.

Of course, SPICE has its limitations. It currently relies on a Kernel function to measure state similarity; in high-dimensional or partially observable (POMDP) environments, designing a good Kernel remains challenging.


AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.