By Yu Yang from Aofeisi | QbitAI
ByteDance Seed's latest research enables large models to "modify parameters in-place."
It requires no changes to the model architecture, no retraining, and runs remarkably fast.
Here is the background: in the era of AI agents, the tasks facing models are becoming increasingly complex, and context windows keep growing longer.
How to enable large models to learn while working, continuously adapting to new information without collapsing amidst ultra-long contexts, has become a major focus of AI research.
Test-Time Training (TTT) allows models to update part of their parameters during inference, but in practice it still faces several problems:
First, architectural incompatibility. Existing TTT methods require introducing entirely new network layers or even replacing attention mechanisms, necessitating pre-training from scratch.
Second, low computational efficiency. Current TTT approaches adopt a token-by-token sequential update method, failing to fully leverage the parallel computing capabilities of GPUs/TPUs.
Third, mismatched optimization objectives. Existing TTT methods mostly use reconstruction objectives, so the model merely "remembers the current token" instead of learning to predict the next one — a mismatch with the core task of language models: next-token prediction.
Addressing these issues, a research team from ByteDance Seed and Peking University came up with a clever solution:
Instead of adding new layers or altering the architecture, they directly utilize the MLP (Multi-Layer Perceptron) modules already present in the Transformer as the large model's "temporary cerebellum."
This scheme, named In-Place TTT (In-Place Test-Time Training), allows TTT to function as a plug-and-play module, seamlessly integrating into existing pre-trained large models.
Experiments show that Qwen3-4B, Llama3.1-8B, and Qwen3-14B all became stronger "in-place" after being equipped with In-Place TTT, with particularly significant improvements on long-text tasks.
This paper has been accepted as an Oral presentation at ICLR 2026.
Enabling Large Models to "Modify Parameters In-Place" During Inference
Without further ado, let's examine the detailed content of the paper.
The core problem In-Place TTT aims to solve is enabling large models to quietly update themselves and adapt to the current context during inference and question answering, without fussing with the model architecture.
To achieve plug-and-play capability, researchers from ByteDance Seed and Peking University primarily made three innovations.
In-Place Architecture Design
In In-Place TTT, researchers cleverly reuse the ubiquitous MLP (Multi-Layer Perceptron) in Transformers.
They designate the MLP's final down-projection matrix W_down as "fast weights," which are updated in-place during inference.
This eliminates the need to introduce new dedicated layers to handle fast weights. Existing pre-trained large models can be used directly without retraining.
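To make the "fast weights" idea concrete, here is a minimal NumPy sketch of updating only the down-projection W_down with a gradient step during inference while everything else stays frozen. The dimensions, learning rate, and squared-error objective are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, block = 8, 32, 16

# Slow weights (frozen pre-trained parameters) vs. fast weights
# (the MLP down-projection, updated in-place at test time).
W_up = rng.normal(0, 0.1, (d_model, d_ff))    # frozen
W_down = rng.normal(0, 0.1, (d_ff, d_model))  # fast weights

def mlp(x, W_down):
    h = np.maximum(x @ W_up, 0.0)             # ReLU activation
    return h, h @ W_down

def ttt_step(x, target, W_down, lr=0.1):
    """One in-place gradient step on W_down over a block of tokens."""
    h, y = mlp(x, W_down)
    err = y - target                          # dL/dy for 0.5*||y - target||^2
    W_down -= lr * (h.T @ err) / len(x)       # in-place update, W_up untouched
    return 0.5 * np.mean(np.sum(err**2, axis=-1))

x = rng.normal(0, 1, (block, d_model))
target = rng.normal(0, 1, (block, d_model))   # stand-in TTT target

losses = [ttt_step(x, target, W_down) for _ in range(5)]
print(losses[0] > losses[-1])                 # loss shrinks as fast weights adapt
```

Because W_down already exists in every pre-trained Transformer, this kind of update touches no new parameters — which is exactly why the method can be plug-and-play.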
Optimization Objective Aligned with Language Models
As mentioned earlier, original TTT methods only made the model "remember the current token," which is inconsistent with the optimization objectives of language models.
To address this, In-Place TTT designs an optimization objective specifically for autoregressive language models:
By introducing a one-dimensional convolution (Conv1D) and a projection matrix, the TTT's target value incorporates information from future tokens, thereby explicitly aligning with the task of "predicting the next token."
Researchers also analyzed and proved that this approach encourages fast weights to compress information useful for future predictions, effectively enhancing the model's in-context learning capabilities.
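A hedged sketch of what a "future-aware" TTT target could look like: rather than reconstructing the current token, blend a short window of future hidden states with a 1-D convolution and a projection, so that fitting the target pushes the fast weights toward information useful for next-token prediction. The window size, kernel, and projection below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, k = 10, 4, 3                    # k = conv window over future tokens

h = rng.normal(0, 1, (seq, d_model))          # per-token hidden states
kernel = rng.normal(0, 1, k)                  # Conv1D weights (shared across dims)
W_proj = rng.normal(0, 0.5, (d_model, d_model))

def future_targets(h, kernel, W_proj):
    """target[t] = proj( sum_j kernel[j] * h[t+1+j] ), zero-padded past the end."""
    seq, d = h.shape
    padded = np.vstack([h, np.zeros((len(kernel), d))])
    mixed = np.stack([
        sum(kernel[j] * padded[t + 1 + j] for j in range(len(kernel)))
        for t in range(seq)
    ])
    return mixed @ W_proj

tgt = future_targets(h, kernel, W_proj)
print(tgt.shape)   # one target per position; target[t] depends only on h[t+1:t+1+k]
```

The key property is that the target at position t is built exclusively from tokens after t, which is what aligns the test-time objective with next-token prediction.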
Efficient Block-Level Update Mechanism
In-Place TTT modifies the MLP while retaining the original attention layers, enabling block-level updates instead of processing token-by-token.
Combined with context-parallelism techniques, In-Place TTT achieves higher throughput and computational efficiency, supporting longer contexts.
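The block-level idea can be sketched as follows: split the long context into fixed-size chunks, process all tokens in a chunk in parallel with the current fast weights, and update W_down once per chunk rather than once per token. Chunk size and the toy reconstruction loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff, chunk = 8, 16, 32
W_up = rng.normal(0, 0.1, (d_model, d_ff))    # frozen
W_down = rng.normal(0, 0.1, (d_ff, d_model))  # fast weights

def process_block(x, W_down, lr=0.05):
    h = np.maximum(x @ W_up, 0.0)
    y = h @ W_down                            # whole chunk in one matmul (parallel)
    err = y - x                               # toy objective: reconstruct the input
    W_down -= lr * (h.T @ err) / len(x)       # ONE update per chunk, not per token
    return y

context = rng.normal(0, 1, (4 * chunk, d_model))
outputs = [process_block(context[i:i + chunk], W_down)
           for i in range(0, len(context), chunk)]
print(len(outputs))   # 4 chunks -> 4 fast-weight updates for 128 tokens
```

Replacing 128 sequential per-token updates with 4 chunked ones is what lets the method exploit GPU/TPU parallelism instead of serializing the forward pass.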
Experiments show that In-Place TTT can significantly improve the performance of existing models (such as Qwen3-4B) in long-context tasks of 128K or even 256K tokens.
In comparisons involving training from scratch, it also outperforms other TTT methods.
Research Team
The first authors of the In-Place TTT paper are Guhao Feng and Shengjie Luo.
Guhao Feng is currently a student at Peking University and an intern at ByteDance Seed.
Shengjie Luo also graduated from Peking University, where he was advised by Professor Liwei Wang and by Professor Di He, a corresponding author of this paper.
Another corresponding author of this paper is Wenhao Huang from ByteDance Seed.
Paper URL: https://arxiv.org/abs/2604.06169v1
— End —