In recent years, Vision-Language Models (VLMs) have made rapid progress in multimodal understanding and reasoning tasks. From mathematical reasoning and scientific question answering to complex visual understanding and cross-modal inference, mainstream models typically improve overall capabilities by continuously increasing scale and deepening layers. Under this development path, a premise that is almost tacitly accepted is that every layer in the model is meaningful, and together they constitute an indispensable component of the model's capabilities.
However, in the practice of specific tasks, whether this assumption always holds has not been systematically verified. Professor Yang Shuo's team from Harbin Institute of Technology (Shenzhen) noticed that in certain downstream tasks, the failure modes exhibited by the model do not appear to be due to insufficient capability, but rather seem to be constrained and interfered with by internal computation paths. This observation leads to a seemingly simple yet extremely challenging question: In pretrained vision-language models, are there certain layers that do not play a positive role in specific tasks, and may even systematically suppress model performance?
Centered on this question, Yang Shuo's team at HIT (Shenzhen) discovered the counter-intuitive phenomenon of "Task-Interfering Layers" in vision-language models, and building on this finding proposed TaLo, a training-free, test-time layer intervention method that unlocks the model's latent capabilities on specific tasks.
Paper: https://arxiv.org/abs/2602.00500
Code: https://mikuz12.github.io/Do_All_Individual_Layers_Help
Authors: Zhiming Liu, Yujie Wei, Lei Feng, Xiu Su, Xiaobo Xia, Weili Guan, Zeke Xie, Shuo Yang
Institutions: Harbin Institute of Technology (Shenzhen), Harbin Institute of Technology, Southeast University, Central South University, National University of Singapore, Hong Kong University of Science and Technology (Guangzhou)
Background and Motivation
In the design of mainstream vision-language models, different downstream tasks typically share the same fixed hierarchical computational structure, with inference defaulting to executing all Transformer layers completely. This unified computational path is engineering-wise simple and efficient, but also means the model cannot perform targeted adjustments to intermediate computation processes based on task differences.
From existing research, analyses of model hierarchical structure mostly focus on "layer importance" or "layer-wise degradation sensitivity," with conclusions often showing a monotonic decline in overall performance as layers are removed. However, such analyses rarely focus on a more fine-grained question: Under specific task conditions, could certain intermediate layers introduce information routing inconsistent with task objectives, thereby affecting final decisions?
Based on this motivation, Professor Yang Shuo's team at HIT (Shenzhen) approached the problem from the perspective of layer intervention, probing the model's internal information routing paths layer by layer. The goal was to characterize how different tasks actually depend on intermediate-layer computation, and to provide a basis for subsequent structured analysis and test-time adjustment.
Figure 1: Performance comparison across multiple task metrics after intervening on specific model layers in different vision-language models. Each subplot corresponds to Qwen2-VL-2B, LLaVA-NEXT-8B, and InternVL-40B respectively. Different colors represent results from intervening on different layers, while dashed lines indicate the original model. It can be observed that across different models, selecting appropriate layers for intervention can simultaneously outperform the baseline on multiple tasks, indicating this phenomenon has cross-model consistency.
Discovery of Task-Interfering Layers Phenomenon
To quantify the impact of a single layer on a specific task, the authors adopted a layer-intervention experimental paradigm: intervene on each layer individually and compare task performance before and after the intervention. If performance increases after a layer is intervened on, that layer exhibits an "interference effect" on the task.
In implementation, the paper focuses on intervening on the self-attention submodule in the LLM backbone, while preserving residual connections to avoid overall model collapse. Two typical intervention forms are:
Parameter Zeroing: Setting the parameters of the specific layer attention to zero, causing its attention path to approximately fail (while the residual path is preserved).
Uniform Scaling: Reducing the attention operation to a global average of input features (used as another intervention method to cross-verify with zeroing).
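To make the two intervention forms concrete, here is a minimal toy sketch (not the authors' code) of a residual block `hidden = x + attn(x)`, where the attention path is replaced either by zeros (parameter zeroing) or by a global average of the input (uniform scaling). The function names and the 1-D list representation are illustrative assumptions.

```python
# Toy residual block: the residual path is always preserved, so zeroing the
# attention submodule disables only the attention path, not the whole layer.

def attn_identityish(x):
    # Stand-in for a learned self-attention output; here just a toy transform.
    return [0.5 * v for v in x]

def block(x, attn_fn):
    # Residual connection: output = input + attention(input).
    return [xi + ai for xi, ai in zip(x, attn_fn(x))]

def zeroed_attn(x):
    # Parameter zeroing: the attention path contributes nothing, so the
    # block reduces to the identity (residual) path.
    return [0.0 for _ in x]

def uniform_attn(x):
    # Uniform scaling: attention collapses to a global average of the input,
    # i.e. every position attends to all positions with equal weight.
    m = sum(x) / len(x)
    return [m for _ in x]

x = [1.0, 2.0, 3.0]
print(block(x, attn_identityish))  # normal layer: [1.5, 3.0, 4.5]
print(block(x, zeroed_attn))       # zeroing passes x through: [1.0, 2.0, 3.0]
print(block(x, uniform_attn))      # residual + mean(x)=2.0: [3.0, 4.0, 5.0]
```

With the residual preserved, zeroing turns the layer into a near-identity map rather than collapsing the whole model, which is why the intervention is safe to apply one layer at a time.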
Across multiple models and benchmarks, researchers observed significant gains in many tasks after "skipping a certain layer," indicating that such layers are not "non-contributing," but rather actively limit the model's potential performance on specific tasks. The authors named these Task-Interfering Layers.
Experimental results show this phenomenon is not accidental. In models of different scales and architectures such as LLaVA, Qwen-VL, and InternVL, numerous tasks showed significant performance improvements after "skipping" a specific layer.
Taking LLaVA-Next-8B as an example, after layer-by-layer interventions on over 30 tasks, more than half of the tasks showed performance improvements exceeding 5% when at least one layer was intervened on. This suggests that the layer representations formed during pretraining are not necessarily suitable for every downstream task.
Figure 2: Heatmap of performance changes relative to pretraining baseline for each task category after applying interventions layer-by-layer in the LLaVA-NEXT model. The horizontal axis represents the intervened layer number, the vertical axis represents task categories, and colors represent performance change magnitudes. It can be observed that different tasks show distinct sensitivity patterns to layers, and for most tasks there exist a few layers whose intervention can bring significant performance improvements, further indicating the existence of task-related interfering layers in the model.
Structured Relationship Between Tasks and Layers
After confirming the widespread existence of task-interfering layers, the research focus shifted to a more fundamental question: Do these interfering layers appear randomly, or are they internally correlated with the capability requirements of the tasks themselves? To answer this, the research team proposed the "Task-Layer Interaction Vector," which turns each task's sensitivity to every model layer into a computable, comparable representation.
For an L-layer model and a given task T, the interaction vector v_T ∈ R^L is defined component-wise: its l-th entry is the performance change after intervening on layer l, i.e., v_T[l] = Acc(T, model with layer l intervened) − Acc(T, original model).
Intuitively, if v_T[l] is positive, intervening on layer l improves accuracy, so that layer exhibits "interference" for the task; if negative, the layer contributes positively to the task.
With this representation, the authors use correlations between task interaction vectors to characterize whether "tasks with similar capabilities will show similar layer sensitivity patterns." For different tasks, the researchers calculate the correlation coefficient and perform clustering and visualization based on distance.
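As a hedged illustration of this representation (with invented accuracy numbers, not results from the paper), interaction vectors and their pairwise correlations can be computed as follows:

```python
# Build Task-Layer Interaction Vectors from per-layer intervention scores and
# compare two tasks with Pearson correlation. All accuracies are made up.

def interaction_vector(baseline, intervened_scores):
    # v[l] = accuracy after intervening on layer l minus the baseline accuracy.
    return [s - baseline for s in intervened_scores]

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Two toy "math-like" tasks whose layer sensitivities move together,
# and one "perception-like" task with the opposite pattern.
math_a = interaction_vector(0.60, [0.58, 0.66, 0.59, 0.70])
math_b = interaction_vector(0.55, [0.54, 0.62, 0.53, 0.63])
percep = interaction_vector(0.70, [0.74, 0.66, 0.73, 0.65])

print(round(pearson(math_a, math_b), 2))  # strongly positive: similar tasks
print(round(pearson(math_a, percep), 2))  # negative: dissimilar tasks
```

Clustering tasks by the distance between these vectors is then a standard step; the structural finding is that tasks with similar capability demands end up in the same cluster.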
The research team conducted clustering experiments on 6 comprehensive datasets covering over 100 tasks. Mathematical reasoning, scientific reasoning, perception, and other related tasks naturally formed distinct clusters, indicating a structural correspondence between the interfering-layer phenomenon and task capability requirements: tasks that place similar demands on model capabilities have similar Task-Layer Interaction Vectors, so the phenomenon is not accidental fluctuation.
Furthermore, the authors validated the stability and reliability of the clustering through metrics, with results showing good reliability. (Specific clustering results can be found in the paper appendix or project homepage.)
Figure 3: Visualization of clustering results based on Task-Layer Interaction Vectors
Additionally, to exclude "artifacts caused by intervention implementation details," the paper also compared the consistency between zeroing and uniform scaling interventions across a large number of task-layer pairs, showing significant positive correlation. This suggests that task-interfering layers are more likely to be stable internal properties of the model rather than products of specific intervention methods.
Figure 4: Correlation plot of different intervention methods
TaLo: A Test-Time Task-Adaptive Mechanism
Based on the above discovery, the research team further proposes TaLo (Task-Adaptive Layer Knockout), a task-adaptive method that operates during the testing phase, serving as an "operational verification" of the task-interfering layer phenomenon: if certain tasks are indeed hindered by specific layers, then automatically identifying and skipping these layers during testing should bring stable benefits. TaLo is designed to be training-free and plug-and-play, with the key feature of not updating parameters or introducing additional modules.
The TaLo method consists of three main steps:
First, given an L-layer model, sample a small-scale probing set from the target task, and obtain the baseline score on the original model.
Then apply interventions layer-by-layer (the paper mainly uses zeroing), obtaining the intervened model, and calculate the gain brought by this layer.
Finally, select the layer that brings the maximum positive gain as the candidate interfering layer for this task.
If no gain exceeds the threshold (no obvious positive peak), TaLo keeps the original model unchanged; otherwise, TaLo skips the selected layer in all subsequent inference for that task, and final performance is reported on independent test samples.
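The three steps above can be sketched as follows. The `evaluate` callable, the toy evaluator, and the threshold value are hypothetical stand-ins, not the paper's implementation:

```python
# TaLo probing loop: score the baseline, score each single-layer intervention,
# and pick the layer with the largest positive gain (or none at all).

def talo_select_layer(evaluate, num_layers, probe_set, threshold=0.0):
    baseline = evaluate(probe_set, skip_layer=None)
    gains = {l: evaluate(probe_set, skip_layer=l) - baseline
             for l in range(num_layers)}
    best_layer = max(gains, key=gains.get)
    if gains[best_layer] <= threshold:
        return None, baseline      # no clear interfering layer: keep the model
    return best_layer, gains[best_layer]

# Toy evaluator: pretend layer 2 interferes with this task, and skipping
# any other layer slightly hurts.
def toy_eval(probe_set, skip_layer=None):
    if skip_layer is None:
        return 0.60
    return 0.68 if skip_layer == 2 else 0.58

layer, gain = talo_select_layer(toy_eval, num_layers=4, probe_set=[])
print(layer, round(gain, 2))  # selects layer 2 with gain 0.08
```

The key design point is that the whole procedure needs only forward passes over a small probing set; no gradients, parameter updates, or extra modules are involved.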
Since task-layer sensitivity patterns have been shown to be structural and transferable, locating interfering layers with a small number of probing samples typically yields consistent gains on subsequent samples of the same task.
Experimental Validation and Performance
The research team conducted comprehensive evaluations on three mainstream VLM architectures (LLaVA, Qwen-VL, InternVL), using answer accuracy as the core metric. Evaluations were conducted on 5 benchmark datasets across different domains and different few-shot settings. Results show that TaLo brings stable and consistent performance improvements on the vast majority of tasks.
On the LLaVA model, whether in 10-shot, 15-shot, or 20-shot settings, TaLo achieved positive gains on multiple benchmarks including MMStar, MMBench, MMMU, ScienceQA, and SEEDBench, with average performance improvements remaining stable across different shot numbers. Similar trends also appeared in the Qwen-VL model. Although different tasks show varying sensitivity to layer interventions, TaLo still achieved positive gains on most benchmarks, with a maximum performance improvement of 16.6% on the ScienceQA Maps task.
This result indicates that appropriately bypassing specific layers can effectively alleviate the internal interference the model suffers in complex reasoning scenarios. On the larger-scale InternVL model, TaLo also generalized well: across multiple reasoning and understanding tasks, it consistently outperformed the original model under different few-shot conditions, with consistent average improvements, showing that the method does not depend on a specific model scale or architecture.
Notably, all the above performance benefits come entirely from structural interventions during the testing phase, without introducing any parameter updates, additional training, or external supervisory signals. This means TaLo provides a lightweight, stable, and reusable test-time adaptive mechanism that can unlock the model's potential capabilities on specific tasks while keeping model parameters unchanged.
Performance of LLaVA and Qwen-VL on multiple datasets and different tasks:
Performance of InternVL model on multiple datasets and different tasks:
Comparison with Different Methods:
Further comparisons show that in low-shot settings, TaLo outperforms various common parameter-efficient fine-tuning methods in both efficiency and effectiveness. This result indicates that in certain task scenarios, simple and precise structural adjustments may be more effective than complex parameter learning.
It is worth noting that TaLo only requires forward passes; PEFT methods require many forward and backward passes, leading to much higher resource demands for larger models. For a 40B model, inference requires approximately 50GB of GPU memory, while LoRA fine-tuning reaches 80GB of memory usage even with a batch size of 4.
Ablation Experiment 1: Impact of Different Intervention Methods on TaLo Performance
The research team compared three common layer intervention methods: direct parameter zeroing, uniform scaling, and mean substitution. Experimental results show that zeroing and uniform scaling produce similar effects, with parameter zeroing achieving the better average performance across tasks. Mean substitution performs poorly, and with this intervention method a corresponding Task-Interfering Layer sometimes cannot be found at all.
Ablation Experiment 2: TaLo Method with Multi-Layer Search
To supplement the single-layer intervention-based TaLo method, the authors studied multi-layer TaLo interventions. For each task, first use the standard TaLo process to determine the optimal single layer. Then, intervene on this layer, and iteratively apply a second zeroing intervention to every other layer in the LLM backbone, measuring the resulting performance change while keeping all other components unchanged.
This produces a complete pairwise intervention matrix for each task, from which the best two-layer combination is selected. Since computational cost grows quadratically with model depth, exploration is limited to two-layer combinations as a tractable proxy for higher-order interactions. Results show that the benefit of adding a second intervention is very limited and greatly increases consumed resources, so the authors maintain the single-layer intervention design.
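The two-layer extension can be sketched as follows: fix the best single layer, sweep every other layer as a second intervention, and keep the pair only if it beats the single-layer score. The `toy_eval` scoring function is an invented stand-in, not data from the paper:

```python
# Greedy pairwise search on top of the best single-layer intervention.

def two_layer_search(evaluate, num_layers, first_layer):
    best_combo = (first_layer,)
    best_score = evaluate(best_combo)
    for l in range(num_layers):
        if l == first_layer:
            continue
        score = evaluate((first_layer, l))
        if score > best_score:
            best_combo, best_score = (first_layer, l), score
    return best_combo, best_score

def toy_eval(skip_layers=()):
    # Skipping layer 2 helps; each additional skipped layer costs a little,
    # mimicking the finding that a second intervention rarely pays off.
    score = 0.60 + (0.08 if 2 in skip_layers else 0.0)
    score -= 0.01 * sum(1 for l in skip_layers if l != 2)
    return score

combo, score = two_layer_search(toy_eval, num_layers=4, first_layer=2)
print(combo, round(score, 2))  # the single layer (2,) remains best here
```

An exhaustive search over all pairs would scale quadratically with depth, which is why the paper treats two-layer combinations as a tractable proxy for higher-order interactions.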
Analysis Experiment 1: Robustness of TaLo Layer Selection
A natural question is whether the "task-interfering layer" selected by TaLo during the probing phase is merely an artifact of few samples or evaluation noise. To verify robustness, the authors repeated the selection under bootstrap resampling of the probing set and found that the selected interfering layers are highly concentrated: layer selection does not fluctuate significantly under sample perturbations. The paper further reports cross-benchmark transfer validation: interfering layers selected on MMBench's logical reasoning tasks, when applied directly to MMStar's math tasks, still bring positive gains, while layers selected for perception tasks consistently harm math reasoning performance. This consistency across tasks and benchmarks indicates that TaLo identifies not "layers that happen to be optimal for a particular benchmark," but layers whose interference is stable under the task condition, supporting the robustness of TaLo's layer selection both statistically and in practical use.
Analysis Experiment 2: Qualitative Case Studies
Beyond validation of overall performance and statistical stability, researchers further conducted qualitative analysis of TaLo's behavior through specific cases. It can be seen that without layer intervention, the model provides seemingly reasonable but actually incorrect answers. The source of error is not basic arithmetic or missing common sense, but rather the introduction of irrelevant or conflicting information in intermediate reasoning paths, causing final judgments to deviate from correct conclusions. After applying TaLo, the model can more stably focus on key information related to the task on the same input, and output results consistent with standard solutions. This improvement does not come from a more "complex" reasoning process, but rather the opposite: by suppressing interfering layers under specific tasks, the model's intermediate reasoning steps become more concise, coherent, and more aligned with the logical sequence humans use when solving problems. These cases intuitively demonstrate that TaLo does not "inject new knowledge," but effectively avoids inappropriate information routing within the model during the testing phase, thereby improving the reliability and consistency of reasoning results.
Implications and Significance
TaLo's research results reveal a noteworthy fact: Large-scale pretrained models inevitably contain task-to-task representational conflicts. Certain layers may be reasonable compromise solutions under overall pretraining objectives, but can become performance bottlenecks in specific downstream tasks. By performing targeted suppression of these layers during the testing phase, the model can instead focus more on the capabilities genuinely required by the task.
From a broader perspective, this work not only proposes a practical test-time adaptation method, but also offers a new perspective on the internal functional organization of vision-language models. It reminds us that in the era of large models, performance gains do not always depend on "more parameters" or "deeper structures"; sometimes appropriate simplification can instead unlock capabilities the model already has but that are masked.