Source | AI Technology Review
Author | Zheng Jiamei
Editor | Cen Feng
You may have encountered a situation where a model performs excellently on a single task, but its performance begins to degrade as new tasks are continuously added. It doesn't fail completely; instead, it becomes unstable, with certain capabilities declining and results fluctuating. The issue isn't that the model never learned the task, but that what it originally knew gets partially "crowded out" by the newly added tasks.
Behind this lies a rarely clarified problem: a model's capabilities are not stored in independent blocks but share the same internal representation space. Simply put, all tasks are "using the same storage area for information."
As the number of tasks increases, they don't exist side-by-side; instead, they compete for the most critical positions within that space. Whoever occupies more space remains more stable; whoever gets crowded out sees their performance drop. This is why multi-task fusion often doesn't mean "the more, the stronger," but rather "the more, the messier."
In reality, this problem is ubiquitous. For instance, in a continuously iterating system where new capabilities are constantly added, each addition might seem like just "doing one more thing," but it actually involves reallocating the internal representation space. Without a good mechanism, new capabilities often impair old ones, turning the system into something that requires constant patching rather than natural expansion.
Against this backdrop, the team led by Geng Xin from Southeast University proposed the paper "Model Merging in the Essential Subspace". Instead of pursuing more complex parameter fusion, they shifted focus to a more critical question: "Where exactly is the important information within the model?"
In recent years, the "Learning Gene" concept proposed by Geng Xin's team (sharing core critical model parameters across multiple tasks) has offered new solutions to this problem. The research team discovered that a model's effective capabilities are not uniformly distributed across all parameters but are concentrated in a few key directions—these directions are what truly determine task performance.
Once understood this way, the problem becomes clear. Multi-task fusion fails not because parameters aren't merged well, but because these key directions overlap and conflict. Thus, this research boils down to two steps: first, separating the important directions of different tasks as much as possible to avoid mutual encroachment; second, ensuring more important information is retained while less important parts are suppressed. In this way, different tasks can coexist stably within the same model.
From this perspective, what this work truly changes is not just the method itself, but the way we look at the problem. It transforms model merging from simple parameter manipulation into a question of how information is allocated and how it coexists, allowing multi-task systems to potentially "grow" more capabilities without mutual interference, rather than just constantly stacking them.
Paper Link: https://arxiv.org/pdf/2602.20208
Model Fusion Failure: The Root Cause is Subspace Conflict
From the experimental results, what the research team truly aims to demonstrate is not that their method happens to achieve higher scores in one specific setting, but that as the difficulty of multi-task fusion increases, ESM (Essential Subspace Merging) maintains more stable performance.
Traditional methods often suffer significant performance drops as tasks increase; the more tasks, the stronger the mutual interference, ultimately leading to a rapid decline in fusion effectiveness. In contrast, ESM performs more stably. While performance loss in other baselines typically reaches 8% to 9%, ESM's loss is noticeably smaller, with an overall loss rate about 20% lower than the baseline. This indicates that ESM is better at resisting the mutual interference that grows with the number of tasks, preserving more of the effective knowledge originally belonging to each task.
When considering model scale together, the research results reveal another point. On larger models, where various methods can already achieve scores above 90, the improvement of ESM over existing strong methods narrows to approximately 0.3 to 0.5.
This does not mean ESM's effect weakens; rather, it shows that large models inherently have stronger representation capacity, so their task subspaces separate more naturally. ESM's value therefore lies in explicitly constructing a clearer subspace decoupling mechanism, which matters most when model capacity is limited.
The research also provides a crucial upper and lower bound reference. The performance of un-fine-tuned models is approximately between 50% and 65%, while single-task fine-tuned expert models reach above 90%. ESM achieves between 81% and 91%.
This indicates that ESM is not just slightly better than the baseline but comes close to the ideal upper bound of multi-task fusion. In other words, ESM is quite close to the goal of "maintaining single-task performance even after merging multiple tasks."
To explain why it performs better, researchers conducted ablation studies. Simply changing the decomposition method from SVD to ESD increased performance from 89.0 to 90.9, an increase of 1.9, indicating that subspace selection is the primary key to the problem.
Afterward, adding Polarized Scaling further improved performance from 90.9 to 91.8, an additional 0.9 increase. This suggests that ESD mainly addresses information loss, while Polarized Scaling primarily resolves information competition. In other words, the entire method is effective because it simultaneously handles the two core issues of "what information to retain" and "how to make this information coexist."
Looking further into the internal mechanism, the study found that ESD retains more effective information even when keeping fewer components, whereas traditional SVD requires retaining more dimensions to achieve similar effects. This shows that ESD finds a more concentrated and efficient information representation; truly critical task knowledge is not uniformly distributed across all directions but concentrated in a few functionally stronger directions.
The study also found that even when retaining only 5% of the components, the fusion model obtained by ESD still maintains higher feature consistency with the original expert models. This indicates that what ESD retains is not the superficial parameter structure, but something closer to the task semantics and the model's behavior itself.
Regarding data dependency, the research team provided strong evidence. Whether using normally sampled data, biased data containing only a single category, or completely unrelated external distribution data, the results showed almost no significant difference.
Meanwhile, performance already exceeded the baseline with just 1 sample and approached optimality with only 4 samples, basically converging after increasing to 32 samples. This indicates that the task subspace itself is a low-dimensional structure that does not require large amounts of data to estimate; the model internally already encodes stable task response patterns.
Finally, regarding Polarized Scaling, the study shows that amplifying strong signals alone improves performance, and suppressing weak noise alone also improves performance, but combining both yields the best results. This suggests that multi-task fusion is essentially not simply averaging parameters, but more like a process of signal filtering and re-weighting.
Reconstructing Knowledge Boundaries Within the Model
In the experiments, the research team systematically controlled three core variables to test which combination comes closest to "lossless fusion."
The first variable is the subspace construction method, comparing SVD and ESD. The former is built on parameter space, while the latter is built on output space; this is the core comparison of the entire study.
The second variable is the fusion method, comparing direct concatenation versus orthogonalization. The former is more direct, while the latter attempts to eliminate correlations between different task representations. The third variable is weight allocation, comparing unweighted fusion versus norm-based scaling. The core question here is whether different task information should be treated equally during merging.
The research team also chose the data in a deliberately targeted way. The selected tasks cover vastly different types such as images, text, and numbers, for example, Cars, SUN397, SST2, and MNIST. The goal was not to pursue task richness itself, but to maximize the differences and conflicts between tasks. Only if the method remains effective under such high-heterogeneity conditions can it be said that the researchers truly solved the interference problem, rather than achieving local improvements only on similar tasks.
At the same time, when constructing proxy data, the researchers deliberately kept the data scale very small, using only 32 unlabeled samples per task. This design was not just to save computation; more importantly, it was to verify whether the subspace comes from the model's intrinsic structure or merely from statistical results of the data. Subsequent experimental results confirmed that what the researchers extracted is indeed closer to the task structure already formed inside the model.
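To make the parameter-space versus output-space contrast concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: it treats a single linear layer, uses a hypothetical task vector `delta_w` (fine-tuned minus pretrained weights), and substitutes 32 random vectors for the proxy inputs. The function names, and the choice to take the principal directions of the output shift `x @ delta_w.T` as the "output-space" basis, are assumptions made for illustration only.

```python
import numpy as np

def svd_basis(delta_w, k):
    """Top-k left singular vectors of the weight update (parameter space)."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]                                  # shape (d_out, k)

def output_space_basis(delta_w, x, k):
    """Top-k principal directions of the output shift on proxy inputs (assumed ESD-like)."""
    shift = x @ delta_w.T                            # shape (n_samples, d_out)
    _, _, vt = np.linalg.svd(shift, full_matrices=False)
    return vt[:k].T                                  # shape (d_out, k)

def retained_shift_energy(basis, delta_w, x):
    """Fraction of the output shift's squared norm kept after projecting onto the basis."""
    shift = x @ delta_w.T
    projected = shift @ basis @ basis.T
    return np.linalg.norm(projected) ** 2 / np.linalg.norm(shift) ** 2

# Hypothetical sizes: one layer, 32 unlabeled proxy samples, 4 retained directions.
rng = np.random.default_rng(0)
d_out, d_in, n_samples, k = 64, 48, 32, 4
delta_w = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(n_samples, d_in))

energy_param = retained_shift_energy(svd_basis(delta_w, k), delta_w, x)
energy_output = retained_shift_energy(output_space_basis(delta_w, x, k), delta_w, x)
print(f"parameter-space basis keeps {energy_param:.3f} of the output shift")
print(f"output-space basis keeps    {energy_output:.3f} of the output shift")
```

By construction, the output-space basis keeps at least as much of the output shift on these proxy inputs as any other orthonormal basis of the same size. Whether this matches the paper's ESD step in detail is not something the article specifies.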
To ensure that different tasks have relatively fair expression capacity during fusion, the research team designed a rank allocation strategy, assigning dimensions to each task such that k = total dimensions / number of tasks. This step is essentially about fair resource allocation, because without such constraints, strong tasks would easily occupy more representation space, while weak tasks might be drowned out.
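The allocation rule can be written as a one-line helper. This is a small sketch, assuming integer (floor) division and a minimum of one dimension per task; the article only gives k = total dimensions / number of tasks, so the rounding behavior and the function name `allocate_rank` are assumptions.

```python
def allocate_rank(total_dims, num_tasks):
    """Give every task the same share k of the available dimensions."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    # Floor division is an assumption; the article does not say how to round.
    return max(1, total_dims // num_tasks)

# Hypothetical example: a 768-dimensional layer shared by 8 or 20 tasks.
print(allocate_rank(768, 8))    # 96
print(allocate_rank(768, 20))   # 38
```

With equal shares, a task with large weight updates cannot claim more directions than a task with small ones, which is the fairness constraint described above.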
Entering the fusion stage, researchers found that simple concatenation brings two direct problems: first, different task subspaces may overlap; second, this overlap triggers information conflicts. Therefore, an orthogonalization step was added, the essential function of which is to force different task subspaces to be as independent as possible. This process is conceptually very close to PCA whitening or signal decorrelation.
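The concatenate-then-orthogonalize idea can be sketched as follows. The article does not say which orthogonalization procedure the paper uses, so this example swaps in a standard one as an assumption: symmetric orthogonalization through the SVD, which replaces the stacked matrix with the nearest matrix that has orthonormal columns and is closely related to whitening. The function name `orthogonalize` and the random bases are hypothetical.

```python
import numpy as np

def orthogonalize(bases):
    """Concatenate per-task bases and make all columns mutually orthonormal."""
    stacked = np.concatenate(bases, axis=1)          # shape (d, total_k)
    if stacked.shape[1] > stacked.shape[0]:
        raise ValueError("total rank exceeds the layer dimension")
    # U @ Vt is the closest orthonormal-column matrix to `stacked`.
    u, _, vt = np.linalg.svd(stacked, full_matrices=False)
    return u @ vt

# Two hypothetical task bases in a 16-dimensional space, 4 directions each.
rng = np.random.default_rng(0)
b1, _ = np.linalg.qr(rng.normal(size=(16, 4)))
b2, _ = np.linalg.qr(rng.normal(size=(16, 4)))

overlap_before = np.abs(b1.T @ b2).max()             # nonzero: the subspaces overlap
merged = orthogonalize([b1, b2])
gram = merged.T @ merged                              # identity after orthogonalization
print(f"largest cross-task overlap before: {overlap_before:.3f}")
print(f"largest off-diagonal entry after:  {np.abs(gram - np.eye(8)).max():.2e}")
```

A symmetric procedure was chosen here so that no task is favored by its position in the list; an order-dependent method such as QR would also yield orthonormal columns but would leave the first task's directions untouched.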
Finally, regarding weight adjustment, the research team further discovered from experiments that high norms often correspond to more important parameter changes, while low norms are closer to noise. Thus, they designed a rule where scaling ∝ (norm / average)^2. Furthermore, they implemented this scaling across three levels: at the task level to prevent certain tasks from being drowned out, at the dimension level to highlight more critical feature directions, and at the layer level to reduce interference caused by residual structures.
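The scaling rule is easy to check numerically. Below is a minimal sketch, assuming a proportionality constant of 1 and the arithmetic mean as the "average"; the function name `polarized_scale` and the example norms are hypothetical. The same function could in principle be applied to per-task, per-dimension, or per-layer norms, though how the paper combines the three levels is not described in the article.

```python
import numpy as np

def polarized_scale(norms, eps=1e-12):
    """Scaling factor (norm / mean norm)^2: above-average norms grow, below-average shrink."""
    norms = np.asarray(norms, dtype=float)
    return (norms / (norms.mean() + eps)) ** 2

# Hypothetical norms of three components; the mean is 2.
factors = polarized_scale([1.0, 2.0, 3.0])
print(factors)   # approximately [0.25, 1.0, 2.25]

# Dimension-level use (assumed): rescale each column of a component matrix.
components = np.array([[1.0, 0.0, 3.0],
                       [0.0, 2.0, 0.0]])
column_norms = np.linalg.norm(components, axis=0)
rescaled = components * polarized_scale(column_norms)
```

Squaring the ratio is what makes the scaling "polarized": a component at half the average norm is cut to a quarter of its size, while one at 1.5 times the average is more than doubled.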
Multi-Task Coexistence Mechanism in Limited Space
Overall, the value of this research lies not only in proposing a stronger model fusion method but in advancing model fusion from parameter splicing to the level of knowledge structure reorganization. Many past methods assumed that the key to merging multiple models was whether parameters could be averaged well. However, this study shows that what truly needs to be retained are the key functional directions the model relies on when processing inputs.
In other words, the researchers have redefined the model fusion problem. The focus is no longer just on the parameters themselves, but on how model capabilities exist and coexist within the representation space.
This research also explains multi-task interference more precisely. It was already well known that too many tasks tend to drag one another down, but understanding often stopped at the observation that tasks "conflict." The research team further pointed out that interference mainly stems from two causes: first, different tasks encroach upon similar representation directions; second, strong and weak information compete during fusion, resulting in important knowledge being drowned out by noise.
The significance of ESD is separating the core directions of different tasks as much as possible, while the significance of PS (Polarized Scaling) is amplifying more important signals and suppressing less important parts. Therefore, what this research truly accomplishes is connecting the sources of interference and the paths to resolution into a complete explanation.
Looking at a deeper level, the research results also reveal a very important property inside deep models: although task knowledge exists in high-dimensional parameters, the changes that truly determine performance are often concentrated in a few directions, and this structure is not sensitive to specific data.
This indicates that the interior of large models is not chaotic; rather, there exists a low-dimensional structure that can be refined, compressed, and recombined. The significance of this discovery is huge, as it means that future model improvements may not always require bigger data and longer training; capabilities can also be enhanced by understanding the existing knowledge organization methods within the model.
The impact of this research on ordinary people is also tangible. It means that future AI systems are more likely to integrate multiple capabilities into a single model without repeated retraining, and that the integration will be more stable, so adding a new function is less likely to damage existing capabilities.
For ordinary users, this will make AI tools more like comprehensive general assistants rather than many fragmented small tools. For enterprises and platforms, this may also reduce deployment costs and computational consumption, ultimately showing up as cheaper services, faster responses, and intelligent functions that can run locally on more devices.
Therefore, the truly important aspect of this research is not just achieving slightly higher results, but proving that model fusion can move from empirical parameter processing to an understanding and reorganization of knowledge structures. This not only advances academic understanding of internal model mechanisms but will also affect how ordinary people use AI in the future.
The Researchers Behind ESM
The corresponding author of this paper is Geng Xin, a Chief Professor at Southeast University, Executive Deputy Dean of the Graduate School of Southeast University, and Director of the Key Laboratory of New Generation Artificial Intelligence Technology and Interdisciplinary Applications of the Ministry of Education.
He obtained his Bachelor's and Master's degrees from Nanjing University in 2001 and 2004, respectively, and his Ph.D. from Deakin University in Australia in 2008. He has since taught and conducted research at Southeast University, where he founded the Pattern Analysis and Machine Learning (PALM) Laboratory.
His research has long centered on machine learning, large models, pattern recognition, computer vision, and related directions. He has published more than 230 papers in important international journals and conferences. He has received honors such as the National Science Fund for Distinguished Young Scholars, the National Science Fund for Excellent Young Scholars, the Second Prize of the National Natural Science Award, the First Prize of the Ministry of Education's Natural Science Award, the First and Second Prizes of the National Teaching Achievement Award, the Science Exploration Award, and the First Prize of the Wu Wenjun Artificial Intelligence Natural Science Award. He also serves as program committee chair and area chair for multiple international conferences and as an editorial board member for several journals.
In his research work, he focuses on knowledge representation and reorganization within models. His early representative work concentrated on label distribution learning, advancing traditional single-label or multi-label learning problems to label distribution learning with finer-grained representation. Later, he gradually expanded his research focus to edge-side large models and "Learning Genes," exploring the extraction of inheritable and reusable core capabilities from foundation models to achieve efficient deployment across different tasks and hardware conditions.
Reference Link: https://palm.seu.edu.cn/xgeng/
Another corresponding author is Qi Lei, an Associate Researcher and Master's Supervisor at the School of Computer Science and Engineering, Southeast University. He obtained his Bachelor's degree from Nanjing Normal University and his Master's degree from Nanjing University of Science and Technology. He received his Ph.D. from Nanjing University in 2020 and visited the University of Wollongong in Australia during his doctoral studies.
In terms of academic achievements, Qi Lei has published more than 60 papers in ACM/IEEE transactions and CCF-A class conferences, with over 5,300 Google Scholar citations, and has hosted multiple national and provincial/ministerial level scientific research projects. At the same time, he has been selected for talent programs such as the National Postdoctoral Researcher Funding Plan, Jiangsu Province Outstanding Postdoctoral, and Southeast University Zijin Scholar. He has also received awards such as the CCF Industry-Academia Cooperation Fund Excellent Project Case and the Jiangsu Province Artificial Intelligence Society Outstanding Doctoral Dissertation Award.
In terms of research direction, Qi Lei's work mainly focuses on computer vision and pattern recognition. In recent years, he has mainly concentrated on anomaly detection, semantic segmentation, domain generalization, and vision-language models.
Reference Link: https://palm.seu.edu.cn/qilei/