The Tencent Hunyuan team has just released the Hunyuan Wuxiang architecture (HY-WU), which lets a large model generate task-exclusive parameters in real time during inference, switching its "brain" in a matter of seconds.
This is a brand-new functional memory paradigm that allows large models to retain their original capabilities when facing new tasks.
By generating personalized parameters in real-time, it completely breaks the limitations of traditional static weights.
Model Memory Requires Dynamic Generation
Large models have always faced a thorny problem on the path of continuous evolution: learning new knowledge leads to forgetting old skills. This phenomenon is known in academia as catastrophic forgetting.
Imagine a top chef who has spent years mastering Chinese stir-frying. When he starts intensive practice in French pastry baking, he returns to the Chinese kitchen only to find he can no longer master even the most basic heat control. The parameter space of large models is like the chef's muscle memory.
Traditional fine-tuning techniques or PEFT (Parameter-Efficient Fine-Tuning) attempt to forcibly cram all new skills into the same brain region. This cover-style repeated erasing and rewriting easily leads to gradient conflicts between old and new knowledge.
Forgetting is not the only problem; models also face trade-offs in personalization. The needs of different users and different fields vary greatly. After being strengthened for rigorous programming logic, a large language model often loses ground in divergent thinking or stylized generation, gaining one ability at the expense of another.
In the field of image editing, this seesaw effect is equally obvious. Enhancing denoising capabilities often damages the model's retention of artistic styles. Facing personalized demands from thousands of different faces, forcibly fitting all distributions with a single shared parameter can only result in a mediocre outcome that compromises all parties.
Current mainstream solutions have all hit the ceiling of the static weight paradigm. LoRA reduces training costs, but during inference, all samples still share the same set of fixed parameter updates. The one-size-fits-all approach is powerless when dealing with highly heterogeneous tasks.
RAG (Retrieval-Augmented Generation) injects background information into the model through external storage. This merely changes the content the model reads. When the core of the task lies in processing rules rather than supplementing facts, simply adding context cannot fundamentally change the model's internal computation logic.
Training an independent LoRA adapter for each task seems to avoid conflicts. However, the storage overhead then grows with every new task, and maintaining thousands of separate adapters quickly becomes untenable.
MoE (Mixture of Experts) models call different expert networks through routing mechanisms, but this is still a zero-sum game within a limited parameter space.
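The one-size-fits-all limitation of static LoRA can be made concrete with a minimal NumPy sketch (illustrative only; the names and shapes are ours, not from any particular LoRA implementation): the low-rank update is fixed once training ends, so every input, whatever the task, flows through the same adapted weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # toy hidden size and LoRA rank

W = rng.standard_normal((d, d))        # frozen base weight
A = rng.standard_normal((r, d)) * 0.1  # trained LoRA down-projection
B = rng.standard_normal((d, r)) * 0.1  # trained LoRA up-projection

def lora_forward(x):
    # Static LoRA: delta_W = B @ A was fixed at training time, so every
    # sample -- denoising or stylizing alike -- shares one update direction.
    return x @ (W + B @ A).T

x_denoise = rng.standard_normal(d)  # stand-in for a "denoise" input
x_stylize = rng.standard_normal(d)  # stand-in for a "stylize" input
y1, y2 = lora_forward(x_denoise), lora_forward(x_stylize)
```

However heterogeneous the inputs, the effective weight `W + B @ A` never changes, which is exactly the ceiling described above.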
Facing these pain points, the Tencent Hunyuan team pinpointed the core issue.
The core of the adaptation problem lies not in the optimization algorithm itself, but in the underlying design of the memory interface. They proposed the HY-WU (Hunyuan Wuxiang) paradigm.
HY-WU introduces a brand-new concept of functional memory. This paradigm no longer pursues finding a universal fixed parameter point; instead, it learns a powerful parameter generator.
The entire adaptation process becomes a pipeline that synthesizes specific operator weights in real-time based on input conditions.
The model can perform dynamic routing within the weight space according to different specific instances.
This completely avoids repeated erasing and mutual interference on shared parameters.
Image Editing Validates Paradigm Leap
The team chose text-guided image editing as the first stress test field.
Image editing naturally exposes various limitations of static weights. Different editing instructions often represent completely mutually exclusive transformation directions in the parameter space.
Restoring old photos requires extreme denoising and color restoration. Aging new photos requires adding noise and fading filters. Forcing a single static adapter to learn both tasks puts the model in an awkward situation where it achieves neither.
Similarly, stylistic instructions perform differently on different images. Turning a cat into a cyberpunk style versus turning a landscape painting into a cyberpunk style involves huge differences in the underlying pixel transformation logic. Extreme sensitivity to samples is a major characteristic of image editing.
HY-WU abandons the old path of storing data and moves towards a new world of storing operator mappings. Functional memory is no longer a fixed knowledge point; it has evolved into a dynamic conditional mapping mechanism.
The framework includes a parameter generator built on the Transformer architecture. This generator does not memorize fixed weight values; it specializes in learning how to synthesize the most suitable operator weights for each specific instance.
The model first keenly perceives the current input image and editing instructions. It fuses this information into mixed conditional features. The generator then calculates a set of exclusive LoRA parameters in real-time during inference based on these features.
This customization takes only a few seconds on a base model with billions of parameters. The freshly generated personalized parameters are immediately mounted onto the frozen base model, completing a precise editing transformation free of any historical baggage.
Previous parameter generation work mostly required collecting massive model checkpoints to assist training. HY-WU adopts an extremely elegant end-to-end training mode. It completely eliminates dependence on historical snapshots.
To meet the demand of generating parameters at near-billion scale, the research team designed a decomposed self-attention mechanism. This design significantly improves computational efficiency and keeps the model's computational complexity under control.
From the perspective of functional memory, the adaptation goal is upgraded to learning a mapping network from conditions to parameter updates. The team defines this as a Conditional Update Family.
This method induces a structured parameter manifold. The generated parameters exhibit a semantically structured beauty in the weight space. Functionally similar editing operations, such as animal deformation or style transfer, automatically cluster in adjacent regions.
This geometric consistency in weight space confirms the success of functional memory. The system can resolve interference by routing to different regions of the update family when facing conflicting goals. It no longer needs to compromise performance by making concessions.
The engineering deployment of the entire system also demonstrates high flexibility.
It does not need to store hundreds or thousands of LoRA weights for loading at deployment time.
The separated mounting of functional memory ensures both personalization and extreme storage lightweighting.
Evaluation Data Confirms Technical Strength
Researchers applied this technology to a native multimodal base model, HY-Image-3.0-Instruct. This model has a massive scale of 80 billion parameters, with 13 billion activated parameters.
To support complex image editing, the team introduced a Transformer parameter generator with 8.11 billion parameters.
This network can generate rank-16 LoRA weights, 720 million parameters in total, for all linear modules, giving the model extremely high flexibility and accuracy.
HY-WU shines in many practical scenarios such as social games and advertising. In personalized try-on and face-swapping scenarios, it demonstrates astonishing feature consistency.
Comparison between HY-WU and Seedream 4.5, GPT Image 1.5, and Nano Banana 2. In personalized scenarios such as try-on, fitting, and face-swapping, HY-WU demonstrates stronger feature consistency, fully showcasing its adaptation capabilities and providing users with more imagination space.
The research team constructed an extremely rigorous evaluation system. The test covers two major tracks: single-image and multi-image editing. It includes 346 single-image sets and 64 multi-image editing pairs.
The test involves 60 fine-grained editing sub-tasks and supports bilingual instructions in Chinese and English. The comparison set gathers the strongest first-tier models currently on the market, including OpenAI's GPT-Image-1.5 and Google's Nano Banana Pro.
In human evaluations representing real user perception, HY-WU's performance is remarkable. Data shows it significantly outperforms all mainstream open-source models. Its sensory quality is only slightly inferior to the top-tier closed-source model Nano Banana.
The results on automated benchmarks also confirm its strength. In the GEdit-Bench Chinese test, it ranked first among open-source models across three major dimensions: semantic consistency, perceptual quality, and overall score.
Its six core metrics even surpassed closed-source models Seedream 4.5 and Nano-Banana-Pro.
In the 9 fine-grained editing tasks of ImgEdit-Bench, it took first place among open-source models in 5 tasks and second place in 1.
Its total score ranked second among all public models, a negligible 0.11 points behind the closed-source leader GPT Image 1.5.
HY-WU is not only suitable for native multimodal models but also brings significant performance leaps on traditional MMDiT (Multimodal Diffusion Transformer) architectures. It perfectly follows the scaling law growth logic.
As the depth of the parameter generator increases, model performance continues to climb. By expanding the rank of LoRA and increasing the generated parameter scale from 120 million to 470 million, the model performance shows a clear positive correlation growth trend.
Intelligent Architecture Moves Towards Functional Modularity
The Hunyuan team's exploration does not stop at the field of image editing. They have painted a grand blueprint for future AI centered on functional neural memory. The paradigm of large model architecture is undergoing profound changes.
Retrieval memory is responsible for storing factual knowledge. Functional memory is responsible for storing transformation logic. The two form a perfect complementary relationship in their operation mechanisms.
Call retrieval memory when factual details and specific examples are needed. Activate functional memory when complex transformation rules and precise process control are required. This provides solid underlying support for the flexible response of operators.
In the long run, functional memory will completely solve the problem of online continuous learning. When the system processes a continuous stream of new tasks, it can safely write new skills into the blank areas of the update family. The old capability matrix will not suffer any irreversible damage.
Mere accumulation of backbone network parameters is not the only path to Artificial General Intelligence. Jointly scaling the backbone model with functional memory modules is more computationally and data-efficient than simply expanding a single model.
Functional memory allocates conditional operator capacity. Rare or highly conflicting behaviors no longer need to be forcibly solidified in shared weights. The model's conflict robustness and personalization capabilities will achieve a qualitative leap.
This paradigm has broad generalization potential in the cross-modal field. Video models often face huge balancing pressures when processing temporal attention layers. Introducing functional memory allows the model to generate dynamic operator offsets for specific action sequences.
Visual question answering and multimodal interaction tasks need to process highly heterogeneous input signals. Functional memory can real-time and precisely adjust the parameter weights of cross-modal fusion layers based on the specific proportion of input modalities.
In long sequence generation or complex agent interactions, maintaining identity consistency is a world-class challenge. Functional memory can be used to specifically store identity operators.
When the system identifies a specific entity, the generator instantly synthesizes a set of exclusive parameter constraint networks. The character's facial details and material textures will remain stable throughout long-span generation across scenes. This completely eliminates the risk of feature drift.
Shifting computational pressure from static weights to dynamic parameter generation poses new challenges for hardware inference. Dynamically generated parameters can easily lead to fragmentation of memory access patterns.
Developing customized operator fusion technologies designed for dynamic LoRA weights is particularly critical. This can significantly reduce the time overhead caused by parameter switching. Deep integration with high-performance inference engines will further optimize collaboration efficiency.
Reducing the latency and power consumption of parameter generation is the final hurdle for the implementation of this technology. When personalized real-time adaptation for thousands of different faces can run smoothly on end-side devices like mobile phones, intelligent computing will truly integrate into daily life.
Completely releasing model parameters from static constraints may be the necessary path to stronger intelligence.
References:
https://tencent-hy-wu.github.io/
https://github.com/Tencent-Hunyuan/HY-WU