Injecting Continuous New Knowledge into Large Models: Beihang's CASE Framework Edits Thousands of Times Without Forgetting, Adding Less Than 1MB of Parameters | WWW'26

"Starbucks has a new CEO," "Latest research results released"...

When Large Language Models (LLMs) need to continuously absorb new knowledge, they often fall into two dilemmas after multiple updates:

Either they forget previously learned content due to parameter update conflicts, or they attach a massive number of parameters to avoid forgetting, leading to excessive computational resource consumption.

The latest CASE framework proposed by a team at Beihang University offers a solution: it "scores" each edit, storing conflicting knowledge separately while letting non-conflicting knowledge share space; at the same time, it tunes only the "key neurons" most sensitive to the current knowledge, so that irrelevant parameters are not perturbed.

This method effectively addresses the core pain point of the "Lifelong Model Editing" task for LLMs. The research paper, titled "CASE: Conflict-assessed Knowledge-sensitive Neuron Tuning for Lifelong Model Editing," has been accepted by the prestigious international conference WWW 2026 (The ACM Web Conference 2026).

Diagram illustrating the CASE framework concept

Experiments show that after 1,000 consecutive knowledge edits on an LLM, CASE improves average accuracy by nearly 10% compared to existing state-of-the-art methods, while maintaining parameter efficiency with less than 1MB of additional parameters.

The "Dilemma" of Lifelong Editing: Why Do Existing Methods Forget Frequently After Multiple Updates?

"Knowledge aging" and "factual hallucinations" in large models are nothing new, but the goal of "Lifelong Model Editing" is even more demanding: enabling LLMs to continuously learn new things or correct knowledge like humans, without losing previously edited knowledge or interfering with unrelated capabilities.

Existing mainstream methods have consistently failed to overcome two major challenges:

"Blindly Adding Parameters": To preserve pre-trained knowledge, existing model editing methods typically store knowledge updates in additional parameters. But over multi-batch lifelong editing, these methods either keep adding new parameter subspaces at a fixed per-batch rate, consuming ever more computational resources, or cram large amounts of knowledge into the same space regardless of whether the updates conflict, producing "catastrophic forgetting."

"Indiscriminate Parameter Tuning": During each batch's knowledge update, existing methods locate knowledge-related parameters only at the layer level, indiscriminately updating all neurons in that layer regardless of the specific piece of knowledge. This disperses the gradients of the "key neurons" that should be the focus of adjustment, while gradient conflicts on locally irrelevant neurons gradually accumulate, making forgetting progressively worse as edits mount.

The CASE team points out that the root cause of both issues is that existing methods never quantify the "editing conflicts" between different pieces of knowledge: they neither measure whether two knowledge updates contradict each other nor identify which neurons actually need tuning.

Comparison of existing methods vs CASE

Core Breakthrough: Breaking the Deadlock with "Conflict Quantification" + "Sensitive Tuning" Dual Modules

Overview of the CASE architecture

The key to the CASE framework is adding a "conflict assessment brain" and a "precision tuning tool" to lifelong editing, with two core components working together to solve global and local conflicts:

1. CAA Module: "Scoring" Editing Conflicts to Rationally Allocate Parameter Space

The core of the Conflict-Assessed Editing Allocation (CAA) module is "quantifying conflicts and allocating on demand." For each piece of new knowledge to be edited, drawing on gradient theory from multi-task learning, it uses gradient directions to represent the update trend of knowledge on the model. It first calculates whether the new knowledge contradicts previous parameter subspaces, then decides whether to share space or create a new one.

How is this done specifically? The team designed two key metrics to measure the update direction of the new knowledge (x_t, y_t) and of the previous parameter subspaces relative to the original model, respectively:

  • Update Direction of Parameter Subspace (E_i^{t-1}): Measures the degree to which the existing i-th subspace deviates from its initial weights after the previous t-1 edits, reflecting the knowledge this space has "remembered." This is obtained by calculating the difference between the subspace parameter matrix ΔW_i^{t-1} and the model's initial subspace ΔW_0^0:

E_i^{t-1} = ΔW_i^{t-1} − ΔW_0^0

  • Edit Gradient (G_t): Calculates the loss gradient matrix of the new knowledge (x_t, y_t) on the model's initial subspace, representing the update direction and magnitude of the new knowledge on the model.

G_t = ∇_{ΔW_0^0} ℒ(x_t, y_t)

Then, through cosine similarity:

c_i^t = ⟨G_t, E_i^{t-1}⟩ / (‖G_t‖ · ‖E_i^{t-1}‖)

The system "scores" the "editing conflict" and allocates subspaces according to the following rules:

Subspace allocation rules

  • If c_i^t ≥ 0: The new knowledge is compatible with the existing knowledge in the subspace; it directly shares this space to avoid subspace fragmentation.

  • If c_i^t < 0: A conflict exists between the two; a new subspace is created to isolate them, preventing "old knowledge from being washed away."

This design fundamentally solves the problem of "blindly dividing space": it neither forces conflicting knowledge to crowd together nor lets the number of subspaces spiral out of control, which in turn substantially reduces routing difficulty during inference.
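The allocation logic above can be sketched in a few lines. This is a minimal illustration with plain NumPy arrays standing in for the gradient and subspace matrices; the function names and the choice of sharing the most-aligned compatible subspace are assumptions for clarity, not the paper's exact implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two parameter-shaped matrices, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def allocate_subspace(edit_grad, subspace_deltas):
    """Share a compatible subspace (cosine >= 0) or request a new one.

    edit_grad:       G_t, gradient of the new edit's loss w.r.t. the initial subspace
    subspace_deltas: list of E_i = (current subspace weights - initial weights)
    Returns (index, is_new): chosen subspace index, or len(list) for a fresh one.
    """
    if not subspace_deltas:                      # no subspaces yet: create the first
        return 0, True
    scores = [cosine_similarity(edit_grad, E) for E in subspace_deltas]
    best = int(np.argmax(scores))
    if scores[best] >= 0.0:                      # compatible: share the most aligned space
        return best, False
    return len(subspace_deltas), True            # all conflict: isolate in a new subspace
```

An edit whose gradient points the same way as a subspace's accumulated update shares that space; an opposing gradient gets a new one, which is exactly how conflicting knowledge stays isolated.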

2. KNT Strategy: Tuning Only "Key Neurons" to Eliminate Local Conflicts

The Knowledge-sensitive Neuron Tuning (KNT) strategy focuses on "precision tuning." Instead of updating all subspace parameters, it only identifies the neurons "most sensitive" to the current knowledge, refining knowledge localization from "layer-wise" to "neuron-wise," thereby avoiding instability in the parameter space caused by updating irrelevant parameters.

The team uses the Fisher Information Matrix (FIM) to measure neuron sensitivity: a higher Fisher value indicates that small changes to a neuron have a larger impact on the model's prediction, marking it as a "key node" for the current knowledge. To balance efficiency, they use a diagonal approximation of the FIM (greatly reducing computation), then dynamically set a threshold via the entropy of the gradient distribution to generate a sensitive neuron mask M_t, so that only high-sensitivity neurons participate in the update.

Formula for Sensitive Neuron Mask

Additionally, KNT incorporates knowledge activation regularization: it quantizes and stores historical knowledge activation values (converting float32 to int8, reducing storage by 75%). During updates, it uses KL divergence to constrain the difference between new and historical activation values, ensuring that "old knowledge does not drift" after tuning.

Knowledge activation regularization process
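The quantize-then-constrain step can be sketched as follows. Symmetric per-tensor int8 quantization and a softmax-normalized KL term are assumptions chosen for illustration; the paper's exact quantization scheme and normalization may differ:

```python
import numpy as np

def quantize_int8(acts):
    """Symmetric per-tensor quantization of float32 activations to int8 (~75% smaller)."""
    scale = np.abs(acts).max() / 127.0 + 1e-12
    q = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def activation_kl(new_acts, stored_q, stored_scale):
    """KL divergence constraining new activations toward stored historical ones.

    Both activation vectors are softmax-normalized into distributions so the
    KL term penalizes drift away from the activation profile recorded at edit time.
    """
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p = softmax(dequantize(stored_q, stored_scale))  # historical distribution
    q_dist = softmax(new_acts)                       # current distribution
    return float(np.sum(p * np.log(p / (q_dist + 1e-12) + 1e-12)))
```

Unchanged activations incur a near-zero penalty (only quantization error), while drifted activations raise the KL term, which is what keeps "old knowledge from drifting" during tuning.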

One could say that while fine-tuning is like "reshaping cognition" for the model, KNT is like "precisely tuning" key neurons—it fixes the problem at hand without disrupting the overall rhythm.

Experiments: 10% Lead in Accuracy After 1,000 Edits, Compatible with Multiple Models

To verify the effectiveness of CASE, the team conducted comparative experiments on two core tasks. Base models included LLaMA2-7B, Qwen2.5-7B, and LLaMA3-8B-Instruct, while the compared methods covered mainstream lifelong editing frameworks such as GRACE, WISE, and MEMIT.

Experimental setup and baseline comparison table

1. QA Task (ZsRE Dataset): No "Drop-off" After 1,000 Edits

In the ZsRE lifelong knowledge editing task requiring continuous updates to entity relationships:

  • At 100 edits, CASE's editing accuracy on LLaMA2-7B was 5 percentage points higher than the second-best method, with locality (unrelated knowledge preservation rate) reaching 100%.

  • After 1,000 edits, while most existing methods saw a significant drop in accuracy (e.g., WISE dropped from 90% to 77%), CASE maintained 95% accuracy. This is 10% higher than the second-best method and only a 3% drop from the 100-edit mark—achieving virtually "no forgetting after a thousand edits."

Notably, while GRACE maintains high accuracy, its generalization is extremely poor (only 26%), meaning it can only memorize entity relationships by rote. In contrast, CASE achieves 82% generalization, capable of handling unseen similar questions.

2. Hallucination Correction (SelfCheckGPT Dataset): Perplexity Reduced by 60%

In tasks correcting the model's "nonsensical outputs," CASE performed even more prominently:

  • On LLaMA2-7B, after 1,000 edits, CASE's perplexity (a measure of how well the model predicts text, where lower is better) dropped from 3.12 to 1.22, 60% lower than the second-best method.

  • On Qwen2.5-7B, while other methods saw perplexity skyrocket due to accumulated conflicts, CASE was the only method able to stably maintain low perplexity.

3. Efficiency Advantage: Fewer Parameters, Faster Inference

CASE's parameter efficiency far exceeds similar methods: additional parameters are less than 1MB (WISE requires 86MB), and inference time per iteration is only 10.72 seconds, almost no different from an unedited model. This means it can be easily deployed in real-world scenarios.

Efficiency comparison chart

Analysis Experiments: Stability of CASE Under Different Settings

The team tested the stability of CASE under different parameter settings. Overall, CASE maintains stable editing performance across a range of hyperparameter values, adapting to scenario requirements without complex tuning.

Hyperparameter sensitivity analysis

From the experimental samples below, it can be seen that CASE only fails in extremely rare specific cases.

Failure case examples

Additional failure case visualization

As large models are deployed in finance, healthcare, law, and other fields, "continuous knowledge updating" has become a rigid demand: for instance, updates to medical guidelines, revisions of legal statutes, and changes in corporate information all require the model to keep up timely without losing previous professional knowledge.

Previously, such needs were met either by "full fine-tuning" (high cost, long cycle) or "RAG + prompts" (unstable effects). By breaking through lifelong model editing technology, CASE provides a potentially superior future solution:

  • No need to retrain the model; achieves lightweight updates through "conflict quantification allocation + sensitive neuron tuning."

  • Supports thousands of continuous edits, suitable for large models serving long-term.

  • Compatible with mainstream open-source LLMs (LLaMA, Qwen, etc.), with low migration costs.

The team stated that they will further explore the application of CASE in multi-modal models and unstructured data editing, enabling the "lifelong learning" capabilities of large models to cover more scenarios.



AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.