Google and Cornell's New Research: The Next Step for LLMs Is Learning How to 'Sleep' Well

After deployment, large language models (LLMs) typically enter a near-"frozen" state. They can execute tasks they mastered during pre-training, but they struggle to keep absorbing new knowledge the way humans do. This creates a long-standing dilemma:

If the model stops learning, its knowledge gradually becomes obsolete. If its parameters are continuously fine-tuned, it often suffers "catastrophic forgetting," where learning new abilities weakens or overwrites existing ones. Retraining from scratch with massive pre-training, meanwhile, is prohibitively expensive in both compute and engineering effort.

While in-context learning (ICL) offers a lighter-weight way to adapt, it is fundamentally limited by the context window: once a session ends, the model's "memory" of what was discussed disappears. This resembles anterograde amnesia in humans, where patients retain old long-term memories but cannot form new ones, so every moment feels like it is happening for the first time. Current Transformer-based LLMs exhibit a similar trait: knowledge is either crystallized in the pre-trained parameters or exists transiently in the activations of the current context, with no stable mechanism connecting the two.

To address this, a research team from Google and Cornell University has proposed the Sleep paradigm. This is a continuous learning framework inspired by human sleep mechanisms, designed to let the model gradually accumulate and integrate new knowledge without destroying its existing capabilities.

Paper link: https://arxiv.org/pdf/2606.03979

According to the paper, Sleep consists of two phases: Memory Consolidation (analogous to slow-wave sleep, a stage of non-REM sleep) and Dreaming (analogous to rapid eye movement sleep, or REM).

Experimental results show that the Sleep paradigm delivers consistent performance improvements across tasks including long-context understanding, knowledge integration, few-shot reasoning, and continuous learning.

The Sleep Paradigm: Redefining Continuous Learning for LLMs

The Sleep paradigm's starting point is to redefine the lifecycle of continuous learning. In traditional machine learning frameworks, a model's lifecycle is strictly divided into a training phase and a testing phase. In a continuous learning scenario, this boundary does not exist. The model is always learning, but the way it learns alternates between two modes:

Active (Wake) Phase: The model receives external input, performs inference or in-context learning. Knowledge is temporarily stored in the Attention module and high-frequency MLP layers in a short-term, rapidly updating manner.
Sleep Phase: The model stops accepting new external data and instead focuses on consolidating internal knowledge and self-improvement. Sleep is not a passive idle state but a highly dynamic computational process.

The research team further breaks down the Sleep process into two sub-phases, corresponding to the different functions of slow-wave sleep and REM sleep in the human brain.

Figure: Traditional machine learning (separate training and testing phases) versus continuous learning (alternating Wake and Sleep phases).

1. Memory Consolidation: Parameter Expansion and Knowledge Seeding

The core goal of the memory consolidation phase is to transfer short-term, fragile memories stored in high-frequency (rapidly updated) modules into more stable, low-frequency parameters, while preventing interference between the two types of knowledge.

Why does direct migration lead to forgetting? One of the root causes of catastrophic forgetting is limited parameter capacity—writing new knowledge inevitably overwrites old knowledge. Inspired by the neuroplasticity of the human brain, the research team designed a progressive parameter expansion mechanism:

In each Sleep step, a new low-rank expert module (parameterized by two low-dimensional matrices) is added to the low-frequency MLP block receiving the knowledge (represented by an MoE structure). This module is dedicated exclusively to storing the new knowledge to be transferred. The parameters of existing experts are completely frozen during this process, ensuring that old knowledge remains undisturbed.

After Sleep ends, the low-rank parameters previously added in the high-frequency block are reset and cleared, freeing up capacity for future use. This step is highly analogous to synaptic pruning in the human brain, where redundant connections are actively deleted after memory consolidation to improve efficiency.
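The expand-then-freeze bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `LowRankExpert` and `ExpandableBlock`, the `frozen` flag, and the zero initialization of the second matrix (borrowed from common LoRA practice) are all hypothetical.

```python
import random


class LowRankExpert:
    """Rank-r expert computing x -> B(Ax) from two small matrices (hypothetical name)."""

    def __init__(self, dim, rank, rng):
        # A is rank x dim with small random values; B is dim x rank and starts at
        # zero, so a new expert contributes nothing until trained (an assumption).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(dim)]
        self.frozen = False

    def forward(self, x):
        hidden = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        return [sum(b * h for b, h in zip(row, hidden)) for row in self.B]

    def reset(self):
        # Clear the learned update, as done for the high-frequency block after Sleep.
        self.B = [[0.0] * len(row) for row in self.B]


class ExpandableBlock:
    """Low-frequency block that grows by one expert per Sleep step (hypothetical)."""

    def __init__(self, dim, rank, seed=0):
        self.dim, self.rank = dim, rank
        self.rng = random.Random(seed)
        self.experts = []

    def begin_sleep_step(self):
        # Freeze every existing expert so old knowledge cannot be overwritten.
        for expert in self.experts:
            expert.frozen = True
        # Add one fresh trainable expert to receive the new knowledge.
        new_expert = LowRankExpert(self.dim, self.rank, self.rng)
        self.experts.append(new_expert)
        return new_expert

    def trainable_experts(self):
        return [e for e in self.experts if not e.frozen]


low_freq = ExpandableBlock(dim=4, rank=2)
first = low_freq.begin_sleep_step()
second = low_freq.begin_sleep_step()

high_freq_adapter = LowRankExpert(dim=4, rank=2, rng=random.Random(1))
high_freq_adapter.B[0][0] = 0.5  # pretend the Wake phase wrote something here
high_freq_adapter.reset()        # cleared once consolidation is done
```

After two Sleep steps, only the most recently added expert remains trainable, which is the property the article attributes to the design: capacity grows, and earlier experts are never touched again.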

After parameter expansion is complete, the memory transfer itself is achieved through Knowledge Seeding (upward distillation). Unlike conventional knowledge distillation, which moves from a larger teacher to a smaller student, Knowledge Seeding distills from a smaller "teacher" model (the state of the current high-frequency module) to a larger "student" model (the expanded low-frequency module).

This design faces two specific challenges: First, the student has a stronger expressive capacity than the teacher, so training directly on teacher-generated data could lead to suboptimal use of the student's parameters. Second, the Sleep phase cannot, in principle, access external datasets, making the assumptions of mainstream distillation methods invalid.

To overcome this, the research team introduced a reinforcement learning-based imitation learning process called Learning to Imitate (LTI) on top of the Generalized Knowledge Distillation (GKD) framework. The Knowledge Seeding objective has two components:

On-policy distillation: The student receives token-level feedback from the teacher's logits on its own generated sequences, ensuring direct knowledge transfer.
LTI: The teacher generates a batch of synthetic text ("dream data"). Each text is truncated at a random point, and the student is asked to continue from the remaining prefix. The reward is a weighted combination of the semantic similarity between the student's output and the teacher's original text (scored by a frozen reward model) and their edit distance (Levenshtein distance).
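To make the on-policy distillation component concrete, here is a toy sketch. It assumes the per-token feedback is a reverse KL divergence between the student's and teacher's next-token distributions, which is one of the divergences GKD-style training can use; the article does not say which one the paper chooses. The function names and the `student_step` / `teacher_step` callables (prefix in, logits out) are hypothetical.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over one next-token distribution."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def on_policy_distill_loss(student_step, teacher_step, prompt, length, rng):
    """Average per-token divergence on a sequence sampled from the STUDENT."""
    sequence = list(prompt)
    total = 0.0
    for _ in range(length):
        s_logits = student_step(sequence)
        t_logits = teacher_step(sequence)
        total += reverse_kl(s_logits, t_logits)
        # The next token comes from the student, which is what makes this on-policy.
        probs = softmax(s_logits)
        sequence.append(rng.choices(range(len(probs)), weights=probs)[0])
    return total / length


# Toy "models" over a 3-token vocabulary that ignore the prefix.
teacher = lambda seq: [2.0, 0.0, -1.0]
uninformed_student = lambda seq: [0.0, 0.0, 0.0]

loss_matched = on_policy_distill_loss(teacher, teacher, [0], 5, random.Random(0))
loss_mismatch = on_policy_distill_loss(uninformed_student, teacher, [0], 5, random.Random(0))
```

The loss is zero when the student already matches the teacher and positive otherwise, so minimizing it pulls the student toward the teacher on exactly the sequences the student tends to produce.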

The role of LTI is this: having knowledge alone is not enough; the student must also learn how to use that knowledge just as the teacher does.
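The LTI reward can be sketched as follows. The Levenshtein distance is the standard dynamic-programming algorithm; everything else is an assumption made for illustration: the 0.5 weight, the normalization of edit distance into a similarity in [0, 1], the word-level truncation, and the `token_overlap` function, which merely stands in for the frozen reward model mentioned in the article.

```python
import random


def levenshtein(a, b):
    """Classic edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def make_completion_task(teacher_text, rng):
    """Cut a teacher-generated text at a random word boundary (hypothetical helper)."""
    tokens = teacher_text.split()
    cut = rng.randint(1, max(1, len(tokens) - 1))
    return " ".join(tokens[:cut]), " ".join(tokens[cut:])


def token_overlap(x, y):
    """Stand-in for a frozen reward model: Jaccard overlap of word sets."""
    sx, sy = set(x.split()), set(y.split())
    return len(sx & sy) / len(sx | sy) if (sx | sy) else 1.0


def lti_reward(student_text, teacher_text, semantic_score, weight=0.5):
    """Weighted mix of semantic similarity and normalized edit similarity."""
    longest = max(len(student_text), len(teacher_text), 1)
    edit_similarity = 1.0 - levenshtein(student_text, teacher_text) / longest
    return weight * semantic_score(student_text, teacher_text) + (1 - weight) * edit_similarity


dream = "the cat sat on the mat"
prefix, target = make_completion_task(dream, random.Random(0))
perfect = lti_reward(target, target, token_overlap)
poor = lti_reward("a dog", target, token_overlap)
```

A student continuation identical to the teacher's scores 1.0, and the reward falls as the continuation drifts in either wording or meaning.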

2. Dreaming: RL-Driven Self-Improvement

After memory consolidation, Sleep enters its second phase, Dreaming, which corresponds to the process in human REM sleep where the brain actively synthesizes new connections. The goal of this phase is to recursively improve the model's own abilities using self-generated synthetic data, without introducing manual annotations.

How is synthetic data generated? Given a sampling task (containing context C and an evaluation metric τ), the model, during MoE routing, deliberately selects an additional random expert to participate in the computation. This design intentionally introduces noise from irrelevant knowledge, with the goal of mimicking the creative mixing of memories during dreams, allowing the model to explore knowledge combinations it would not normally activate. This process generates m candidate "dream" samples.
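A minimal sketch of this noisy routing step, assuming standard top-k MoE routing and assuming the extra expert is drawn uniformly from those outside the top-k (the article only says an additional random expert is selected). The function name `route_with_dream_noise` is hypothetical.

```python
import random


def route_with_dream_noise(router_scores, top_k, rng):
    """Return the usual top-k expert indices plus one random expert outside them."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    others = ranked[top_k:]
    if others:
        # The deliberately "irrelevant" expert that injects dream-like noise.
        chosen.append(rng.choice(others))
    return chosen


scores = [0.10, 0.70, 0.05, 0.15]  # router scores for four experts
experts = route_with_dream_noise(scores, top_k=2, rng=random.Random(0))
```

With these scores, experts 1 and 3 are always selected, and the third slot is filled by expert 0 or 2 depending on the random draw.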

How are valuable dreams selected? The research team introduced a gradient-based importance score: for each dream sample, the gradient norm of the language modeling objective with respect to the current parameters is computed and used as a proxy for that sample's potential to improve the model. The top-k highest-scoring samples, plus a few random samples (to maintain diversity), form the final training set.

For each selected dream, the team performs supervised fine-tuning with LoRA on an independent model instance. If the fine-tuned model improves on a downstream task, the corresponding dream receives a positive reward, and the entire generation process is optimized using the ReST^EM algorithm. Compared to the original design of SEAL (Self-Adapting Language Models), the research team made targeted changes in two areas, the sampling strategy (random expert routing) and sample filtering (gradient-based scoring), to control the risk of catastrophic forgetting triggered by iterative self-training.
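The gradient-norm filter can be illustrated with a toy model. The sketch below replaces the LLM with a unigram softmax model, for which the gradient of the mean negative log-likelihood with respect to the logits has a closed form (predicted probabilities minus empirical token frequencies). The toy model, the length normalization, and the function names are assumptions. Only the overall recipe of top-k by gradient norm plus a few random samples comes from the article.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_norm(logits, token_ids):
    """L2 norm of d(mean NLL)/d(logits) for a toy unigram model."""
    probs = softmax(logits)
    n = max(len(token_ids), 1)
    grad = list(probs)
    for t in token_ids:
        grad[t] -= 1.0 / n
    return math.sqrt(sum(g * g for g in grad))


def select_dreams(dreams, logits, top_k, n_random, rng):
    """Indices of the top-k dreams by gradient norm, plus a few random extras."""
    ranked = sorted(range(len(dreams)), key=lambda i: grad_norm(logits, dreams[i]), reverse=True)
    top = ranked[:top_k]
    rest = ranked[top_k:]
    extras = rng.sample(rest, min(n_random, len(rest)))  # kept for diversity
    return top + extras


logits = [2.0, 0.0, 0.0]  # the toy model strongly expects token 0
dreams = [[0, 0, 0], [1, 1, 1], [0, 1, 2], [2, 2, 2]]
selected = select_dreams(dreams, logits, top_k=2, n_random=1, rng=random.Random(0))
```

Dreams made of tokens the model does not expect (indices 1 and 3) yield the largest gradients and are kept first, while the dream the model already predicts well (index 0) scores lowest and survives only if the random draw picks it.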

Experimental Results

In their empirical evaluation, the research team systematically analyzed the independent contributions of each phase of the Sleep paradigm, as well as the overall gains from multi-phase synergy. The specific results are as follows:

In class-incremental learning tasks, the team used three intent classification datasets—CLINC, Banking, and DBpedia—with Llama-3B and Llama3-8B as backbone models. They compared ICL (no Sleep), Elastic Weight Consolidation (EWC), an external continuous learner (InCA), and the Hope baseline, which lacks an explicit distillation mechanism.

The results showed that Hope, which incorporates Sleep, achieved the highest accuracy across all three datasets. Compared to ICL, which relies solely on prompt-level adaptation, Sleep could transform temporary adaptive abilities from the context into persistent parametric memories. Compared to the Hope baseline without explicit distillation, the addition of a self-distillation mechanism further improved the quality of the abstracted knowledge extracted by the model.

In experiments on the impact of memory hierarchy on ICL, the team used three long-context benchmarks—MK-NIAH (from RULER), LongHealth, and QASPER—to systematically investigate the effect of the number of Sleep phases (i.e., the depth of the memory hierarchy) in Hope on model performance, comparing against methods like ICL, DuoAttention, and Cartridges.

Two consistent patterns emerged. First, performance on long-context tasks improved steadily as the number of consolidation phases increased. Second, raising the update frequency of the lowest-frequency memory module to make it more adaptable actually lowered overall performance, which suggests that the core value of persistent memory lies in its stability. Overall, Hope outperformed both DuoAttention and Cartridges on all three tasks.

In an experiment on sequentially acquiring two new languages, the team used two low-resource translation datasets, MTOB and Manchu, to have the model learn two new languages it had never seen in pre-training, one after the other. They then evaluated both language abilities simultaneously in the final stage.

The results showed that standard ICL's translation ability for the first language collapsed after it learned the second, nearly regressing to the pre-training baseline. Hope-3 (with three levels of Sleep), by contrast, preserved the vast majority of the acquired abilities, with its chrF score after continuous learning close to that of independent single-language training. Cartridges and supervised fine-tuning (SFT) also suffered catastrophic forgetting on at least one language, so their results fell outside the effective range of the paper's main chart.

On the BABILong ultra-long-context reasoning benchmark, with contexts of up to 10 million tokens, Hope achieved a near-perfect score. By comparison, GPT-4 and GPT-4o-mini saw their performance drop rapidly once the context length exceeded 128K to 256K tokens. Llama-8B + RAG similarly degraded noticeably as the context grew. Comparable small models like Titans and ARMT showed significant performance deterioration beyond 1 million tokens.

On mathematical reasoning tasks, the team used Qwen3-1.7B and Qwen3-8B as base models and compared training methods like SFT and GRPO on three math competition benchmarks: AIME-24, AIME-25, and HMMT-25. The results showed that Sleep with Qwen3-8B achieved a score of 79.2 on AIME-24, surpassing OPSD's 76.6 and GRPO's 76.4. With Qwen3-1.7B, it also achieved a score of 53.2, higher than GRPO's 51.0.

In a knowledge fusion experiment, the team used the SQuAD dataset to evaluate the model's ability to internalize new knowledge into its parameters under closed-book question-answering conditions. In a single-paragraph setting (n=1), Sleep (with four levels of memory) achieved 48.9; in a continuous pre-training setting (n=200, corresponding to 974 related questions), it achieved 46.2, both outperforming SEAL's scores of 46.7 and 43.2 respectively. Further ablation studies showed that removing the Dreaming phase caused accuracy in the single-paragraph scenario to plummet from 48.1 to 35.7, indicating that the self-improvement phase plays a crucial role in knowledge internalization.

In a few-shot abstract reasoning experiment, the team used Llama-3.2-1B as a backbone model and evaluated on 11 screened training tasks and 8 held-out tasks. Ultimately, Sleep achieved a success rate of 80%, significantly exceeding SEAL (72.5%), TTT (which only performed synthetic updates without Dreaming, achieving 10%), and ICL (0%).

Limitations and Future Directions

Naturally, this research still has some limitations.

The first concerns efficiency. According to the paper, for the same number of training steps, SFT runs about 4 times faster than Sleep. However, if the goal is equivalent performance, the situation reverses: SFT requires roughly 3.6 to 4.8 times more wall-clock time to catch up to Sleep. Even so, Sleep's overall computational overhead remains significantly higher than that of standard baseline methods, which limits its practicality in scenarios that emphasize rapid iteration and low-cost deployment.

Secondly, the researchers also point out that iterative self-training, if not controlled properly, can itself induce catastrophic forgetting. This is a key reason why the Dreaming phase introduced a gradient-based sample filtering mechanism and a random expert routing strategy. However, the stability of this mechanism over long-term cycles still lacks systematic validation. For instance, the paper does not provide sufficient experimental results on whether the model can continue to stably suppress forgetting and maintain the consistency of its knowledge structure after dozens of Sleep cycles.

Meanwhile, the current solution has a strong dependency on the MoE architecture. Designs for parameter expansion, memory isolation, and multi-level update frequency control are all built upon a sparse mixture-of-experts structure. The paper does not delve into how Sleep could be equivalently adapted for traditional dense models that do not support expert routing.

More importantly, the Sleep paradigm points to a larger question: An LLM's lifecycle perhaps should not end at the conclusion of pre-training.

The human brain continuously reconstructs memories during sleep, gradually consolidating scattered, short-term experiences into stable, hierarchical long-term knowledge. What Sleep attempts is precisely to transfer this mechanism into a model's parameter system, providing LLMs with a continuous learning path that requires no extra manual annotation while avoiding capability degradation as much as possible.

As key issues related to parameter capacity management, distillation stability, and multi-frequency memory scheduling are further advanced, models with the capability for periodic self-integration may become a crucial foundational component for the next generation of long-lifecycle AI systems.

For more technical details, please refer to the original paper.