We have grown accustomed to having large models answer questions, generate code, and write reports. But when they are truly expected to serve as "intimate personal assistants," the hardest task turns out to be the most mundane one—remembering.
They need to remember the allergy medication you mentioned three months ago, the new city you moved to last week, the 30-page project report you handed them yesterday...
To tackle this challenge, a research team from Microsoft Research Asia, Microsoft AI, and Renmin University of China has proposed RHELM (Realistic, Heterogeneous, and Evolving Long-term Memory), a new benchmark designed specifically to evaluate the realistic, heterogeneous, and dynamic long-term memory capabilities of large models. Unlike previous static, piecemeal test sets, RHELM is the first to simulate a dynamic virtual life trajectory spanning a full year, constructing a highly realistic "memory examination hall" for large models. The paper and data are now fully open-sourced.
Relevant links are listed at the end of the article.
Over the past few years, the industry has seen the emergence of multiple long-term memory benchmarks such as LongMemEval, LoCoMo, PerLTQA, and PersonaMem. However, these traditional benchmarks universally suffer from "three structural defects":
First is semantic incoherence and overly "flat" personas. To artificially extend context, many benchmarks forcibly stitch together unrelated dialogue fragments. Such "long conversations" are semantically fractured, and the user profiles behind them are merely a few static labels, failing to capture the basic fact that "a person changes gradually over time."
Second is a single information source: dialogue only. In real-world scenarios, an AI assistant faces not just chat logs but also emails, daily reports, project documents, personal diaries, and other structurally diverse texts. These "non-dialogue" materials have higher information density and are closer to real workflows, yet the vast majority of existing benchmarks remain stuck in a "pure chat" setting.
Third is overly "honest" evaluation questions. Existing evaluation questions are mostly "needle-in-a-haystack" fact extraction; as long as the model can find the answer in the history, it passes. But real users often make requests that conflict with their own state—for example, asking for weekend cycling routes while a leg injury hasn't healed, or asking about restaurants near the old house after just moving. A truly "memory-capable" assistant should proactively identify such implicit conflicts rather than mechanically complying.
RHELM was designed to address all three problems at once. Its core idea can be summarized as: first create the person, then create the life, and only then create the dialogues and documents. The entire data construction process revolves around three pillars:
User Profile: Researchers defined six dimensions for each virtual user: identity, personality, traits, relationships, belongings, and current status. These dimensions cover the full spectrum "from internal psychology to external reality, from immutable characteristics to instantaneous states," and are stored in a strict JSON Schema to ensure the evolution process is structured and verifiable.
LOOP Module: This module dynamically simulates a user's real-life trajectory over a full year through a four-step loop of "Plan-Rollout-Evolve-Prune" (pLan-rOllout-evOlve-Prune).
Heterogeneous External Sources: At each key node of the life trajectory, researchers used Deep Research methods to synchronously generate matching emails, personal logs, and professional reports, ensuring that dialogues and documents are "time-aligned and content-consistent."
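To make the first pillar more concrete, here is a minimal sketch of how a six-dimension profile could be stored and checked before it evolves. The six dimension names come from the description above; the field types, the `validate_profile` helper, and the sample values are assumptions for illustration and are not taken from RHELM's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field

# The six dimensions named in the article. The value types chosen here
# are assumptions; the real RHELM JSON Schema is not shown in the text.
DIMENSIONS = (
    "identity",
    "personality",
    "traits",
    "relationships",
    "belongings",
    "current_status",
)


@dataclass
class UserProfile:
    identity: dict = field(default_factory=dict)
    personality: list = field(default_factory=list)
    traits: list = field(default_factory=list)
    relationships: dict = field(default_factory=dict)
    belongings: list = field(default_factory=list)
    current_status: dict = field(default_factory=dict)


def validate_profile(raw: dict) -> UserProfile:
    """Reject profiles with missing or unknown top-level dimensions."""
    missing = [name for name in DIMENSIONS if name not in raw]
    unknown = [name for name in raw if name not in DIMENSIONS]
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    return UserProfile(**raw)


def to_json(profile: UserProfile) -> str:
    """Serialize the profile so each evolution step can be diffed and audited."""
    return json.dumps(asdict(profile), sort_keys=True)
```

Keeping the top-level keys fixed is what makes later updates checkable: an evolution step that writes to an unknown dimension fails validation instead of silently drifting.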
Ultimately, RHELM contains 10 virtual users with diverse profiles, 11,764 dialogue turns, and 2,180 external materials. The context length for a single user reaches 500K–1M tokens, accompanied by 1,305 high-difficulty Q&A questions covering 7 major categories and 27 meticulously defined complex "memory" characteristics.
Figure 1: RHELM benchmark construction process.
Key Engine: How the LOOP Module "Raises" a "Flesh-and-Blood" Virtual Human
The LOOP module is the most distinctive design of RHELM. It ingeniously transforms the task of "generating long-term dialogue" into "simulating a person's real life":
Plan (pLan): The system generates a schedule based on the user profile, including short-term arrangements (social, daily, hobbies) and long-term plans (career progression, life milestones, major transitions).
Rollout (rOllout): For each event in the plan, the system probabilistically rolls out positive or negative outcomes. For instance, a planned cycling trip might complete successfully or be interrupted by a fall, and that "injury" will realistically affect activity scheduling for subsequent weeks.
Evolve (evOlve): Based on the day's rollout results, the system dynamically updates the user profile via JSON Schema function calls. Researchers split this into two parallel tracks: factual evolution (objective attributes like relationships, possessions) and state evolution (internal changes like preferences, habits), ensuring synchronization of external and internal updates.
Prune: The system periodically "recalibrates" the profile, actively removing outdated entities to prevent semantic drift and error accumulation during long-horizon evolution. After each pruning, a new LOOP cycle begins.
Figure 2: RHELM benchmark construction algorithm flow.
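The four steps can be read as a simple simulation loop. The sketch below is a toy version under stated assumptions: the weekly cycling plan, the 21-day recovery period, the 20% failure probability, and the weekly pruning interval are all invented for illustration, and plain Python functions stand in for the LLM-driven planning and JSON Schema function calls the article describes.

```python
import random


def plan(profile, day):
    """Plan: one event per day; a cycling trip every seventh day."""
    event = "cycling trip" if day % 7 == 6 else "work"
    # The current state constrains the schedule: no cycling while injured.
    if event == "cycling trip" and "injury" in profile["current_status"]:
        event = "rest"
    return [event]


def rollout(event, rng, p_negative):
    """Rollout: sample a positive or negative outcome for the event."""
    return "negative" if rng.random() < p_negative else "positive"


def evolve(profile, event, outcome, day):
    """Evolve: write the consequences of the outcome into the profile."""
    if event == "cycling trip" and outcome == "negative":
        profile["current_status"]["injury"] = {"since": day, "heals_on": day + 21}


def prune(profile, day):
    """Prune: drop entries that are no longer true, such as a healed injury."""
    injury = profile["current_status"].get("injury")
    if injury and injury["heals_on"] <= day:
        del profile["current_status"]["injury"]


def simulate(days=365, prune_every=7, p_negative=0.2, seed=0):
    rng = random.Random(seed)
    profile = {"current_status": {}}
    log = []
    for day in range(days):
        for event in plan(profile, day):
            outcome = rollout(event, rng, p_negative)
            evolve(profile, event, outcome, day)
            log.append((day, event, outcome))
        if (day + 1) % prune_every == 0:
            prune(profile, day)
    return profile, log
```

With `p_negative=1.0` the first cycling trip on day 6 always ends in an injury, the following three weekly trips are replaced by rest, and cycling resumes on day 34 once pruning has removed the healed injury.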
It is precisely this probabilistic, event-driven trajectory design that gives RHELM's data the contingency and long-tail nature of real life, which is exactly the part current models struggle with most. On top of each LOOP step, RHELM adds a layer of heterogeneous external source generation: daily events are converted into formal reports, personal diaries, and structured emails, and a Deep Research Agent then enriches the details, giving every document a complexity that is "realistic enough to deceive."
Seven Question Types × 27 Characteristics: Breaking Down "Memory Capability" to Ultra-Fine Granularity
RHELM breaks the assessment of "memory" into 7 question categories spanning two levels, "pure dialogue" and "heterogeneous source." The five dialogue categories are Fact, Temporal, Hallucination, Aggregation, and Misleading; the two heterogeneous source categories are Pure External Source QA and Cross-Source Mixed QA.
Figure 3: RHELM challenging question classification (attachments and emails both correspond to external source types).
To rigorously test models, each question is forcibly bound to at least one challenging characteristic (27 in total), including cross-day aggregation, cross-source alignment, vague reference, implicit state constraints, etc. This fine-grained labeling system enables subsequent analysis to pinpoint exactly "which type of detail the model stumbled on."
Most innovative are the Memory-Conditioned Misleading Queries. Researchers deliberately select key turning points in a user's life (chronic injury, relocation, career change) and design "trap requests" that conflict with the user's current state. For example, a user was advised by a doctor to stop running last month due to a knee injury, but this month asks the AI assistant for "weekend routes suitable for long-distance running." A truly long-term-memory-capable AI assistant should not simply comply; it should proactively trace history, identify the conflict, politely point out the issue, and offer alternatives that respect current constraints. This is an evaluation dimension almost untouched by previous benchmarks, and the core pain point RHELM aims to drive the industry to solve.
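The behavior expected from an assistant on such a query can be illustrated with a small conflict check. This is only a sketch of the idea, not RHELM's evaluation procedure: the `Constraint` record, the activity labels, and the response strings are hypothetical, and a real assistant would have to infer constraints from the dialogue history rather than receive them as structured data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    """A remembered state that rules out certain activities (hypothetical)."""
    description: str
    blocked_activities: frozenset
    start: date
    end: Optional[date] = None  # None means the constraint is still active


def active_conflicts(constraints, activity, today):
    """Return the remembered constraints that the request would violate today."""
    return [
        c for c in constraints
        if activity in c.blocked_activities
        and c.start <= today
        and (c.end is None or today <= c.end)
    ]


def respond(constraints, activity, today):
    """Comply when nothing conflicts; otherwise flag the conflict first."""
    conflicts = active_conflicts(constraints, activity, today)
    if not conflicts:
        return f"OK: suggesting options for {activity}."
    reasons = "; ".join(c.description for c in conflicts)
    return f"Conflict: {activity} clashes with: {reasons}. Offering alternatives."
```

For the knee injury example, a constraint recorded last month with no end date makes a request for long-distance running return the conflict message, while a request for swimming passes through unchanged.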
Researchers systematically evaluated three mainstream long-context and memory paradigms on RHELM:
Full-Context Models: GPT-4.1-mini, Gemini-2.5-Flash-Lite, Qwen2.5-14B-Instruct-1M—all natively support million-token contexts.
RAG (Retrieval-Augmented Generation): Built on bge-large-en-v1.5 + FAISS with top-k set to 5, 20, and 50; GPT-4.1, Gemini-2.5-Pro, and Claude Opus 4.5 were also tested as generators, along with BM25 + dense hybrid retrieval.
Memory Frameworks: Represented by MemGPT, Mem0, MemU, all using GPT-4.1-mini as the backbone model.
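For readers less familiar with the RAG setting, the retrieval step amounts to ranking stored memory chunks by similarity to the query and passing the top k to the generator. The sketch below shows that ranking step only. It deliberately swaps in a toy bag-of-words vector and a brute-force scan in place of bge-large-en-v1.5 and FAISS, so it illustrates the mechanics and not the quality of the evaluated setup; the function names and sample texts are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real dense encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k):
    """Return the k chunks most similar to the query (brute-force scan)."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]
```

Raising k widens what the generator sees, but every extra chunk is also a chance to include stale or conflicting evidence, which is relevant to the results that follow.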
However, the final results poured cold water on the entire industry and exposed the underlying bottlenecks in current large models' "long-term memory."
Industry Status: Overall Scores Low, Multi-Source Mixing a Universal Disaster Zone
Actual test data shows that large models' overall scores are universally low. The strongest performer, Claude Opus 4.5, achieved an average score of only 38.1 with external sources introduced, and 36.2 without. This indicates a long road ahead before large models become "reliable personal memory assistants."
Surprisingly, adding external sources is not always beneficial. Once heterogeneous materials like emails, logs, and reports are added to the context, RAG scores on standard question types actually drop (e.g., RAG@k=50 fell from 59.9 to 54.6, a loss of 5.3 points). This shows that existing retrieval mechanisms have not learned to fuse memories across heterogeneous sources. Consequently, cross-source mixed questions become a universal disaster zone: regardless of paradigm, scores fall most sharply on questions requiring joint reasoning over dialogue and external sources.
Meanwhile, hallucination and misleading questions ruthlessly expose weaknesses. For almost all methods, misleading question accuracy is below 5%; and the more evidence RAG retrieves, the lower the hallucination question score (dropping from 13.2 to 11.2).
In contrast, strong reasoning models show clear advantages. Claude Opus 4.5 and Gemini-2.5-Pro perform significantly better on hallucination and misleading dimensions. This suggests high-level reasoning capability helps models better identify and resist "seemingly plausible" false premises.
Table 1: RHELM performance evaluation results. Two evaluation settings (with/without external data sources) shown side by side. Metrics conceptually grouped into Dialogue History QA (FC: Fact, TP: Temporal, AG: Aggregation, HL: Hallucination, MI: Misleading), External Source QA (EX: Attachments and Emails), and Mixed Context QA (MX: Mixed). Best overall scores in bold, runner-up underlined.
Where Is the Problem? The Ceiling of Retrieval Recall
To further locate technical bottlenecks, researchers compared recall rates of bge-large-en-v1.5, bge-m3, all-MiniLM-L6-v2, and OpenAI series embeddings at different top-k values.
The conclusion is quite pessimistic: even with the retrieval budget relaxed to k=50, the recalled evidence remains limited, far from sufficient to support precise model answers. In other words, in RHELM's long-horizon, heterogeneous, dynamic corpus, traditional methods that rely solely on stacking vector retrieval can no longer meet the needs of real memory assistants. This finding points directly at the underlying architectural design of memory systems, rather than at simply swapping in a stronger embedding model.
Figure 4: Recall rate comparison of different embedding models under varying candidate counts.
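Recall at k, the quantity compared here, is the fraction of a question's gold evidence that appears among the top k retrieved items. A minimal sketch follows, using the standard definition; the function name and identifiers are illustrative and not taken from the RHELM evaluation code.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold evidence items found in the top-k retrieved results."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must contain at least one evidence item")
    hits = gold & set(ranked_ids[:k])
    return len(hits) / len(gold)
```

A question whose evidence is spread over several days or sources needs all of those pieces in the top k to reach full recall, which is why aggregation-style questions are sensitive to this ceiling.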
The Hardest 10 Characteristics: Where Exactly Do Models Get Stuck?
Researchers further extracted the "worst-performing 10 challenging characteristics" for fine-grained analysis. Results point to two clear "disaster zones":
One is Cross-Source Aggregation. This concentrates in cross-source mixed QA and aggregation questions, where models often confuse information sources or fail to effectively resolve mutually contradictory historical fragments. The other is Real-World Contextual Reasoning. This concentrates in misleading and hallucination questions, where models tend to fabricate non-existent facts or completely ignore the user's actual current state during reasoning.
Figure 5: Analysis of the 10 worst-performing challenging characteristics in RHELM. Models show significantly poor performance on features involving cross-source aggregation and real-world contextual reasoning.
In one sentence: today's memory-augmented models perform "not too bad" on "can they remember," but still hit a clear ceiling on "can they correctly use what they remember."
In retrospect, RHELM's release brings profound insights to the entire industry. As the first long-term memory benchmark to deeply couple dialogue streams with heterogeneous external sources, RHELM not only brings evaluation scenarios truly close to the real daily life of a "personal assistant," but also provides clear capability-dimension handles for subsequent research through 27 finely attributable challenging characteristics. More importantly, via systematic evaluation covering all three paradigms—full-context, RAG, and memory frameworks—it explicitly identifies the critical shortcomings of current SOTA models in cross-source aggregation and real-world contextual reasoning.
Researchers acknowledge that RHELM currently has limitations. For instance, the benchmark still focuses mainly on textual external sources, not yet covering complex multimodal data like video, audio, and tool calls; also, because profile seeds come from PersonaHub's elite subset, the dataset may have biases in professional and educational backgrounds. But these unfinished tasks leave clear room for extension by the open-source community.
If "context window" solves how far a large model "can see," then "long-term memory" determines how deeply it "understands you." RHELM breaks down "long-term memory" finely enough and makes it real enough—it is both a mirror reflecting the industry's current state and a roadmap pointing to the future.
What's worth anticipating next is not merely longer contexts or stronger retrieval algorithms, but a next-generation AI assistant that can truly, like a human, proactively accumulate, evolve, prune, and intelligently invoke "memory."
Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory
Paper link:
https://arxiv.org/abs/2605.31086
Project page:
https://microsoft.github.io/RHELM/
Evaluation code:
https://github.com/microsoft/RHELM
Evaluation dataset: