A Real Hack! New MIT Research: Zero Architecture Changes, Unlocking Million-Token Context for Large Models

Wen Le, from Aofei Temple
QbitAI | Official Account QbitAI

Let large models easily handle ultra-long texts that are two orders of magnitude longer than their own context windows!

The MIT CSAIL research team has proposed a new method for long-text processing, the Recursive Language Model (RLM), to tackle the problem of context decay.

Without modifying the model architecture or redesigning any modules, it lets top-tier models like GPT-5 and Qwen3 process ultra-long texts of millions of tokens at inference time.


The core idea: instead of feeding the prompt directly into the large model's context window, "outsource" it to an interactive Python environment, letting the model actively decompose the task and process it on demand through automated code writing and recursive calls.

Huh? Can large models also perform recursive operations when reading context?

Context Window Too Small? Still Able to Reason

First, let's talk about the painful problem of context decay.

No matter how large a model's claimed context window is, it runs into the same problem when processing ultra-long texts: the longer the text, the blurrier the model's memory of early information, and reasoning performance plummets.

This is like reading a million-word novel; by the time you reach the second half, you've long forgotten the key plot points from the first half.


Currently, mainstream solutions include context compression, Retrieval-Augmented Generation (RAG), or architectural-level optimization of the model.

For example, GPT-5.2-Codex uses native in-window context compression to retain full context during code-repository assistance tasks that run for weeks.

Likewise, natively integrating RAG into the enterprise editions of the GPT series, Claude, and Qwen has become an industry consensus.

An example of architectural-level optimization is the ring attention widely speculated to power Gemini 3.

RLM differs from all of these methods, which "brute-force" the model itself: it outsources context processing instead.


RLM sets up an interactive Python REPL environment for the model.

Before any processing, RLM launches the REPL and stores the ultra-long prompt as a string variable inside it.

Then the model writes code like a programmer, running keyword filtering, local exploration, and logical splitting over that text variable, trimming away irrelevant information through an interactive "write code, observe results" loop.

Next, the model decomposes the complex task into subtasks, recursively calling itself or lightweight sub-models on the split text segments; each subtask's output is stored as a new variable back in the REPL environment.

Finally, the root model writes code that reads and integrates all the subtask result variables, stitching them together logically or semantically into the final output.

The entire process is decided autonomously by the model: processing happens on demand, and the input text length is fully decoupled from the model's context window.
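The loop above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's implementation: `call_llm` is a hypothetical stand-in for a real model API, and the subtask is reduced to keyword counting so the sketch stays runnable.

```python
# Hedged sketch of an RLM-style loop: the long prompt lives as a string
# variable in a REPL-like environment, gets split into chunks, and each
# chunk is delegated to a (sub-)model call whose results are integrated.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a recursive sub-model call.
    Here it just counts keyword occurrences so the sketch is runnable."""
    keyword, _, text = prompt.partition("\n")
    return str(text.count(keyword))

def rlm_answer(long_prompt: str, keyword: str, chunk_size: int = 1000) -> int:
    # Step 1: the ultra-long prompt is stored as a plain string variable;
    # it never enters any model's context window in full.
    env = {"prompt": long_prompt}

    # Step 2: "write code, observe results" -- logically split the
    # variable into chunks small enough for a sub-model to handle.
    chunks = [env["prompt"][i:i + chunk_size]
              for i in range(0, len(env["prompt"]), chunk_size)]

    # Step 3: recursively delegate each chunk to a sub-model call and
    # store the subtask outputs as a new variable in the environment.
    env["partials"] = [int(call_llm(f"{keyword}\n{c}")) for c in chunks]

    # Step 4: integrate all subtask results into the final output.
    return sum(env["partials"])
```

Note that the real system lets the model itself decide how to split, filter, and recurse; this sketch hard-codes one fixed strategy purely to show the data flow.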


Experiments show that RLM effectively handles inputs exceeding 10 million tokens, two orders of magnitude beyond the native context windows of cutting-edge models like GPT-5.

On complex long-text tasks, RLM's advantage is also significant. On OOLONG-Pairs, which requires aggregating paired information and whose complexity grows quadratically, base GPT-5 and Qwen3-Coder score F1 below 0.1%.

With RLM, the two models reach F1 scores of 58.00% and 23.11%, respectively.

On the BrowseComp-Plus (1K) multi-document reasoning task, whose inputs span 6 to 11 million tokens, RLM (GPT-5) reaches 91.33% accuracy, well ahead of other long-text processing solutions.

Even on OOLONG, which requires a near-linear scan of almost all the information, RLM delivers double-digit performance gains.


In terms of invocation cost, at the 50th percentile RLM costs about the same as, or even less than, other long-text processing solutions.

This indicates that in most conventional task scenarios, RLM has a significant advantage in cost-effectiveness.

At high percentiles such as the 95th, however, RLM's cost surges.

This is mainly because RLM's reasoning process is dynamic: it decides how many code-writing steps, text splits, and recursive calls to make based on task complexity, and every extra step adds API calls.


One final key point: RLM is a general inference strategy that never touches the model architecture, so in theory any model can adopt it directly.

Paper: https://arxiv.org/abs/2512.24601
Reference: https://x.com/MatthewBerman/status/2012701592756383893
