Multimodal Video Streaming Inference Efficiency Boosted by 56%: Unveiling TWW's Segment-Level Dynamic Memory Mechanism

Paper Link: https://arxiv.org/abs/2603.11896
Project Link: https://github.com/wl666hhh/Think_While_Watching.git

Addressing the Pain Point: The "Perception-Generation" Mutex Lock in Streaming Video Inference

Although Multimodal Large Language Models (MLLMs) have achieved remarkable results on offline video understanding benchmarks, their performance often falls drastically short in continuous video stream scenarios such as live broadcast analysis, embodied intelligent robots, and real-time security surveillance. Current streaming large models generally adopt an "Interleaved Perception-Generation" paradigm: the model watches a segment of video, stops to generate text, and then watches the next segment.

Diagram illustrating the interleaved perception-generation paradigm

In my view, this design, which forcibly serializes perception and generation, suffers from two fatal flaws. The first is Memory Erosion. In multi-turn Q&A, subsequent questions often rely heavily on early visual cues, but interleaved text decoding interrupts the continuous modeling of long-range temporal features, causing the model to "forget the beginning by the time it reaches the end." The second is severe latency backlog. The paper's authors provide an elegant theoretical explanation using queuing theory in the appendix: assume video arrives at a rate of $\lambda$ and the model processes it at a rate of $\mu$ (load factor $\rho = \lambda/\mu$). During the non-preemptive decoding time $T_{gen}$, the system stops receiving video, leading to a backlog of $\lambda T_{gen}$. Even more alarming, because the backlog is drained at a net rate of only $\mu - \lambda$, catching up requires a catch-up time of:

$$T_{catch} = \frac{\lambda T_{gen}}{\mu - \lambda} = \frac{\rho}{1-\rho}\, T_{gen}$$
This means that as the load factor approaches full capacity, even a generation pause of just a few seconds can trigger a cascading system-wide latency collapse. This "perception-generation" mutex lock is the biggest stumbling block preventing multimodal large models from entering real-world online scenarios.
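The catch-up blow-up is easy to verify with a few lines of arithmetic (a minimal sketch; the rates below are illustrative, not measurements from the paper):

```python
# Catch-up time after a non-preemptive decoding pause: a backlog of
# lambda * T_gen accumulates, then drains at net rate (mu - lambda),
# which is equivalent to rho / (1 - rho) * T_gen with rho = lambda / mu.
def catch_up_time(lam: float, mu: float, t_gen: float) -> float:
    """Seconds needed to clear the backlog accumulated during t_gen."""
    assert mu > lam > 0, "system must be stable (rho < 1)"
    return lam * t_gen / (mu - lam)

# The same 2-second generation pause at moderate vs. near-saturated load:
print(catch_up_time(lam=10, mu=20, t_gen=2.0))  # rho = 0.50 -> 2.0 s
print(catch_up_time(lam=19, mu=20, t_gen=2.0))  # rho = 0.95 -> 38.0 s
```

A pause that costs 2 seconds of catch-up at half load costs 38 seconds at 95% load, which is exactly the cascading collapse the authors describe.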

TWW's Core Solution: Maintaining a Continuous Segment-Level Memory Stream

To break this dilemma, the paper proposes the Think While Watching (TWW) framework. TWW's core insight is that streaming multimodal inference should not be a one-time "read-and-burn" process, but should instead establish a Segment-Level Memory mechanism based on time anchors.

Architecture diagram of the TWW framework showing segment-level memory

Specifically, TWW abandons the crude approach traditional models take by treating the entire video history as an undifferentiated context. As the video stream continues to input, TWW runs silently in the background, actively generating "Memory Notes" for each incoming video segment. These notes extract and compress key entities, action states, and scene transitions within the current segment. When a user suddenly inserts multi-turn questions at any moment, the model does not need to re-traverse the massive original video tokens; instead, it directly calls these structured segment memories for Chain-of-Thought (CoT) reasoning. This mechanism is akin to a human taking knowledge snapshots in their mind while watching a long documentary, ensuring the coherence of long-range dependencies while significantly reducing the cognitive load during multi-turn dialogues.
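To make the idea concrete, here is a minimal sketch of a time-anchored segment memory store; the class and field names are my own illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    """Compressed summary of one video segment, keyed by its time anchor."""
    start_s: float            # segment start time (seconds)
    end_s: float              # segment end time
    entities: list[str]       # key entities observed in the segment
    actions: list[str]        # action states / scene transitions

@dataclass
class SegmentMemory:
    notes: list[MemoryNote] = field(default_factory=list)

    def write(self, note: MemoryNote) -> None:
        """Append a note as each segment is ingested (runs in the background)."""
        self.notes.append(note)

    def recall(self, until_s: float) -> list[MemoryNote]:
        """Return only notes whose segment ended at or before the question
        timestamp, enforcing the 'no peeking into the future' constraint."""
        return [n for n in self.notes if n.end_s <= until_s]

mem = SegmentMemory()
mem.write(MemoryNote(0.0, 8.0, ["goalkeeper"], ["warms up"]))
mem.write(MemoryNote(8.0, 16.0, ["striker"], ["takes corner kick"]))
print(len(mem.recall(until_s=10.0)))  # only the first segment is visible -> 1
```

At question time the model reasons over these compact notes instead of re-attending to every raw video token, which is where the token savings reported later come from.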

Bridging the Data Gap: A Three-Stage Synthetic Streaming CoT Instruction Set

With the architectural concept in place, the next challenge is that the open-source community has virtually no high-quality multi-turn dialogue datasets with "streaming memory annotations." To bridge this training-data gap, the authors used GPT-5.2 to synthesize a streaming CoT dataset spanning three stages, paired with a phase-matched progressive training strategy.

Illustration of the three-stage synthetic data generation process
  • Stage 1 (Short Video Single-Turn): Trains the model's ability to extract states and write memory notes for a single video segment.
  • Stage 2 (Short Video Multi-Turn): Cultivates consistency across multi-turn dialogues, strictly requiring the model to reuse previous memory notes when answering subsequent questions, and absolutely prohibiting peeking into future video segments that have not yet occurred.
  • Stage 3 (Long Video Complex Reasoning): Introduces YouTube long videos (such as tutorials or lectures lasting dozens of minutes), training the model's ability to recall clues over long ranges and handle uncertainty amidst massive amounts of interfering information.
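A Stage-2 sample, together with the causality check it must pass, might look like the following; the field names are hypothetical, not the paper's schema:

```python
# A hypothetical Stage-2 training sample: each question carries the
# timestamp at which it was asked and may only reference memory notes
# from segments that had already ended by then.
sample = {
    "segments": [
        {"id": 0, "end_s": 8.0,  "memory_note": "chef dices onions"},
        {"id": 1, "end_s": 16.0, "memory_note": "chef heats the pan"},
    ],
    "turns": [
        {"asked_at_s": 9.0,  "question": "What was prepared first?",
         "visible_segments": [0]},
        {"asked_at_s": 17.0, "question": "What happened after that?",
         "visible_segments": [0, 1]},  # reuses note 0, adds note 1
    ],
}

def causally_valid(sample: dict) -> bool:
    """Reject samples where a turn cites a segment ending after the question."""
    ends = {s["id"]: s["end_s"] for s in sample["segments"]}
    return all(ends[i] <= t["asked_at_s"]
               for t in sample["turns"] for i in t["visible_segments"])

print(causally_valid(sample))  # True
```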

Notably, during data synthesis and model training, TWW enforces extremely strict causality constraints. For an input stream containing $N$ video segments and $M$ questions, the model must generate exactly $M$ reasoning blocks. To prevent "peeking into the future" at the mechanism level, TWW introduces a Streaming Causal Mask and Streaming Rotary Positional Embeddings (Streaming RoPE), ensuring that each question query can attend only to visual content available up to the current timestamp.
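The masking rule reduces to a simple predicate over token availability times. Here is my own minimal construction (not the paper's code): a query may attend to a key only if the key was already available at the query's timestamp.

```python
import numpy as np

def streaming_causal_mask(timestamps: np.ndarray) -> np.ndarray:
    """timestamps: (L,) wall-clock availability time of each token
    (video frame or text). Returns an (L, L) boolean mask where
    mask[q, k] = True means query q may attend to key k."""
    t = np.asarray(timestamps, dtype=float)
    return t[None, :] <= t[:, None]  # key time <= query time

# Two video segments (t = 0..3 and t = 5..7) with a question asked at t = 4:
# the question sees all of segment 1 but none of segment 2's future frames.
ts = np.array([0, 1, 2, 3,   4, 4,   5, 6, 7], dtype=float)
#              seg 1 frames  question  seg 2 frames
mask = streaming_causal_mask(ts)
print(bool(mask[4, :4].all()))  # question sees segment 1      -> True
print(bool(mask[4, 6:].any()))  # question sees segment 2? No  -> False
```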

Inference Engineering Optimization: Dual KV Cache and Adaptive Attention

At the engineering implementation level, how can true "thinking while watching" be achieved under limited computing power? TWW provides a very elegant and practical engineering solution in its inference pipeline design: an adaptive pipeline with separated read and write operations.

Diagram showing the dual KV cache and adaptive attention backend

Through a Dual KV Cache mechanism, the system completely decouples the continuous ingestion of the video stream from the autoregressive decoding of text. The video processing thread and the text generation thread can execute concurrently, fundamentally eliminating the $\lambda T_{gen}$ latency backlog effect mentioned earlier.
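The decoupling can be sketched as two concurrent workers with separate caches; this toy version uses Python threads and lists as stand-ins for the actual KV caches and is purely illustrative:

```python
import queue
import threading
import time

# Toy sketch of the read/write-separated pipeline: one thread keeps
# ingesting video segments into a "visual KV cache" while another decodes
# text against its own "text KV cache", so generation never blocks
# ingestion (eliminating the lambda * T_gen backlog).
visual_kv, text_kv = [], []
segments = queue.Queue()

def ingest():
    """Video thread: encode arriving segments without waiting on decoding."""
    while True:
        seg = segments.get()
        if seg is None:          # sentinel: stream ended
            break
        visual_kv.append(f"kv({seg})")  # stand-in for prefilled KV entries

def decode(n_steps: int):
    """Text thread: autoregressive steps reading a snapshot of visual KV."""
    for step in range(n_steps):
        context_len = len(visual_kv)    # reads whatever has been ingested
        text_kv.append(f"tok{step}@ctx{context_len}")
        time.sleep(0.001)               # simulate per-token decode latency

t_in = threading.Thread(target=ingest)
t_in.start()
for s in range(5):
    segments.put(f"seg{s}")
decode(n_steps=3)                        # runs while ingestion continues
segments.put(None)
t_in.join()
print(len(visual_kv), len(text_kv))      # 5 segments ingested, 3 tokens decoded
```

In a real serving stack the two "caches" would be separate KV tensors and the ingestion side would run GPU prefill kernels, but the concurrency structure is the same.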

Even more interesting is the Adaptive Attention Backend the authors designed for the different generation stages. Under the streaming mask rules, the query length ($Q_{len}$) and key length ($K_{len}$) of the attention computation vary from call to call. TWW therefore routes dynamically: when prefilling source video features ($T_{prefill}$) or running standard single-step autoregressive decoding ($T_{decode}$), the system calls the highly optimized Flash Attention to pursue maximum throughput; however, when it encounters the streaming Q&A phase, which requires irregular custom causal masks, the system seamlessly switches to Memory-Efficient Attention. This context-aware low-level scheduling ensures that streaming inference satisfies strict temporal causal logic without sacrificing peak inference speed.
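The routing decision itself is a small shape-and-mask dispatch. A hypothetical sketch (the backend names and conditions are my own illustration; the paper's dispatcher may differ):

```python
# Route each attention call to a backend: Flash Attention handles the
# regular prefill (Q_len == K_len) and single-step decode (Q_len == 1)
# shapes, while calls carrying an irregular streaming mask fall back to
# a memory-efficient kernel that accepts arbitrary boolean masks.
def pick_attention_backend(q_len: int, k_len: int,
                           has_custom_mask: bool) -> str:
    if has_custom_mask:
        return "memory_efficient"   # irregular streaming causal mask
    if q_len == k_len:              # prefill: full self-attention block
        return "flash"
    if q_len == 1:                  # single-step autoregressive decode
        return "flash"
    return "memory_efficient"       # any other irregular shape

print(pick_attention_backend(1024, 1024, False))  # prefill        -> flash
print(pick_attention_backend(1, 2048, False))     # decode step    -> flash
print(pick_attention_backend(32, 2048, True))     # streaming Q&A  -> memory_efficient
```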

Stunning Data Performance: Halving Token Consumption While Boosting Offline Capabilities

Experimental results fully demonstrate the effectiveness of the TWW architecture. In tests based on Qwen3-VL (4B), under a single-turn streaming setting, TWW improved accuracy on StreamingBench from 58.52% to 60.04%, and on OVO-Bench, which focuses on real-world video understanding, it jumped from 50.70% to 55.02%.

Chart comparing accuracy metrics between TWW and baseline models

However, I believe the most inspiring data appears under the multi-turn dialogue protocol. While maintaining or even slightly increasing accuracy, TWW, leveraging its segment memory reuse, reduced the average number of generated tokens by a staggering 56% (with a 45.8% reduction on OVO-Bench as well). In online businesses where latency and computational cost are extremely sensitive, halving the computational overhead without compromising capability is an improvement of immense commercial value. Furthermore, although this paradigm is designed specifically for streaming scenarios, it still demonstrates strong zero-shot generalization on offline long-video benchmarks (such as Video-MME and LV-Bench), suggesting that the "think while watching" mechanism naturally strengthens the model's internal capability to handle extremely long contexts.

Graph showing token consumption reduction and performance metrics

Technical Inspiration and Limitations: The Real Challenge of Streaming Intelligence Lies in "Timing"

Although TWW demonstrates the huge potential of streaming multimodal inference, its current limitations also point the way for future research. The paper candidly presents several typical failure cases in the appendix, such as fine-grained entity identity forgetting over ultra-long spans, and memory pollution when frequent scene jumps occur.

Examples of failure cases in long-span entity tracking

A deeper challenge lies in "Premature Commitment under incomplete evidence." In actual testing, when a question is asked halfway through an action (e.g., "The player is taking a corner kick"), the model often fails to make a "let the bullet fly" judgment (i.e., wait for more context) and instead commits too early to a definitive conclusion based on insufficient partial frames. This reminds us that true online intelligence requires not only understanding "what is happening" but also learning to judge "whether now is the best time to answer." Future work that introduces audio cues, or that adapts segmentation to the intensity of scene changes, could unleash even greater power in embodied intelligence and real-time assisted driving.

Visualization of premature commitment errors in action recognition

To summarize in one sentence: The ultimate goal of streaming inference is not to infinitely expand the context window, but to master a dynamic memory engine that watches, records, and thinks simultaneously.


Advanced Learning

👉 If you want to systematically master frontier technologies and applications of multimodal large models, I recommend my premium course.

📚 The course covers mainstream multimodal architectures, multimodal Agents, data construction, training workflows, evaluation, and hallucination analysis, accompanied by multiple practical projects: LLaVA, LLaVA-NeXT, Qwen3-VL, InternLM-XComposer (IXC), TimeSearch-R video understanding, etc., including algorithm explanations, model fine-tuning/inference, service deployment, and core source code analysis.

💡 The course is currently being updated. You can participate via my personal website or Bilibili classroom:

📺 Bilibili Classroom (Click "Read Original" in the bottom left to jump directly): https://www.bilibili.com/cheese/play/ss33184

🌐 Official Website Link (requires a VPN for access from within mainland China): https://www.tgltommy.com/p/multimodal-season-1

Course promotion banner
