Current mainstream video agents share a common hidden cost: regardless of question difficulty, they tend to sample as many frames as possible and parse them densely. Methods like VideoAgent, VideoTree, and DVD require viewing thousands of frames on LVBench—essentially a brute-force strategy that trades compute for accuracy. In practice this is extremely expensive: feeding an hour-long video at 1 FPS in full is an engineering nightmare in token consumption and latency. More critically, more frames do not mean more information: adjacent frames are highly redundant, and the clues that carry the answer often lie within just a few seconds. VideoSeek, a paper from AMD and the University of Rochester, proposes a solution aimed squarely at this fundamental contradiction.
Video Has a Logical Flow: This Is the Most Important Insight
The core insight of VideoSeek is not complex, yet it has been systematically overlooked by previous work: video content possesses a logical structure. Scene transitions, event sequences, and causal chains—these "video logical flows" are essentially free navigation maps. As long as a model can first establish a coarse-grained cognition of the video structure, it can predict which time segment is most likely to contain the answer, rather than blindly scanning from start to finish.
In terms of system design, this insight translates into three tools with progressive granularity: <overview> for global summarization, <skim> for coarse scanning and positioning of candidate segments, and <focus> for 1 FPS intensive reading of key short segments. In the think–act–observe loop, the agent decides at each step which tool to use based on existing evidence—not a preset coarse-to-fine pipeline, but a true on-demand invocation.
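To make the three granularities concrete, here is a minimal sketch of how each tool might pick frame timestamps. The tool names follow the paper, but the sampling scheme, the uniform-midpoint placement, and the focus length cap are illustrative assumptions, not the paper's implementation.

```python
def uniform_indices(start_s: float, end_s: float, n: int) -> list[float]:
    """Sparse sampling for overview/skim: n timestamps spread evenly
    over [start_s, end_s), placed at bin midpoints (an assumption)."""
    span = end_s - start_s
    return [start_s + span * (i + 0.5) / n for i in range(n)]

def focus_timestamps(start_s: float, end_s: float, max_len_s: float = 32.0) -> list[float]:
    """Dense 1 FPS reading for focus; max_len_s is a hypothetical cap
    on how long a focus segment may be (the paper limits this too,
    but the exact value here is a placeholder)."""
    end_s = min(end_s, start_s + max_len_s)
    return [float(t) for t in range(int(start_s), int(end_s))]
```

overview would call `uniform_indices` over the whole video, skim over one candidate segment, and focus switches to the dense 1 FPS path—same video, three very different frame costs.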
Engineering Details of Think–Act–Observe Are Worth a Close Look
Algorithmically, VideoSeek follows the ReAct pattern, with GPT-5 as the thinking LLM. In each round it outputs a reasoning chain plus a tool-invocation plan; after the tool executes, the observation is appended to the trajectory, and the loop runs for at most a fixed number of rounds.
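The loop above can be sketched as follows. This assumes a chat-style LLM client `llm(messages) -> str` that replies with a JSON tool plan, plus a tool registry dict; all names, the JSON plan format, and the round cap are illustrative assumptions, not the paper's code.

```python
import json

def run_agent(question, video, llm, tools, max_rounds=8):
    """Think–act–observe loop: the LLM plans one tool call per round,
    the observation is appended to the trajectory, and the loop ends
    when the LLM emits a final answer or the round budget runs out."""
    trajectory = [{"role": "user", "content": question}]
    for _ in range(max_rounds):
        reply = llm(trajectory)                      # think: reasoning + plan
        trajectory.append({"role": "assistant", "content": reply})
        plan = json.loads(reply)                     # e.g. {"tool": "skim", "args": {...}}
        if plan["tool"] == "answer":                 # enough evidence gathered
            return plan["args"]["text"]
        obs = tools[plan["tool"]](video, **plan["args"])       # act
        trajectory.append({"role": "user", "content": f"Observation: {obs}"})  # observe
    return None  # round budget exhausted without a final answer
```

The key property is that tool choice is decided per round from the accumulated observations, which is what makes this on-demand invocation rather than a fixed coarse-to-fine pipeline.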
There are several design aspects I believe deserve special attention:
Tool constraint design is very strict. The prompt explicitly stipulates that only one tool can be called per round; skim can only be used for segments exceeding a threshold length, and focus can only handle short segments. This hard constraint prevents the model from being "lazy" and skipping coarse-grained steps to go directly to focus, forcing it to maintain hierarchical reasoning.
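A simple way to picture these hard constraints is a per-call validator. The threshold values below are hypothetical placeholders, not the paper's numbers; what matters is the shape of the rule: one tool per round, skim only above a length threshold, focus only below one.

```python
from typing import Optional

SKIM_MIN_LEN_S = 60.0   # skim only for segments longer than this (assumed value)
FOCUS_MAX_LEN_S = 32.0  # focus only for segments up to this length (assumed value)

def validate_call(tool: str, seg_len_s: Optional[float]) -> bool:
    """Return True iff a single tool call respects the length constraints."""
    if tool == "overview":
        return True                      # whole-video summary, always allowed
    if tool == "skim":
        return seg_len_s is not None and seg_len_s > SKIM_MIN_LEN_S
    if tool == "focus":
        return seg_len_s is not None and seg_len_s <= FOCUS_MAX_LEN_S
    return False                         # unknown tool, reject
```

Rejecting a focus call on a long segment is exactly what forces the model to go through skim first instead of brute-forcing the dense read.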
The frame budget parameter α is adapted per benchmark: α=4 for LVBench, whose videos average 67 minutes, and α=2 for the other, shorter-video benchmarks. Each tool's budget is tied to α—overview and skim each sample a fixed number of frames per call, and focus is capped at a fixed segment length in seconds, all scaling with α. This unified scaling makes hyperparameter tuning intuitive.
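The unified scaling pattern can be illustrated with a small budget function. The base multipliers below are hypothetical, chosen only to show the pattern of deriving every per-tool budget from the single knob α; the paper's actual values differ.

```python
def frame_budgets(alpha: int, base_overview: int = 16,
                  base_skim: int = 8, base_focus_s: int = 16) -> dict:
    """Derive all per-tool budgets from one scale factor alpha.
    Base multipliers are illustrative placeholders, not the paper's values."""
    return {
        "overview_frames": base_overview * alpha,       # global summary budget
        "skim_frames_per_call": base_skim * alpha,      # per skim invocation
        "focus_max_seconds": base_focus_s * alpha,      # cap on a focus segment
    }
```

With this shape, moving from a short-video benchmark (α=2) to LVBench (α=4) doubles every budget at once, which is why tuning stays intuitive.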
The value of intermediate reasoning is verified separately. The paper designed a GPT-5 control group—feeding the frames selected by VideoSeek directly to GPT-5 (without the agent loop). The result was 3.8 points higher than full-frame GPT-5, but still 4.5 points lower than VideoSeek. This indicates that the benefits come from two parts: better frame selection + multi-round intermediate reasoning; both are indispensable.
Numbers Speak: Efficiency and Accuracy Win Simultaneously
On LVBench (103 hour-level videos, 1,549 questions), the subtitle-free version of VideoSeek reached 68.4% accuracy with an average of 92.3 frames, outperforming all compared video agents. With subtitles, it used only 27.2 frames to climb to 72.2%, while the second-best video agent needed roughly 8,000 frames or more—about a 300× reduction in frame count. That is not a minor optimization but a difference in order of magnitude.
Compared to the base model GPT-5 (384 frames, 60.1%), VideoSeek improved by 10.2 absolute points while saving 93% of the frames. Stable improvements were also observed on Video-MME long and LongVideoBench long.
Ablation experiments reveal the tools' importance ranking: removing overview drops performance by 13.3 points, removing skim by 6.0, and removing focus by 4.7. overview matters so much because without a global picture of the video's structure, the subsequent targeted search has nothing to anchor on.
The choice of thinking LLM also matters a great deal: switching to GPT-4.1 (a non-thinking model) dropped accuracy from 68.4% to 53.0%, with fewer average rounds (2.99 vs 4.42)—weaker reasoning models tend to stop early, their confidence badly outrunning their capability.
Clear Limitations, with Practical Takeaways for Engineering
VideoSeek works best for "videos with logical structure"—content like narrative videos, documentaries, and meeting recordings are naturally suitable. The paper also directly points out limitations: for scenarios like anomaly detection (where key evidence cannot be predicted in position through logical inference), this framework has limited effect.
One runtime caveat: although token consumption is low, the multi-round calls yield higher total latency (about 136 seconds) than a single GPT-5 call (66 seconds). The paper candidly notes that runtime is affected by network latency and other factors and so does not treat it as a reliable metric—a commendably honest stance, but engineers deploying in real-time scenarios will need to evaluate this themselves.
Overall, VideoSeek offers a clear design paradigm: replace brute-force frame sampling with structured tools plus a reasoning loop, rather than simply stacking context length. For engineers building video understanding systems, this tool-granularity split and the prompt design (complete prompts are available in the paper's appendix) are directly reusable references.
Advanced Learning
👉 If you want to systematically master frontier technologies and applications of multimodal large models, I recommend my premium course.
📚 The course covers mainstream multimodal architectures, multimodal Agents, data construction, training workflows, evaluation, and hallucination analysis, accompanied by multiple practical projects: LLaVA, LLaVA-NeXT, Qwen3-VL, InternLM-XComposer (IXC), TimeSearch-R video understanding, etc., including algorithm explanations, model fine-tuning/inference, service deployment, and core source code analysis.
💡 The course is currently being updated. You can participate in learning via my personal website or Bilibili classroom:
📺 Bilibili Classroom (click "Read Original Text" in the bottom left corner to jump directly): https://www.bilibili.com/cheese/play/ss33184
🌐 Official Website Link (a VPN may be required for access from mainland China): https://www.tgltommy.com/p/multimodal-season-1