Dataset: https://huggingface.co/datasets/Halcyon-Zhang/BrowseComp-V3
When Multimodal Large Language Models (MLLMs) connect to the internet, can they truly separate the wheat from the chaff in a sea of text and images, as humans can, and carry out deep investigative research?
In recent years, from GPT-4o to the latest GPT-5.2 and Gemini-3-Pro, foundation model capabilities have advanced by leaps and bounds. Bolstered by tool invocation, these models are rapidly evolving into Multimodal Browsing Agents. However, when we place these frontier models in real, open internet environments full of noise and cross-modal information, their actual performance often falls short of expectations, with clear limitations in deep reasoning and complex information integration.
Recently, a research team led by Peking University, in collaboration with top institutions including the Hong Kong University of Science and Technology (Guangzhou), Tsinghua University, and Huawei Cloud, jointly launched a new multimodal deep search benchmark: BrowseComp-V³, and simultaneously developed the general multimodal browsing agent framework OmniSeeker.
The experimental results are thought-provoking: on open-world multimodal deep search tasks, human experts achieved a success rate of 68.03%, while even the most powerful closed-source model, GPT-5.2, reached only 36.17%.
This work not only reveals the capability boundaries of current visual agents in complex environments but also points the way forward for future multimodal large models toward "slow thinking" and "long-horizon planning."
I. Why Do We Need a New Multimodal Search Benchmark?
Prior to this, early benchmarks represented by MM-BrowseComp and MMSearch-Plus had already introduced multi-hop designs and fine-grained visual reasoning for exploring visual agents, driving the initial development of the field.
However, if we turn our gaze to high-order search scenarios in the real world, existing evaluation systems still exhibit obvious limitations:
- Insufficient Task Complexity: Early benchmarks (such as MMSearch) were mostly limited to shallow retrieval within two hops, with visual information often appearing only in the initial stage. This is like giving AI an "open-book exam" with overly direct clues, failing to capture the real-world pain points of deep search, where text and images are interwoven and clues unfold layer by layer.
- Key Information Unretrievable by Tools: In some existing complex benchmarks, core evidence is often hidden in video frames or private documents not open to the public. This means that even if an agent's logic is correct, it may fail because the "tools simply cannot find it." This severely undermines the fairness and reproducibility of the benchmark.
- One-Dimensional Evaluation: The vast majority of existing work focuses only on whether the final answer is correct, ignoring the agent's behavioral trajectory during the multi-step search process. This "black-box" evaluation makes it difficult to diagnose exactly where a model stumbled: was it visual perception, information retrieval, or logical reasoning?
To break these bottlenecks, BrowseComp-V³ was born.
II. BrowseComp-V³: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
BrowseComp-V³ is a brand-new benchmark specifically designed to evaluate multimodal deep browsing and search capabilities. It contains 300 carefully curated, highly challenging high-order questions spanning five major domains: Science, Technology, Society, Culture, and Life (covering 24 sub-domains). The core design philosophy of this benchmark can be summarized by three major principles:
1. Multi-dimensional Cross-modal Coverage (Real Complex Reasoning Depth)
To mirror real-world conditions, BrowseComp-V³ raises difficulty along two dimensions. First, it extends the search path through multi-hop variants. Second, the team categorized the complexity of cross-modal interaction into three progressive levels:
- Level 1 (Intra-region Alignment): Focuses on fine-grained text-image alignment capabilities within local regions.
- Level 2 (Cross-region Integration): Requires the model to comprehensively process and piece together visual and textual information distributed across different sections within a single image.
- Level 3 (Cross-image Reasoning): Challenges the model's ability to perform associative cognition and complex reasoning across multiple independent images and web pages.
This design rules out "shortcuts": models cannot succeed by relying on a single text clue or on internal parametric knowledge alone.
2. Process-Oriented Fine-Grained Evaluation (Beyond Result-Only Scoring)
In addition to the standard Success Rate (SR), the research team introduced the Process Score (PS).
Expert teams manually annotated mandatory "intermediate sub-goals" for each task. During evaluation, the system considers not only the model's final answer but also how many sub-goals it completed during the evidence-collection phase. This mechanism lets researchers replay a run and pinpoint failure modes precisely (e.g., did the model misread the image, or search for the wrong term?).
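The article does not spell out the exact scoring formula. As a minimal sketch, assuming the Process Score is simply the fraction of expert-annotated sub-goals that the agent's trajectory satisfies, it could be computed like this (function and goal names are hypothetical):

```python
def process_score(completed: set[str], required: list[str]) -> float:
    """Fraction of mandatory sub-goals the agent completed.

    `required` is the expert-annotated list of intermediate sub-goal IDs;
    `completed` is the set of sub-goal IDs matched in the agent's trajectory.
    Hypothetical formulation; the benchmark's exact scoring rule may differ.
    """
    if not required:
        return 1.0
    return sum(1 for goal in required if goal in completed) / len(required)

# Example: the agent satisfied 2 of 4 annotated sub-goals.
score = process_score(
    {"identify_landmark", "find_year"},
    ["identify_landmark", "find_year", "locate_source", "final_fact"],
)
# score == 0.5
```

Under this formulation a run can earn partial credit (PS > 0) even when the final answer is wrong (SR = 0), which is exactly the diagnostic signal the benchmark is after.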
3. High Reliability and Reproducibility
To ensure fairness, BrowseComp-V³ established extremely rigorous data filtering standards:
- All key evidence must be retrievable via public search engines.
- The team even provided a purely manually annotated "Golden Search Trajectory" for every question.
- Questions lean towards objective, time-invariant knowledge, ensuring the standardization and long-term validity of automated evaluation.
Figure 1: Statistical information of BrowseComp-V³
III. How Was the Dataset Built? A Five-Stage Quality-Control Pipeline
Constructing such a high-quality dataset is no easy feat. More than 20 master's and doctoral researchers with professional backgrounds in artificial intelligence and related fields participated in the construction of BrowseComp-V³. The entire process followed a closed-loop five-stage quality assurance framework:
- Initialization and Guideline Formulation: Expert teams defined core evaluation dimensions and wrote high-quality initial examples (including visual inputs, queries, sub-goals, answers, and metadata) to establish a "Golden Standard."
- Tool-Enhanced Exploratory Annotation: Annotators were assigned tasks based on their professional domains and used a toolkit including text search, web access, image search, and image cropping to perform real open-ended web surfing. They were required to record complete interaction trajectories and decompose sub-goals.
- Dual Verification and Adversarial Filtering: Collected data first underwent "manual reproduction verification" by independent inspectors to ensure logical coherence and solid evidence. SOTA multimodal models (such as GPT-5.2 and Gemini-3-Pro) were then used for cleaning, directly filtering out "easy questions" the models could answer effortlessly and retaining only genuinely difficult samples involving long-tail knowledge or complex reasoning.
- Structured Format Conversion: Complex interaction trajectories and multimodal data were converted into a unified, machine-readable standard JSON format.
- Expert Final Review: Domain experts conducted a final audit for safety, privacy compliance, and factual accuracy.
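The released schema is not reproduced here, but based on the components the article names (visual inputs, query, sub-goals, answer, metadata, and the golden search trajectory), one record in the unified machine-readable format might be organized roughly as follows; every field name is hypothetical and may differ from the actual dataset:

```python
# Hypothetical layout of one benchmark item; all field names are
# illustrative and may not match the released dataset's real schema.
record = {
    "question_id": "bcv3-0001",
    "domain": "Science",             # one of the five major domains
    "level": 3,                      # cross-modal interaction level (1-3)
    "images": ["input_0.png"],       # visual inputs
    "query": "placeholder question text",
    "sub_goals": [                   # expert-annotated intermediate sub-goals
        {"id": "sg1", "description": "placeholder sub-goal"},
        {"id": "sg2", "description": "placeholder sub-goal"},
    ],
    "answer": "placeholder ground truth",   # objective, time-invariant
    "golden_trajectory": [           # manually annotated search trajectory
        {"tool": "image_search", "input": "input_0.png"},
        {"tool": "web_access", "input": "placeholder URL"},
    ],
}
```

Keeping sub-goals and the golden trajectory inside each record is what makes the process-level evaluation and manual reproduction described above mechanically checkable.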
Figure 2: Data construction process
IV. What the Experiments Reveal: How Big Is the Real Gap Between Humans and AI?
To conduct a comprehensive evaluation, the research team set up four testing environments: Human Experts, Tool-Free MLLMs (a closed-book setting), Official Tool-Enhanced MLLMs, and MLLMs under the OmniSeeker framework.
Core Finding 1: A Cliff-like Performance Gap
The test results are brutal. Human experts with PhD-level domain knowledge backgrounds, using standard browsers, achieved an average Success Rate (SR) of 68.03% and a Process Score (PS) as high as 82.93%.
In contrast, no large model broke the 40% success-rate barrier; the current strongest, GPT-5.2, reached only 36.17%. This strongly indicates that BrowseComp-V³ captures the extreme complexity of real-world search.
Core Finding 2: Tool Invocation Is a Lifeline
In the tool-free, closed-book setting, the success rate of most models plummeted to around 10%. This indicates that when facing dynamic, long-tail cross-modal evidence chains, memorized parametric knowledge alone is completely insufficient. Real-time retrieval and interaction with the environment are absolute prerequisites for deep multimodal reasoning.
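The article names four tools in the annotation toolkit (text search, web access, image search, image cropping). A schematic of the kind of tool-augmented loop such an agent runs, with every function name hypothetical, could look like this:

```python
from typing import Callable

# Hypothetical tool registry mirroring the four tools named in the article;
# each stub just echoes its input where a real tool would hit the web.
TOOLS: dict[str, Callable[[str], str]] = {
    "text_search": lambda q: f"search results for {q!r}",
    "web_access": lambda url: f"page content of {url}",
    "image_search": lambda img: f"pages containing image {img}",
    "image_crop": lambda region: f"cropped region {region}",
}

def run_agent(plan: list[tuple[str, str]], max_rounds: int = 10) -> list[str]:
    """Execute a sequence of tool calls under a fixed round budget.

    A real agent would let the MLLM choose each next call based on the
    observations gathered so far; here the plan is fixed for illustration.
    """
    observations = []
    for tool_name, arg in plan[:max_rounds]:
        observations.append(TOOLS[tool_name](arg))
    return observations

obs = run_agent([("image_search", "clue.png"), ("web_access", "https://example.org")])
```

The round budget in `max_rounds` is the knob that the test-time scaling experiments later in the article vary.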
Core Finding 3: The Strong Counterattack of Open-Source Models
Although closed-source giants (like GPT-5.2) still occupy the top spot, excellent open-source models are rapidly narrowing the gap. Especially when equipped with the unified OmniSeeker agent framework, Doubao-Seed-1.8 demonstrated extremely strong complex reasoning capabilities, with its success rate soaring to 33.67%, even approaching some top-tier closed-source systems. This provides immense confidence for building cost-effective open-source web browsing agents in the future.
Core Finding 4: The Truth Revealed by Process Scores (PS)
Experiments consistently found that models' Process Scores (PS) were far higher than their final Success Rates (SR). This indicates that models can usually complete the first few simple sub-goals, but in long-horizon tasks they tend to lose the thread, failing to maintain logical coherence and ultimately falling at the final hurdle.
Figure 3: Main experimental results
V. Deep Analysis: Where Exactly Do Models Fail?
To investigate the root causes of model failure, the research team conducted further fine-grained analysis.
1. The Deeper the Task Complexity, the Faster the Collapse
From Level 1 to Level 3, as the demand for cross-region integration and cross-image reasoning increased, model performance declined steadily. This exposes that while current MLLMs can understand a single image, they remain inadequate at handling page-level interleaved text and images and multi-image associative reasoning.
2. Different Capability Bottlenecks: Humans Run Out of Stamina, Models Fail at Fusion
Interestingly, as the search path (hop count) grows, the human success rate drops even more steeply than the models'. The human bottleneck is information overload: reading large amounts of long text is extremely cognitively taxing, whereas large models, benefiting from huge context windows, can digest long texts with ease.
However, the models' true Achilles' heel lies in multimodal integration and visual grounding. Amid complex web layouts and noise, models often go "blind," unable to accurately extract and perceive the key visual clues.
3. Endowing AI with "Slow Thinking": The Power of Test-Time Scaling
The research team also explored the impact of increasing computation at test time on performance. The results are inspiring:
- Increasing Interaction Rounds: Giving agents more exploration steps significantly improves performance. Larger models (such as Qwen3-VL-235B) in particular showed stronger long-horizon reasoning, making better use of the extra rounds for trial and error and self-correction.
- Best-of-N Sampling Strategy: Letting the model run multiple independent, parallel searches and then selecting the best answer among them scales better than simple voting and continues to boost the final success rate.
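The paper's exact selection mechanism is not detailed here; a minimal contrast between simple majority voting and a best-of-N selector, assuming some scoring function (e.g. a judge model's confidence) is available, might be:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Simple voting: pick the most frequent answer among N parallel runs."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: list[str], scores: list[float]) -> str:
    """Best-of-N: pick the single answer with the highest judge score.

    `scores` stands in for whatever selector the real system uses
    (e.g. a verifier model's confidence); it is hypothetical here.
    """
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]

runs = ["1897", "1901", "1897", "1903"]
print(majority_vote(runs))                     # -> "1897" (most frequent)
print(best_of_n(runs, [0.2, 0.9, 0.3, 0.4]))   # -> "1901" (highest score)
```

The key difference: voting can only surface an answer that several runs already agree on, while best-of-N can promote a minority answer if the selector scores it highly, which is what makes it the more scalable strategy as N grows.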
VI. Conclusion and Outlook
By proposing the BrowseComp-V³ benchmark and the OmniSeeker framework, this paper conducts a systematic and in-depth study on the core issue of multimodal large models in "open-world deep search." The research results clearly indicate that merely granting models basic visual perception and simple tool invocation capabilities is far from enough.
To truly unleash the potential of multimodal browsing agents, future research needs deeper innovation in cross-modal information integration and long-horizon planning, fostering genuine synergy among visual perception, dynamic retrieval, and complex logical reasoning. BrowseComp-V³ provides a reliable yardstick for measuring this progress and aims to offer useful reference points and new directions for the multimodal agent field.