> Project Address: https://github.com/Infinity-AILab/DeepResearchEval
The Dawn of Deep Research AI Evaluation: How Do Humans Judge if AI Can Truly "Do Research"?
1. Research Background: When AI Starts "Deep Research," Who Evaluates Their Capabilities?
Imagine this scenario: You ask AI to research "How will the 2025 semiconductor export controls affect the IoT hardware supply chain." It needs to search multiple rounds of information, integrate academic papers, industry reports, and news articles, and finally generate a deep report of tens of thousands of words. This is what today's hottest Deep Research Systems are doing—they are no longer simple Q&A machines but can, like human researchers, perform multi-step information retrieval, cross-validation, and multi-angle comprehensive analysis.
OpenAI's Deep Research, Google's Gemini Deep Research, Claude, and other top AIs are racing on this track. But the question is: How do we know the quality of the long research reports these AIs generate? Which systems are truly reliable, and which are just "talking nonsense"?
Existing evaluation benchmarks have three major pain points:
Task Construction is Too Labor-Intensive: experts must manually design research tasks, so costs are high and updates are slow.
Evaluation Dimensions are Too Rigid: the same set of standards is used to evaluate all tasks, missing the special requirements of different research questions.
Fact-Checking Has Blind Spots: only content with citations is checked; the large number of statements without annotated sources goes unverified.
To address these issues, a research team from Nanyang Technological University and Shanda Group has launched the DeepResearchEval framework—a complete evaluation system that can automatically generate research tasks, intelligently assess report quality, and actively verify facts.
As seen in the figure above, they evaluated 9 mainstream deep research systems and found that Gemini-2.5-Pro scored the highest in comprehensive quality (8.51/10), while Manus performed best in factual accuracy (82.3% of statements verified as correct). More interestingly, all systems scored significantly lower on "task-specific dimensions" than on general dimensions, indicating that current AIs still have much room for improvement in meeting specific research needs.
2. Related Work: The Evaluation Track is Flourishing, but All Have Obvious Shortcomings
The evaluation of deep research systems is an emerging field; existing benchmarks each have their focus but all have limitations. The research team reviewed more than a dozen related benchmarks and found they can be divided into several categories:
Early Tool-Use Benchmarks (e.g., GAIA, HLE) mainly examine AI's reasoning and tool-calling capabilities but do not involve long report generation. Web Navigation Benchmarks (e.g., WideSearch, BrowseComp) focus on continuous web search and information retrieval, but their output format is short answers or tables, not complete reports.
Benchmarks specifically targeting deep research reports have only emerged in recent years, including DeepResearch Bench, LiveResearchBench, DRBench, etc. But they all highly rely on manual annotation—experts need to design tasks, write reference answers, and formulate scoring criteria, which leads to small task scale, difficult updates, and high costs.
In terms of evaluation methods, most benchmarks use fixed evaluation dimensions (e.g., all use "accuracy, completeness, clarity"), which cannot adapt to the special needs of different types of research tasks. In fact-checking, existing methods usually only verify parts of the report with annotated citations and are powerless against the large number of statements without annotated sources (which may account for 30-50% of a report).
As shown in Table 1, DeepResearchEval is the first benchmark to simultaneously achieve the following five features: automated task generation, output of long reports, no reference answers required, adaptive evaluation dimensions, and active fact verification. This enables it to continuously generate fresh, high-quality tasks, like a "living" benchmark, suitable for long-term monitoring of AI system evolution.
3. Core Method: Automatically Generating Real Tasks + Intelligent Hierarchical Evaluation
3.1 Task Construction: Generating Research Questions with "Personas"
Traditional methods let experts directly think of research questions, but experts' backgrounds and perspectives are limited. DeepResearchEval's innovation is using a "persona-driven" method to automatically generate tasks. The entire process is divided into three steps:
Step 1: Persona Synthesis. The research team first defined 10 broad domains (transportation, politics, finance, history, software development, industry, sports, health, technology, education), then had the LLM generate 5 roles with different backgrounds for each domain. Each role has a detailed personal resume, including affiliated institution, position, educational background, work experience, and professional sub-fields. For example, in the "industry" domain, an "IoT Engineer Ethan Kim, specializing in industrial sensor gateways, focusing on semiconductor supply chains" might be generated.
Step 2: Task Construction Based on Persona. For each role, have the LLM generate 4 deep research tasks that match its professional background. These tasks must meet four hard requirements: (1) Require multiple rounds of search; (2) Must integrate multiple sources such as papers, reports, and forums; (3) Include sufficient analysis depth (latest developments, data analysis, trend assessment, comparative research); (4) Have clear deliverables and time constraints. Finally, 200 candidate tasks are generated.
Step 3: Two-Stage Filtering. The first filter, "Task Qualification Screening," evaluates whether the task really requires the latest knowledge, multi-source evidence, and multi-level investigation, and whether it matches the role background, retaining only tasks with confidence >0.7. The second filter, "Search Necessity Screening," lets the LLM try to answer the task without using any search tools; if it can answer well without searching, the task is too simple and is directly eliminated.
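The two-stage filter above can be sketched in Python. The LLM calls are stubbed out: `qualify` and `answer_without_search` are hypothetical stand-ins for the real model prompts, and only the 0.7 confidence threshold comes from the paper.

```python
# Sketch of the two-stage task filter (assumption: stub functions replace
# the real LLM prompts; only the 0.7 threshold is taken from the paper).

def qualify(task: str) -> float:
    """Stub for "Task Qualification Screening": returns a confidence that
    the task needs fresh, multi-source, multi-level research."""
    return 0.9 if ("latest" in task or "2025" in task) else 0.5

def answer_without_search(task: str) -> bool:
    """Stub for "Search Necessity Screening": True means the LLM answered
    well with no search tools, so the task is too easy and is dropped."""
    return "define" in task.lower()

def filter_tasks(candidates: list[str], threshold: float = 0.7) -> list[str]:
    kept = []
    for task in candidates:
        if qualify(task) <= threshold:      # stage 1: qualification screening
            continue
        if answer_without_search(task):     # stage 2: search necessity screening
            continue
        kept.append(task)
    return kept

candidates = [
    "Assess the latest 2025 semiconductor export controls on IoT supply chains",
    "Define Ohm's law",
    "Summarize a famous poem",
]
print(filter_tasks(candidates))  # only the first task survives both filters
```

With real LLM calls in place of the stubs, the same loop structure would narrow the 200 candidates down to the final 155 tasks.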
This process ultimately produces 155 high-quality tasks. To verify the reliability of the automated process, the research team invited 7 doctoral experts to independently evaluate these tasks, and the results showed that 80% of the tasks were recognized by at least 4 experts, proving that the quality of automatic generation can be comparable to manual design.
3.2 Quality Evaluation: Tailoring Scoring Criteria for Each Task
Traditional evaluation methods use the same set of standards to evaluate all reports, which is obviously unreasonable—like using the same exam paper to test both math and language. DeepResearchEval proposes an Adaptive Point-wise Quality Evaluation framework.
This framework includes two types of evaluation dimensions:
General Dimensions (Fixed): Coverage, Insight, Instruction-following, Clarity. These four dimensions apply to all research reports.
Task-Specific Dimensions (Dynamically Generated): For each specific task, the LLM automatically generates 1-3 exclusive dimensions. For example, for the task "Compare the regulatory frameworks for electric scooters in the US, EU, and China," the system will generate dimensions like "Policy Pragmatism" and "Comparative Synthesis"; for the task "Evaluate the nutritional quality of plant-based meat products," it will generate "Classification Rigor" and "Cross-Regional Synthesis," etc.
Each dimension not only has a weight (representing importance) but is also further refined into multiple criteria, each with its own weight. The LLM scores each criterion (1-10 points, accurate to two decimal places), and the overall quality score is obtained through weighted aggregation:
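As a concrete illustration, the weighted aggregation can be sketched in Python. The dimension and criterion names and weights below are illustrative, not the paper's actual rubric.

```python
# Sketch of the two-level weighted aggregation (assumption: illustrative
# dimension/criterion names and weights, not the paper's rubric).

def aggregate(dimensions: dict) -> float:
    """Each dimension has a weight and a dict of criterion -> (weight, score).
    Scores are on a 1-10 scale; weights are normalized at both levels."""
    dim_total = sum(d["weight"] for d in dimensions.values())
    overall = 0.0
    for d in dimensions.values():
        crit_total = sum(w for w, _ in d["criteria"].values())
        # weighted mean of criterion scores within the dimension
        dim_score = sum(w * s for w, s in d["criteria"].values()) / crit_total
        # weighted contribution of the dimension to the overall score
        overall += (d["weight"] / dim_total) * dim_score
    return overall

report_scores = {
    "Coverage": {"weight": 0.3, "criteria": {"breadth": (0.5, 8.5), "sources": (0.5, 9.0)}},
    "Insight":  {"weight": 0.3, "criteria": {"depth": (1.0, 7.0)}},
    "Policy Pragmatism": {"weight": 0.4, "criteria": {"actionability": (1.0, 6.5)}},
}
print(aggregate(report_scores))
```

Note how the task-specific dimension ("Policy Pragmatism") enters the same aggregation as the general dimensions, just with its own weight.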
The beauty of this design is that it retains general dimensions that are comparable across tasks while capturing the unique quality requirements of each task, achieving a balance between generality and specificity.
3.3 Fact-Checking: Actively Verifying Every Sentence
Existing methods only check content with annotated citations in the report, but many statements generated by AI have no citations, or the citations are just decorative. DeepResearchEval developed an Active Fact-Checking Agent, which actively searches for external evidence to verify every verifiable statement in the report.
The verification process is divided into four steps:
1. Segment Processing: divide the long report into multiple paragraphs for parallel processing while preserving context.
2. Statement Extraction: extract verifiable statements involving numbers, events, dates, locations, and people from each paragraph.
3. Evidence Retrieval: for each statement, the agent calls search tools (the Google Serper API) to find relevant evidence.
4. Three-Category Judgment: assign a label based on the evidence: Right (statement supported by reliable sources), Wrong (statement contradicts reliable sources), or Unknown (insufficient evidence to verify).
The cleverness of this design is the clear distinction between "unverified" and "wrong." The problem with many AI systems is not that they clearly say something wrong, but that they make vague statements that cannot be verified—by marking "Unknown," such risks can be clearly identified.
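The verification loop can be sketched as follows. Here `search_evidence` is a stand-in for the Google Serper API call, and the judgment rule is a toy string heuristic, whereas the real agent uses an LLM to compare statement and evidence.

```python
# Sketch of the active fact-checking loop (assumptions: `search_evidence`
# stubs the Serper API; `judge` is a toy heuristic, not the paper's LLM judge).

def search_evidence(statement: str) -> list[str]:
    # Hypothetical stub: a real agent would issue a web search here.
    corpus = {
        "The Eiffel Tower is in Paris": ["Reliable source: the Eiffel Tower stands in Paris."],
        "The Eiffel Tower is in Rome": ["Reliable source: the Eiffel Tower stands in Paris."],
    }
    return corpus.get(statement, [])

def judge(statement: str, evidence: list[str]) -> str:
    if not evidence:
        return "Unknown"  # insufficient evidence to verify
    # Toy check: does any evidence snippet mention the claimed location?
    supported = any(statement.split(" is in ")[-1] in e for e in evidence)
    return "Right" if supported else "Wrong"

def check_report(statements: list[str]) -> dict:
    labels = [judge(s, search_evidence(s)) for s in statements]
    return {lab: labels.count(lab) for lab in ("Right", "Wrong", "Unknown")}

print(check_report([
    "The Eiffel Tower is in Paris",
    "The Eiffel Tower is in Rome",
    "The launch date was never announced",
]))  # one Right, one Wrong, one Unknown
```

The key design point survives even in this toy version: a statement with no usable evidence lands in "Unknown" rather than being silently skipped or counted as wrong.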
The final factual accuracy rate is calculated as:
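A natural formalization, under our assumption that Unknown statements count against accuracy (which is consistent with the per-system statement counts reported below), is:

```latex
\text{Accuracy} = \frac{N_{\text{Right}}}{N_{\text{Right}} + N_{\text{Wrong}} + N_{\text{Unknown}}}
```

where N_Right, N_Wrong, and N_Unknown are the numbers of statements assigned each label.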
4. Experimental Results: Who is the Strongest Deep Research AI? Where are the Gaps?
The research team conducted a comprehensive evaluation of 9 mainstream deep research systems, including OpenAI Deep Research, Gemini-2.5-Pro Deep Research, Grok4, Claude-Sonnet-4.5, Qwen3-235B, DeepSeek, Perplexity, Doubao, and Manus. Each system generated 100 reports, totaling 900 deep research reports evaluated.
4.1 Comprehensive Quality: Gemini Leads, with Clear Tiers
From the quality evaluation results, Gemini-2.5-Pro leads with a high score of 8.51, performing excellently in all dimensions, especially Coverage (9.2), Insight (9.0), and Instruction-following (9.7). Claude-Sonnet-4.5 ranks second (7.53), also showing balanced capabilities.
The middle tier includes OpenAI (7.28), Qwen (7.17), and Doubao (7.06). They score higher in Coverage and Instruction-following (over 8.5), but still have room for improvement in Insight and task-specific dimensions.
Relatively weaker performers are DeepSeek (5.25) and Manus (5.95). Although they are acceptable in Instruction-following, they are clearly insufficient in the breadth of information collection and depth of analysis.
The most notable finding: All systems scored significantly lower on "task-specific dimensions" than on general dimensions. This indicates that current deep research AIs generally have a problem—they are good at generating seemingly professional general reports, but still fall short in meeting the unique needs of specific tasks. For example, a policy analysis task requires "implementable policy recommendations" and "quantifiable safety indicators," but many systems only talk in general terms without providing truly actionable and specific content.
4.2 Factual Accuracy: The Trade-off Between Conservative Strategy and High Output Volume
In fact-checking, the ranking shifts in interesting ways. Manus takes first place with an accuracy rate of 82.3%, followed closely by Gemini (76.62%) and DeepSeek (76.44%).
The data reveals a key trade-off:
High Accuracy, Low Output Volume Strategy: DeepSeek averages only 25.08 verifiable statements per report, but has a high accuracy rate of 76.44%, with only 1.81 wrong statements. This is a "conservative and cautious" strategy.
High Output Volume Strategy: Gemini and Doubao generate an average of 86.99 and 80.75 statements, with richer and more detailed content, but the accuracy rate decreases. However, it is worth noting that Gemini still maintains a 76.62% accuracy rate even with high output, which is quite rare.
Another interesting finding: All systems have far more "Unknown" statements than "Wrong" statements. For example, Perplexity has 16.10 unknown statements but only 9.08 wrong statements. This indicates that the main risk of AI systems is not directly saying something wrong, but making vague statements that sound reasonable but are actually unverifiable—this may be more dangerous than obvious errors, as users are more easily misled.
4.3 Evaluation Method Validation: Are AI Judges Reliable?
To verify the reliability of the evaluation framework, the research team conducted three validations:
Cross-Judge Consistency: In addition to the main judge Gemini-2.5-Pro, GPT-5 was used as the second judge. Although GPT-5 scored more strictly (generally lower scores), 7 of the 9 systems had completely consistent rankings, with only Doubao and Qwen swapping positions (differing by only 1 spot), indicating very stable rankings.
Random Stability: Using Gemini-2.5-Pro to run the evaluation independently three times, the rankings of all systems remained completely unchanged, with minimal standard deviations in scores, proving that the evaluation process is highly stable.
Human-Machine Alignment: Four experts manually annotated 80 statements, and compared with the AI judge's judgments, the consistency reached 73%. More interestingly, the research team conducted an in-depth analysis of 20 inconsistent cases and found that the AI's judgment was correct in 70% of them, while the humans were correct in 30%—mainly because the AI can exhaustively search and verify, while human experts may miss certain evidence.
5. Paper Summary
The biggest contribution of this paper is proposing a sustainable and scalable evaluation paradigm for deep research AI. Compared with traditional benchmarks, it has three major breakthroughs:
Automated Task Construction: Through persona-driven methods, it can continuously generate fresh, high-quality research tasks that closely meet real needs, freeing itself from dependence on expensive expert annotations. This allows the benchmark to be continuously updated like a "living" benchmark, adapting to rapid technological iterations.
Intelligent Evaluation: The adaptive evaluation dimension design ensures comparability across tasks while capturing the unique needs of each task. This is more scientific than a "one-size-fits-all" fixed standard and can better discover the real shortcomings of systems.
Comprehensive Verification: Active fact-checking not only checks content with citations but also verifies a large number of statements without annotated sources, plugging the blind spots in fact verification, and clearly distinguishing different types of problems through three categories (Right/Wrong/Unknown).
Of course, this framework also has limitations. Currently, it mainly focuses on the English environment, with insufficient support for multilingual and cross-language evidence integration. Additionally, evaluation costs are high: heavy use of Gemini-2.5-Pro and GPT-5-mini, plus frequent search API calls, will create cost pressure in large-scale or real-time deployment.
But the flaws do not obscure the merits; DeepResearchEval has set a new benchmark for the evaluation of deep research AI. As AI systems increasingly undertake complex research tasks, we need not only AIs that can generate long texts but also AIs that can do reliable research. This framework is like a "gold standard" for AI research capabilities—it tells us that a qualified deep research system must not only be able to search and summarize but also customize analysis according to specific needs and ensure that every statement is well-documented.
From the experimental results, even top systems like Gemini-2.5-Pro still show significant gaps on task-specific dimensions, and the factual accuracy of even the best systems is only 76-82%. This means that deep research AI is far from being "fully trustworthy". We need better evaluation tools to continuously monitor their evolution—DeepResearchEval is a solid step in this direction.