Are Static Benchmarks ‘Outdated’? OpenKG Updates Its Knowledge-Enhanced Dynamic LLM Evaluation Leaderboard: Dynamic OneEval-202605

Summary: The Dynamic OneEval leaderboard has been updated to its May 2026 version. It continues to use an “automated data synthesis + human verification” approach, adding 190 new test cases to evaluate the reasoning capabilities of the latest models, including GPT-5.5, Deepseek-V4-pro, and Qwen3.6-plus, on heterogeneous knowledge sources. The related research has been accepted by IJCAI-2026.

Evaluation Website & Online Leaderboard: http://oneeval.openkg.cn

Visit the official website for the full leaderboard, dataset description, and continuously updated results.

Introduction

The capabilities of reasoning-oriented large language models are improving at an unprecedented rate. However, the limitations of traditional static benchmarks are increasingly apparent: leaderboard scores keep rising, yet they may mask the models’ fragility under boundary conditions. Repeated use of the same benchmarks introduces data contamination risks, so a “high score” can no longer distinguish genuine generalization from memorization of training data. This problem is especially acute for knowledge-intensive reasoning tasks, where models must not only provide correct answers but also justify them with evidence, keep up with the latest facts, and complete multi-step reasoning chains consistently.

Figure 1 The difference between Dynamic OneEval and existing evaluation benchmarks

In response, OpenKG launched the dynamic leaderboard Dynamic OneEval in February 2026. Unlike static benchmarks, Dynamic OneEval starts from the model’s actual failure cases, turns the answer to “why did the model make a mistake” into actionable generation constraints, and systematically produces evaluation samples that are difficult and whose sources of difficulty are traceable.

Specifically, Dynamic OneEval implements a closed-loop upgrade from “error review” to “targeted difficulty reproduction” in three steps. First, root-cause localization: it pinpoints why the model failed, identifying specific weaknesses in dimensions such as knowledge memorization and multi-step reasoning. Second, constraint-driven synthesis: it encodes failure patterns as generation constraints and synthesizes challenging test samples in a targeted manner. Third, dynamic iterative updating: evaluation results guide the generation of the next batch of samples, forming a continuously evolving evaluation loop.

This release, Dynamic OneEval-202605, adds 190 test samples and evaluates 18 mainstream domestic and international large language models, including GPT-5.5, Qwen3.6-plus, and Deepseek-V4-pro. OpenKG will continue to update the OneEval benchmark platform and will release evaluation results for new versions. Stay tuned.

Overview: Dynamic OneEval-202605 Overall Leaderboard

Table 1 Dynamic OneEval-202605 Overall Leaderboard

We evaluated 18 cutting-edge large language models using Dynamic OneEval-202605 under a unified experimental setup, including several of the latest models (GPT-5.5, Qwen3.6-plus, Deepseek-V4-pro). The results show that although Dynamic OneEval is constructed based on synthetic data, it remains highly challenging overall. The top overall score in this round, achieved by GPT-5.5, is only 56.2%. Looking at the score distribution, the rankings exhibit the following characteristics:

1. Intense Competition at the Top

GPT-5.5 tops the list with 56.2%, surpassing Claude-opus-4.6-thinking (55.3%), with a difference of only 0.9 percentage points. Gemini-3.1-pro (52.9%) and Gemini-3-pro (52.8%) follow closely. The gap among the top four is just 3.4 percentage points, indicating that the top-tier competition has entered a phase of “fighting for fractions of a percent.” Compared to the previous version, where Gemini-3-pro (46.4%) held a commanding 9.0 percentage point lead over the second-place model, the gap among top models has narrowed significantly.

2. Dense Upper-Mid Tier, Fierce Competition

Qwen3.6-plus (51.1%), Glm-5 (50.1%), Qwen3.5-plus (49.4%), and GPT-5.4 (47.6%) form the second tier. Following them, Claude-sonnet-4.5 (43.4%), Hunyuan-2.0 (41.2%), Deepseek-V4-pro (41.1%), and GPT-5.2 (40.5%) constitute the mid-range group, where the largest gap among these four is only 2.9 percentage points.

3. Knowledge Gaps, More Than Reasoning Pitfalls, Remain the Core Shortcoming of LLMs

Most models scored significantly lower on K-Stress than on R-Stress. Deepseek-V4-pro’s Text K-Stress score was as low as 25.0% (compared to its Text R-Stress score of 55.0%), and its KG K-Stress score was just 8.0% (compared to KG R-Stress at 60.0%). This illustrates a significant asymmetry between knowledge gaps and reasoning ability.
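The tier gaps quoted above can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary holds just the twelve overall scores named in the text (not the full 18-model leaderboard), and the helper name `spread` is a hypothetical choice, not part of any OneEval tooling.

```python
# Overall scores (percent) quoted in the text; the remaining models are omitted.
scores = {
    "GPT-5.5": 56.2,
    "Claude-opus-4.6-thinking": 55.3,
    "Gemini-3.1-pro": 52.9,
    "Gemini-3-pro": 52.8,
    "Qwen3.6-plus": 51.1,
    "Glm-5": 50.1,
    "Qwen3.5-plus": 49.4,
    "GPT-5.4": 47.6,
    "Claude-sonnet-4.5": 43.4,
    "Hunyuan-2.0": 41.2,
    "Deepseek-V4-pro": 41.1,
    "GPT-5.2": 40.5,
}


def spread(models):
    """Return the max-minus-min score of a group, in percentage points."""
    values = [scores[name] for name in models]
    return round(max(values) - min(values), 1)


# Rank the quoted models from highest to lowest overall score.
ranked = sorted(scores, key=scores.get, reverse=True)
top_four = ranked[:4]
second_tier = ranked[4:8]
mid_range = ranked[8:12]

print("top four spread:", spread(top_four))        # 3.4
print("second tier spread:", spread(second_tier))  # 3.5
print("mid-range spread:", spread(mid_range))      # 2.9
```

Note that the gaps are differences between percentages, so they are reported in percentage points rather than percent.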

Dynamic OneEval Construction Strategy

Figure 2 Dynamic OneEval Construction Process

Dynamic OneEval adopts a three-phase closed-loop construction strategy (Structured Error Analysis, Dual-Perspective Instance Synthesis, and Multi-criterion Gating) to continuously produce highly challenging dynamic evaluation samples that are traceable and not easily gamed by memorization.

Phase 1: Structured Error Analysis

This phase conducts a structured review of a model’s failure cases on the seed dataset. Using a large language model as an analyzer, it reconstructs the model’s reasoning trajectory, pinpoints the failed reasoning step, and diagnoses the root cause type (e.g., entity linking confusion, reasoning after partial entity recognition, evidence omission). It then generates a structured “Difficulty Card” that records which reasoning step went wrong and which input features triggered the error. This step turns the answer to “why did the model fail” into actionable constraints for subsequent generation.
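To make the idea of a Difficulty Card concrete, here is a minimal sketch of how such a record might be represented. The article does not publish the card schema, so every field name (`question_id`, `failed_step`, `trigger_features`) and the `to_constraints` helper are assumptions for illustration; only the three root-cause examples come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class RootCause(str, Enum):
    # The three example root causes named in the text; the real taxonomy may be larger.
    ENTITY_LINKING_CONFUSION = "entity_linking_confusion"
    PARTIAL_ENTITY_RECOGNITION = "reasoning_after_partial_entity_recognition"
    EVIDENCE_OMISSION = "evidence_omission"


@dataclass
class DifficultyCard:
    """Hypothetical structure: which step failed, why, and what triggered it."""
    question_id: str
    failed_step: int                      # index of the reasoning step that went wrong
    root_cause: RootCause
    trigger_features: List[str] = field(default_factory=list)

    def to_constraints(self) -> Dict[str, object]:
        """Turn the diagnosis into constraints a generator could be asked to satisfy."""
        return {
            "target_step": self.failed_step,
            "root_cause": self.root_cause.value,
            "must_include": list(self.trigger_features),
        }


card = DifficultyCard(
    question_id="seed-0001",
    failed_step=2,
    root_cause=RootCause.EVIDENCE_OMISSION,
    trigger_features=["answer requires combining two separate passages"],
)
print(card.to_constraints())
```

Keeping the card as structured data rather than free text is what lets the later synthesis phase treat the diagnosis as a constraint instead of a comment.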

Phase 2: Dual-Perspective Instance Synthesis

Based on the Difficulty Card’s diagnosis, new questions are synthesized from two complementary perspectives:

Knowledge-Stress (K-Stress): Targets cases where the model failed due to “knowledge gaps.” The original knowledge source is kept unchanged, but the missing critical fact is extracted and turned into an atomized “knowledge black box.” New facts are then extracted from the original knowledge source and combined with this black box to generate new question-answer pairs. This ensures the new question still relies on the fact the model doesn’t know, reliably reproducing failures caused by knowledge gaps.
Reasoning-Stress (R-Stress): Targets cases where the knowledge source contained sufficient information, but the model still made reasoning errors. It uses fictional entities to construct virtual knowledge sources (preventing the model from taking shortcuts via parameter memory) and employs a “reasoning skeleton” method to inherit the reasoning bottlenecks and triggering conditions from the original failure, generating new trap-like questions.
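One ingredient of R-Stress, replacing real entities with fictional ones so that the model cannot fall back on memorized facts, can be illustrated on knowledge-graph triples. This is a simplified sketch under stated assumptions: the function `fictionalize`, the placeholder naming scheme, and the example triples are all hypothetical, and the actual pipeline also carries over the reasoning bottlenecks from the Difficulty Card, which is not modeled here.

```python
from itertools import count
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def fictionalize(triples: List[Triple], prefix: str = "Entity") -> Tuple[List[Triple], Dict[str, str]]:
    """Rename every entity consistently while keeping relations and graph shape."""
    mapping: Dict[str, str] = {}
    counter = count(1)

    def rename(entity: str) -> str:
        # The same real entity always maps to the same fictional name,
        # so multi-hop paths through it are preserved.
        if entity not in mapping:
            mapping[entity] = f"{prefix}-{next(counter)}"
        return mapping[entity]

    renamed = [(rename(head), relation, rename(tail)) for head, relation, tail in triples]
    return renamed, mapping


# Illustrative two-hop chain; not taken from the OneEval data.
original = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]
virtual, name_map = fictionalize(original)
print(virtual)
```

Because only names change, the two-hop structure of the question (the "reasoning skeleton") survives, while the answer can no longer be recalled from parametric memory.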

Phase 3: Multi-criterion Gating

After synthesis is complete, two independent large language model reviewers are introduced for quality control:

Answerability Review: Confirms the question is answerable under the corresponding stress type and that the answer has clear contextual support.
Consistency Review: Independently solves the question, verifying answer consistency and whether the difficulty points specified in the Difficulty Card are genuinely manifested in the question.

Only samples that pass both reviews are included in the final dataset, ensuring the output is high-quality evaluation data that is “difficult yet answerable, with controllable ambiguity.”
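The gating rule, keep a sample only if it passes both reviews, amounts to a conjunctive filter. In the sketch below the two reviewer functions are deliberately trivial stand-ins for the LLM reviewers described above, and the sample fields (`context`, `answer`, `independent_answer`, `difficulty_present`) are assumed names rather than the dataset's real schema.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]
Reviewer = Callable[[Sample], bool]


def answerability_review(sample: Sample) -> bool:
    """Stand-in check: the answer must be supported by the supplied context."""
    return str(sample["answer"]) in str(sample["context"])


def consistency_review(sample: Sample) -> bool:
    """Stand-in check: an independent solve agrees and the intended difficulty shows up."""
    return sample["independent_answer"] == sample["answer"] and bool(sample["difficulty_present"])


def gate(samples: List[Sample], reviewers: List[Reviewer]) -> List[Sample]:
    """Keep only samples that every reviewer accepts."""
    return [s for s in samples if all(review(s) for review in reviewers)]


candidates = [
    {"context": "Entity-1 was born in Entity-2.", "answer": "Entity-2",
     "independent_answer": "Entity-2", "difficulty_present": True},
    {"context": "Entity-1 was born in Entity-2.", "answer": "Entity-3",
     "independent_answer": "Entity-3", "difficulty_present": True},   # not supported by context
    {"context": "Entity-4 leads Entity-5.", "answer": "Entity-5",
     "independent_answer": "Entity-4", "difficulty_present": True},   # solvers disagree
]
accepted = gate(candidates, [answerability_review, consistency_review])
print(len(accepted), "of", len(candidates), "samples pass both reviews")
```

Requiring both reviews trades yield for quality: a sample that is hard but unanswerable, or answerable but not hard in the intended way, is dropped.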

Knowledge Reasoning Rankings

We evaluated the performance differences of 18 frontier models under K-Stress (knowledge stress) and R-Stress (reasoning stress) across three knowledge types: Text, Knowledge Graph, and Table. Text and KG reasoning include both K-Stress and R-Stress dimensions, allowing a direct comparison of model capability divergence under “knowledge deficiency” and “reasoning trap” pressures. Table reasoning only sets R-Stress, reflecting models’ combinatorial execution abilities on highly structured data. Three grouped bar charts display model rankings and K/R-Stress comparisons for each dimension (see figures below). Overall, the gap between K-Stress and R-Stress not only reflects the capability distribution of models across different knowledge types but also reveals the structural weaknesses of current large language models in knowledge-intensive reasoning.

4.1 Text Reasoning

Figure 3 Dynamic OneEval-202605 Text Reasoning Leaderboard

Experimental results show that text reasoning is the sub-category with the highest overall difficulty, and most models score higher on K-Stress than on R-Stress. GPT-5.5 has the smallest gap (55.0% vs 45.0%, a 10 percentage point difference). Claude-opus-4.6-thinking and Gemini-3.1-pro both show a 30 percentage point difference, Qwen3.6-plus reaches a 45 percentage point gap, and GPT-5.2’s gap is as high as 60 percentage points (65.0% vs 5.0%). This indicates that current models’ “high scores” in text reasoning rely more on stitching together surface-level cues and pattern matching than on genuine logical derivation. When knowledge is withheld under K-Stress, models can still infer the answer from contextual clues. However, when the reasoning path is systematically laid with traps under R-Stress, the model’s logical chain is highly prone to collapse. This suggests that while knowledge gaps in text reasoning can be compensated for by context, the fragility of the reasoning chain is the deeper bottleneck.

4.2 Knowledge Graph Reasoning

Figure 4 Dynamic OneEval-202605 Knowledge Graph Reasoning Leaderboard

In stark contrast to text reasoning, KG reasoning displays a completely reversed pattern: all models scored significantly higher on R-Stress than on K-Stress. GPT-5.5 was the most balanced (KG K-Stress 42.0% vs KG R-Stress 62.0%, a 20 percentage point difference). Meanwhile, Doubao-seed-1.6’s KG K-Stress score was only 2.0%, while its KG R-Stress score was as high as 62.0%, a difference of 60 percentage points. Deepseek-V4-pro showed similarly extreme polarization (8.0% vs 60.0%). This difference reveals that the structured representation of knowledge graphs naturally provides a “scaffold” under reasoning stress: the entity-relationship paths in the graph constrain the reasoning direction, making it easier for models to search along edges and validate hypotheses. However, when critical facts are abstracted into a “knowledge black box,” the graph’s structural advantages turn into structural blind spots, because models cannot establish effective connections across the missing nodes. This indicates that models’ current KG reasoning capability is more “structure-driven” than “knowledge-driven,” with the formalized graph structure masking substantial shortcomings in knowledge reasoning.
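The K-Stress/R-Stress gaps for KG reasoning can be recomputed directly from the quoted figures. The sketch below covers only the three models named in this subsection; `stress_gap` is a hypothetical helper, and a positive value means the R-Stress score is the higher one.

```python
# (K-Stress, R-Stress) scores in percent for KG reasoning, as quoted in the text.
kg_scores = {
    "GPT-5.5": (42.0, 62.0),
    "Doubao-seed-1.6": (2.0, 62.0),
    "Deepseek-V4-pro": (8.0, 60.0),
}


def stress_gap(k_score: float, r_score: float) -> float:
    """Signed gap in percentage points; positive means R-Stress is higher."""
    return round(r_score - k_score, 1)


gaps = {model: stress_gap(k, r) for model, (k, r) in kg_scores.items()}

# Identify the most balanced and the most polarized of the quoted models.
most_balanced = min(gaps, key=lambda model: abs(gaps[model]))
most_polarized = max(gaps, key=lambda model: abs(gaps[model]))

for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f} points")
print("most balanced:", most_balanced)
print("most polarized:", most_polarized)
```

Reporting the gap with a sign makes the reversal between sub-categories visible at a glance, which a single averaged score would hide.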

4.3 Table Reasoning

Figure 5 Dynamic OneEval-202605 Table Reasoning Leaderboard

Table reasoning only features the R-Stress test item, but its overall scores are significantly higher than those of the previous two dimensions. Qwen3.5-plus leads at 90.0%. The six leading models (Qwen3.5-plus, GPT-5.5, Gemini-3.1-pro, Claude-opus-4.6-thinking, Gemini-3-pro, Glm-5) all scored above 83.3%. In contrast, Llama-3.1-8b scored only 26.7%, creating a range exceeding 63 percentage points. This distribution suggests that the highly structured presentation of tables (cell alignment, row/column constraints) provides the model with clear operational boundaries, substantially reducing the uncertainty of free-form reasoning. The fact that table knowledge is not easily “stressed away” points to where the bottleneck lies: not in missing knowledge, but in the ability to combine structural parsing with conditional constraints during execution.

Combining all three dimensions, the capability landscape of current models follows a decreasing pattern of “Table > Text > KG.” However, the reversal in K/R-Stress performance between Text and KG reasoning is a reminder that a single score can easily mask structural weaknesses, and that true reasoning robustness requires a comprehensive assessment under multiple pressure combinations.

Dynamic OneEval vs. Static Benchmarks

A key question is: Just how difficult are the Dynamic OneEval questions?

We compared the results on Dynamic OneEval with performance on the seed dataset. Taking DeepSeek-V3.2 as an example, its performance on Dynamic OneEval was significantly lower than on the original seed dataset: it dropped from 80% to 30% on text reasoning and from 70% to 38% on KG reasoning. This performance drop demonstrates that Dynamic OneEval exposes deep vulnerabilities in models’ knowledge reasoning by preserving and reproducing real failure patterns.
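The size of this drop can be expressed both in percentage points and relative to the seed score. The short sketch below uses only the DeepSeek-V3.2 figures quoted above; the helper name `drop` is an illustrative choice.

```python
# (seed dataset score, Dynamic OneEval score) in percent for DeepSeek-V3.2.
seed_vs_dynamic = {
    "text": (80.0, 30.0),
    "kg": (70.0, 38.0),
}


def drop(seed: float, dynamic: float):
    """Return (absolute drop in percentage points, relative drop in percent of the seed score)."""
    absolute = round(seed - dynamic, 1)
    relative = round(100 * (seed - dynamic) / seed, 1)
    return absolute, relative


for task, (seed, dynamic) in seed_vs_dynamic.items():
    absolute, relative = drop(seed, dynamic)
    print(f"{task}: -{absolute} points ({relative}% of the seed score lost)")
```

On these figures, text reasoning loses 50 points (62.5% of the seed score) and KG reasoning loses 32 points (about 45.7%).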