Category: AI Evaluation

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Is Agentic RAG Worth It? A Four-Dimensional Real-World Test Reveals the Answer!
From 'LLM-as-a-Judge' to 'Agent-as-a-Judge': A Review of the Three-Stage Evolution of AI Evaluation Paradigms
0% Pass Rate! The Code Myth Debunked! LiveCodeBench Pro Released!
Comprehensive Evaluation of 12 Latest GraphRAG Techniques
ICML 2025 | Bursting the AI Bubble with 'Human Testing Methods': Building a New Paradigm of Capability-Oriented Adaptive Assessment
Can LLMs Understand Math? Latest Research Reveals Fatal Flaws in Large Models' Mathematical Reasoning
AI's Second Half: From Algorithms to Utility