Category: Model Evaluation

Ask, Don’t Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement
Latest Discovery: Large AI Models Know When They're Being Evaluated
Agent-World: Scaling Real-World Environments for Co-Evolution of Agents and Environments!
Claude 4.6 Only Scores 66%? Claw-Eval-Live Says: Fixing a Terminal ≠ Cross-System Capability
Deep Dive: Reward Hacking in Claude Code Model RL Training
How Much GDP Corresponds to the Tokens Burned by AI Models? AI's Economic Contribution Now Has a Number
Meituan Quietly Launches New Model! Hands-On Test of the First Open-Source "Heavy Thinking" Model: 8-Way Parallel, Agent Goes Head-to-Head with Claude
Google's Challenge: DeepSeek, Kimi and More to Compete in First Large Model Showdown Starting Tomorrow
The More Reasoning, The More Hallucinations? The "Hallucination Paradox" of Multimodal Reasoning Models
Apple's 'Illusion of Thinking' Paper Criticized Again: Paper Co-authored by Claude and a Human Points Out Its Three Key Flaws
Google | Tracing RAG System Errors: Proposing a Selective Generation Framework to Boost RAG Accuracy by 10%