Category: Model Evaluation
- Agent-World: Scaling Real-World Environments for Co-Evolution of Agents and Environments!
- Claude 4.6 Only Scores 66%? Claw-Eval-Live Says: Fixing a Terminal ≠ Cross-System Capability
- Deep Dive: Reward Hacking in Claude Code Model RL Training
- How Much GDP Corresponds to the Tokens Burned by AI Models? AI's Economic Contribution Now Has a Number
- Meituan Quietly Launches New Model! Hands-On Test of the First Open-Source "Heavy Thinking" Model: 8-Way Parallel, Agent Capabilities Go Head-to-Head with Claude
- Google's Challenge: DeepSeek, Kimi and More to Compete in First Large Model Showdown Starting Tomorrow
- The More Reasoning, The More Hallucinations? The "Hallucination Paradox" of Multimodal Reasoning Models
- Apple's 'Illusion of Thinking' Paper Criticized Again: Paper Co-authored by Claude and a Human Points Out Its Three Key Flaws
- Google | Tracing RAG System Errors: Proposing a Selective Generation Framework to Boost RAG Accuracy by 10%