Category: Benchmarking
- OpenAI's Jiayi Weng: Beyond Gradients, Is the Next AI Training Paradigm on the Horizon?
- Claude 4.6 Only Scores 66%? Claw-Eval-Live Says: Fixing a Terminal ≠ Cross-System Capability
- QuantCode-Bench: A Benchmark for Evaluating LLM-Generated Quant Code Quality
- FrontierSWE
- KARL: Knowledge Agents based on Reinforcement Learning
- Can Models Truly "Reflect on Code"? Beihang University Releases a Repository-Level Understanding and Generation Benchmark, Refreshing the Paradigm for Evaluating LLM Understanding
- Multimodal Large Models Fail Across the Board, GPT-4o Reaches Only a 50% Safety Pass Rate: SIUO Reveals Cross-Modal Safety Blind Spots
- The "Olympics" of AI? OpenAI Releases New Benchmark MRCR, Pushing Models' "Needle in a Haystack" Ability to the Limit!