Category: Reinforcement Learning
- Models Are Too Fond of Cheating! Cursor Reveals the Inside Story of Composer 2's Reinforcement Learning: Models Can Detect 'Fake Environments', and Floating-Point Non-Determinism Is a Fatal Flaw in RL Training
- OpenAI Post-Training Lead: AI Isn't Suddenly Stronger, It Just Crossed a Threshold
- Why Agent Training Always Crashes on Long-Horizon Tasks
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies in LLM Reinforcement Learning
- OpenAI's Jiayi Weng: Beyond Gradients, Is the Next AI Training Paradigm on the Horizon?
- Anthropic's Latest Research: How to Completely Eliminate Claude's Blackmailing Behavior
- Token-Level Precise Length Control: 3B Model Beats GPT 5.4 and Claude
- Agent-World: Scaling Real-World Environments for Co-Evolution of Agents and Environments!
- Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
- Z Tech | In Conversation with Zihan Wang: Leaving DeepSeek, and the Reverse Thinking That Defined My Journey
- Deep Dive: Reward Hacking in Claude Code Model RL Training
- Li Fei-Fei's Team Is Tackling This: From Entropy to Mutual Information, RAGEN-2 Reshapes Reasoning Quality Standards, Preventing AI Agents from Becoming 'the More Trained, the More Templated'
- Make Thinking More Accurate and Extended! The New Reinforcement Learning Algorithm FIPO Arrives
- ASI-Evolve: AI Accelerates AI
- Is Synthetic Data Better Than Real Data?
- SortedRL: Accelerating Large Model RL Training by 50% and Boosting Efficiency by 18%
- Lin Junyang Speaks Out for the First Time After Leaving Alibaba: Reviewing Qwen's Detours, Pointing to AI's New Path
- Let AI 'Refine' Its Own Data! DataChef Goes Open Source: Using Reinforcement Learning to Automatically Generate LLM Data Recipes
- NVIDIA Nemotron-Cascade 2 Technical Report
- ICLR 2026 | How Far Can Unsupervised Reinforcement Learning Go for Large Models? A Systematic Answer from the Tsinghua Team
- Stop Obsessing Over Outcome Rewards! CUHK Identifies and Solves the 'Information Self-Locking' Problem in RL!
- Karpathy Just Open-Sourced AutoResearch: I Used It to Optimize Lobster Skills, Boosting Success Rates from 56% to 92%
- KARL: Knowledge Agents Based on Reinforcement Learning
- OpenClaw-RL: Allowing AI Agents to Self-Evolve Through Chat