Category: Reinforcement Learning

When Async Agentic RL Meets 'Amnesia for Old Policies': Rethinking Off-Policy Correction
ICML 2026 | Teaching Multimodal Large Models to Think with Time: Peking University and Huawei Team Open-Source the TaRO Framework
The Flawless 'Flaw': Qwen and Fudan University Uncover Structural Dilemmas in Coding Agent Reward Design
Community Submission | Bailin Ling & Ring 2.6 Technical Report Released: Efficient Trillion-Parameter Models for Real-World Agent Workflows
Models Are Too Fond of Cheating! Cursor Reveals the Inside Story of Composer 2's Reinforcement Learning: Models Can Detect 'Fake Environments', and Floating-Point Non-Determinism Is a Fatal Flaw in RL Training
OpenAI Post-Training Lead: AI Isn't Suddenly Stronger, It Just Crossed a Threshold
Why Agent Training Always Crashes on Long-Horizon Tasks
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies in LLM Reinforcement Learning
OpenAI's Jiayi Weng: Beyond Gradients, Is the Next AI Training Paradigm on the Horizon?
Anthropic's Latest Research: How to Completely Eliminate Claude's Blackmailing Behavior
Token-Level, Precise Length Control: 3B Model Beats GPT 5.4 and Claude
Agent-World: Scaling Real-World Environments for Co-Evolution of Agents and Environments!
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Z Tech | In Conversation with Zihan Wang: Leaving DeepSeek, and the Reverse Thinking That Defined My Journey
Deep Dive: Reward Hacking in Claude Code Model RL Training
Li Fei-Fei's Team Is Tackling This: From Entropy to Mutual Information, RAGEN-2 Reshapes Reasoning Quality Standards, Preventing AI Agents from Becoming 'The More Trained, the More Templated'
Make Thinking More Accurate and Extended! The New Reinforcement Learning Algorithm FIPO Arrives
ASI-Evolve: AI Accelerates AI
Is Synthetic Data Better Than Real Data?
SortedRL: Accelerating Large Model RL Training by 50% and Boosting Efficiency by 18%
Lin Junyang Speaks Out for the First Time After Leaving Alibaba: Reviewing Qwen's Detours, Pointing to AI's New Path
Let AI 'Refine' Its Own Data! DataChef Goes Open Source: Using Reinforcement Learning to Automatically Generate LLM Data Recipes
NVIDIA Nemotron-Cascade 2 Technical Report
ICLR 2026 | How Far Can Unsupervised Reinforcement Learning Go for Large Models? A Systematic Answer from the Tsinghua Team