Category: Reinforcement Learning

SFT+RL Two-Stage Training Breaks Through LLM Self-Supervision! RUC DeepCritic Achieves Autonomous Evolution of AI Critique
R1-like Training No Longer Just Focuses on Result Correctness! CUHK Launches SophiaVL-R1 Model
The First Dedicated Multimodal Slow-Thinking Framework! Outperforms GPT-o1 by Nearly 7 Percentage Points, Reinforcement Learning Teaches VLM to "Think Twice"
10 Lines of Code, 15% Improvement in AIME24/25! Unveiling the Entropy Mechanism in Large Language Model Reinforcement Learning
Process Supervision > Outcome Supervision! Huawei and City University Reconstruct RAG Inference Training, 5k Samples Outperform 90k Model
Reviewing the Progress of RL-Reasoning
AI Learns Reasoning Solely by "Confidence": Zhejiang University Alumnus Replicates DeepSeek's Long Chain-of-Thought Emergence, Reinforcement Learning Needs No External Reward Signals
Peking University Alumna Lilian Weng's Latest Blog Post: Why We Think
Will the Vision of LSTM's Father from 22 Years Ago Come True? A Wave of AI 'Self-Evolution' Papers Released Within One Week. Is a New Trend Emerging?
AI Math Ability Skyrockets 100%, Self-Evolution Nears RL Limits! CMU's New Work Overturns Perceptions
First Explanation of How LLMs Reason and Reflect: Northwestern University & Google's New Framework Introduces Bayesian Adaptive Reinforcement Learning to Comprehensively Enhance Mathematical Reasoning
LLM + RL Questioned: Deliberately Using Incorrect Rewards Still Significantly Boosts Math Benchmarks, Causing a Stir in the AI Community
Summary! Multi-Turn Planning Techniques in 2025 for Large Language Model Agent RL Training
Qwen Team Releases Long-Context Reasoning Model QwenLong-L1, Surpassing o3-mini
Thinking with Images Only: Reinforcement Learning Forges a New Reasoning Model Paradigm, Maximizing Complex Scene Planning!
How Does Claude 4 Think? Senior Researchers Respond: RLHF Paradigm is Out, RLVR Proven in Programming/Mathematics
Large Models Break Go AI's "Black Box" for the First Time, Paving New Paths for Scientific Discovery! Shanghai AI Lab Releases New-Generation InternThinker
ZeroSearch (Alibaba Technology): Large Language Models Learn Through Self-Rewarding Without a Browser
A Model Trained on Idle Computing Power Worldwide Rivals R1, a Blow to Jensen Huang! Karpathy Once Invested in It
ZeroSearch: Zero-Search Reinforcement Incentivizes Model Potential, Ushering in a New Era for LLM Search Capability
Stanford's Weak-for-Strong (W4S): Harnessing Stronger LLMs with Meta-Agent, Accuracy Boosted to 95.4% | Latest
Can a single data point significantly enhance the mathematical reasoning performance of large models?
The 'era of experience' will unleash self-learning AI agents across the web—here's how to prepare
Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning
NVIDIA's Llama Nemotron Series: Key Technologies Explained