The Nemotron-Cascade 2, a Mixture-of-Experts model with only 3B active parameters, achieved a gold-medal score of 35 out of 42 at the 2025 International Mathematical Olympiad (IMO). It also secured a gold medal with 439.28 points at the International Olympiad in Informatics (IOI) and solved 10 of 12 problems at the ICPC World Finals. Such performance was previously believed attainable only by closed-source models with hundreds of billions of parameters. With a meticulously designed post-training pipeline, NVIDIA's Nemotron-Cascade 2 demonstrates that small models can achieve remarkable intelligence density.
Cascade Reinforcement Learning: Tackling RL Environments One by One
The core methodology of Nemotron-Cascade 2 is Cascade RL, which conducts reinforcement learning (RL) training sequentially by domain rather than mixing all tasks together. Inherited from its predecessor, Nemotron-Cascade 1, this framework offers three key advantages: (1) strong resistance to catastrophic forgetting across RL stages, so previously acquired capabilities barely degrade; (2) the ability to tune hyperparameters and training curricula independently for each stage; (3) more uniform response lengths and verification times within a domain, which substantially reduces compute.
[Figure 2: Flowchart of Nemotron-Cascade 2 performing sequential cascade RL training by domain after SFT] The paper illustrates the complete training workflow starting from Supervised Fine-Tuning (SFT), followed sequentially by IF-RL, Multi-domain RL, MOPD, RLHF, Long-Context RL, Code RL, and finally SWE RL. Each stage delivers significant improvements in its corresponding domain.
Specifically, the Cascade RL pipeline executes in the following order: First, IF-RL (Instruction Following RL) establishes basic instruction-following capabilities. This is followed by Multi-domain RL, which simultaneously improves performance on STEM multiple-choice questions, agent tool calling, and structured outputs. Next comes MOPD (Multi-domain On-Policy Distillation), followed by RLHF for human preference alignment. The process continues with Long-Context RL and Code RL, and concludes with SWE RL to handle software engineering agent tasks.
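The staged order above can be sketched as a simple sequential driver loop. This is purely illustrative: the stage names come from the paper, but `train_stage` and `run_cascade` are hypothetical placeholders, not the authors' code.

```python
# Illustrative driver for the cascade stage ordering; only the ordering
# itself is from the paper. train_stage is a hypothetical placeholder.
CASCADE_STAGES = [
    "IF-RL",            # instruction following
    "Multi-domain RL",  # STEM MCQ, tool calling, structured output
    "MOPD",             # multi-domain on-policy distillation
    "RLHF",             # human preference alignment
    "Long-Context RL",
    "Code RL",
    "SWE RL",           # software-engineering agent tasks
]

def train_stage(model, stage):
    # Placeholder: real per-stage RL training (e.g., GRPO) would go here.
    return model

def run_cascade(model, stages=CASCADE_STAGES):
    """Train one RL environment at a time, in the fixed cascade order."""
    for stage in stages:
        model = train_stage(model, stage)
    return model
```

The key design point is that each stage sees only its own domain's data and reward, which is what lets response lengths and verification times stay uniform within a stage.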
The stage order is not fixed a priori; it is determined empirically from model behavior, with the guiding principle of finding an arrangement that minimizes cross-domain negative interference. For instance, IF-RL can impair human-alignment capabilities (such as ArenaHard scores), whereas subsequent RLHF has a negligible impact on instruction following; therefore, IF-RL is scheduled earlier.
The entire training process uses the GRPO (Group Relative Policy Optimization) algorithm in a strictly on-policy regime: each step generates rollouts with the current policy and performs a single gradient update, so the importance-sampling ratio is identically 1, and the KL divergence term is removed entirely.
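The group-relative part of GRPO can be sketched in a few lines: each prompt gets a group of sampled responses, and each response's reward is normalized by the group's mean and standard deviation to form its advantage. This is a minimal sketch of the standard GRPO advantage computation, not the paper's implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's group of G rollouts:
    normalize each reward by the group mean and (population) std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero if all rewards equal
    return [(r - mu) / sigma for r in rewards]

# Strictly on-policy: one rollout batch -> one gradient step, so the
# importance ratio pi_theta / pi_old is identically 1 and no KL penalty
# is applied; the per-token loss reduces to -advantage * log pi_theta.
```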
Multi-Domain On-Policy Distillation: Using Optimal Checkpoints from the Training Process as Teachers
Although Cascade RL significantly reduces catastrophic forgetting, performance fluctuations still exist as the number of RL environments increases. The paper introduces MOPD as a critical stabilization phase. Its core concept is to select the best-performing checkpoint for each benchmark category from the various stages of Cascade RL to serve as domain teacher models for on-policy distillation to the student model.
[Figure 3: Training dynamics and downstream evaluation] The paper compares the convergence speed of MOPD and GRPO on AIME25. Under identical mathematical training settings, GRPO improved from 89.9 to 91.0 over 25 steps, whereas MOPD reached 92.0 and recovered to the teacher's level within 30 steps, demonstrating a significant advantage in training efficiency.
The appeal of MOPD lies in three aspects: Teacher models are selected directly from the Cascade RL pipeline without introducing external models; all teachers share the same tokenizer, reducing distribution shift; and MOPD provides dense token-level training signals, far superior to the sparse sequence-level rewards of GRPO. On ArenaHard v2, MOPD improved the Hard Prompt score from 71.5 to 85.5 in just 52 steps, whereas RLHF required 160 steps to reach only 80.7.
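The dense token-level signal can be made concrete with a small sketch. In on-policy distillation, the student samples a sequence and the teacher scores every token, so each position contributes a loss term rather than one scalar reward per sequence. The paper's exact divergence is not specified here; this sketch assumes a per-token reverse-KL Monte-Carlo estimate on student-sampled tokens.

```python
def token_level_distill_loss(student_logprobs, teacher_logprobs):
    """Monte-Carlo estimate of per-token KL(student || teacher) on a
    sequence sampled from the student (on-policy). Every token position
    contributes a signal, unlike a single sequence-level RL reward."""
    assert len(student_logprobs) == len(teacher_logprobs)
    diffs = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(diffs) / len(diffs)
```

Because the student and all teachers come from the same Cascade RL pipeline and share a tokenizer, these per-token log-probabilities are directly comparable position by position.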
[Table 3: Comparison of MOPD and RLHF on ArenaHard V2.0] Under matched evaluation checkpoint conditions, MOPD achieved higher scores on both Hard Prompt and Creative Writing in fewer steps.
SFT Data: Large-Scale Distillation Covering Ten Major Domains
The SFT phase injects foundational capabilities into the model. Nemotron-Cascade 2's SFT data covers ten major domains: Mathematics (including 1.8 million tool-invocation samples and 2.6 million non-tool samples), Code Reasoning (1.9 million Python and 1 million C++14 reasoning trajectories), Science (2.7 million samples), Long-Context (234,000 samples), General Dialogue (approximately 10 million samples), Instruction Following, Safety, Dialogue Agents, SWE Agents, and Terminal Agents. All samples are packed into sequences of up to 256K tokens, with optimal performance achieved after approximately 1.5 epochs of single-stage training.
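Packing many variable-length samples into fixed-budget sequences, as described above, is commonly done greedily. The paper does not specify its packing algorithm; this is a minimal first-fit sketch assuming samples are already tokenized and no single sample exceeds the budget.

```python
def pack_sequences(samples, max_len=256_000):
    """Greedily pack tokenized samples into bins of up to max_len tokens
    (256K in the paper). Assumes no single sample exceeds max_len."""
    packed, current, current_len = [], [], 0
    for tokens in samples:
        if current_len + len(tokens) > max_len and current:
            packed.append(current)   # close the full bin
            current, current_len = [], 0
        current.append(tokens)
        current_len += len(tokens)
    if current:
        packed.append(current)
    return packed
```

In practice, packed sequences are trained with attention masks (or position-ID resets) so that samples sharing a bin do not attend to one another.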
Competition-Level Performance and Comprehensive Benchmark Leadership
[Table 1: Main Results] Nemotron-Cascade-2-30B-A3B surpasses Qwen3.5-35B-A3B and the larger Nemotron-3-Super-120B-A12B across multiple categories including mathematics, code reasoning, alignment, and instruction following.
[Table 2: IMO 2025, IOI 2025, and ICPC World Finals Performance] The model achieved full marks on the first five problems of IMO 2025 (7 points each for P1-P5), obtained a gold medal score of 439.28/600 at IOI 2025, and solved 10 out of 12 problems at the ICPC World Finals.
In mathematics, it reached 92.4 on AIME 2025 (98.6 with tool use) and 94.6 on HMMT Feb25. In code reasoning, it achieved 87.2 on LiveCodeBench v6, with an estimated Codeforces Elo rating of 2320. Notably, it became the first small model to achieve a non-zero pass rate on LiveCodeBench Pro Hard. In alignment tasks, it scored an average of 83.5 on ArenaHard v2 and 82.9 on IFBench. On the 1M-token NIAH test, it reached 99.0.
[Table 6: Comprehensive Benchmark Results for Competitive Programming] The paper compares Nemotron-Cascade-2 with numerous baseline models, including DeepSeek-V3.2-Speciale and GPT-OSS-120B. After integrating tool-based reasoning, the model's performance matches top-tier open-source models with over 300B total parameters.
The model still lags behind Qwen3.5-35B-A3B in knowledge-intensive and agent tasks, indicating that stronger knowledge pre-training and agent RL are future directions for improvement. Notably, Nemotron-Cascade-2 shares the same pre-training base as Nemotron-3-Nano-30B-A3B but surpasses it on almost all benchmarks, directly proving the effectiveness of the Cascade RL plus MOPD training pipeline.
Nemotron-Cascade 2 has fully open-sourced its model weights, training data, and methodological details. As the arms race for large models continues, this work shows that advances in post-training methodology can let small models unleash capabilities far beyond their parameter count: reaching the IMO gold-medal line with only 3B active parameters. This may be one of the most cost-effective AI inference solutions of 2025.
Original Title: Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Original Link: https://research.nvidia.com/labs/nemotron/files/Nemotron-Cascade-2.pdf