Is Synthetic Data Better Than Real Data?

When training large models with reinforcement learning, the biggest headache isn't the algorithm—it's the data. Expanding real programming problems from 25K to 81K yields diminishing returns early on; simple problems cause model entropy collapse, while difficult ones waste compute due to sparse rewards. A Meta paper offers a counterintuitive conclusion: 20K synthetic problems generated via a multi-round synthetic data pipeline, combined with multi-environment training, significantly outperform 25K real programming problems in out-of-domain generalization.

Why Scaling Real Data Fails

The paper first compared RL training on Qwen3-8B Base with 25K versus 81K real programming problems. The results were harsh: as policy entropy decreased, performance improvements plateaued early, and increasing the data volume did not yield proportional gains. The issue lies in the data distribution: real problem sets inevitably mix a large number of simple problems with a few extremely hard ones. Simple problems dominate training after providing their early gradient updates, while the extremely hard ones pose "hard exploration" challenges that initial models cannot solve, wasting compute.

[Figure 3: Scaling experiments using real data on Qwen3-8B Base] Comparing RL training with 25K vs. 81K real programming problems, performance gains plateaued early on both in-domain (LCB) and out-of-domain (Math500, AIME2024) benchmarks, indicating that merely scaling real data volume has limited returns.

Common curriculum learning strategies—training on simple problems first before gradually transitioning to harder ones—also fail on real data because there's often no meaningful progression between hard and simple problems. Moreover, consuming too much exploration budget on simple problems diminishes the model's ability to tackle harder challenges later.

Multi-Round Synthetic Data Pipeline: Dynamically Adapting Teachers to Students

The core solution proposed in the paper is a multi-round teacher-student synthetic data generation pipeline. Here's how it works:

Seed Sources: Two types: (1) Code snippets extracted from successfully solved real programming problems during initial RL training; (2) Randomly sampled 25-50 consecutive lines of code from the open-source code corpus starcoderdata as inspiration seeds.

Generation Process: Conducted over multiple rounds. In round one, the teacher model (GPT-OSS 120B in high-reasoning mode) generates a problem based on seed snippets and current RL environment rules, while the student model (same model in low-reasoning mode) attempts to solve it M=32 times. Starting from round two, the teacher receives the student's pass rate p and summaries of representative solutions, then adjusts problem difficulty accordingly—if pass rate exceeds 0.65, increase difficulty; if pass rate is 0, decrease difficulty. Each seed undergoes 6 iterative rounds, with each round based solely on the previous round's problem and student summary, not the full history.
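The adaptation loop above can be sketched in a few lines. This is an illustrative skeleton, not the paper's implementation: `teacher_generate`, `teacher_revise`, `student_solve`, and `summarize` are hypothetical stand-ins for the underlying LLM calls, and only the numbers stated in the paper (M=32 attempts, 6 rounds, the 0.65 and 0 pass-rate thresholds) are taken from the source.

```python
M_ATTEMPTS = 32          # student solution attempts per problem
N_ROUNDS = 6             # iterative rounds per seed
RAISE_THRESHOLD = 0.65   # pass rate above which the teacher hardens the problem

def generate_from_seed(seed, teacher_generate, teacher_revise,
                       student_solve, summarize):
    """Run the multi-round pipeline for one seed snippet.

    Returns the (problem, pass_rate) pair observed in each round.
    The callables are placeholders for teacher/student LLM calls.
    """
    problem = teacher_generate(seed)  # round one: seed snippet -> problem
    history = []
    for _ in range(N_ROUNDS):
        solutions = [student_solve(problem) for _ in range(M_ATTEMPTS)]
        pass_rate = sum(s.passed for s in solutions) / M_ATTEMPTS
        history.append((problem, pass_rate))
        summary = summarize(solutions, pass_rate)
        if pass_rate > RAISE_THRESHOLD:
            directive = "harder"          # student solves it too reliably
        elif pass_rate == 0.0:
            directive = "easier"          # hard-exploration problem; back off
        else:
            directive = "keep_difficulty"
        # Each round conditions only on the previous round's problem and
        # the student summary, not the full history.
        problem = teacher_revise(problem, summary, directive)
    return history
```

Because the teacher adapts purely through the prompt (the pass rate and solution summary), no gradient update to the teacher is ever needed.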

[Figure 1: Overview of the multi-round synthetic data pipeline] Seed snippets serve as the teacher's inspiration. In round one, the teacher generates initial problems and the student attempts solutions multiple times. In subsequent rounds, the teacher dynamically adjusts problem difficulty based on student performance summaries (pass rates and representative solutions). Invalid and duplicate generations are filtered and deduplicated before being added to the dataset.

[Figure 2: Example of multi-round data generation] In round one, the student's pass rate for problems generated by the teacher was 0.875. In round two, after observing student performance, the teacher generated harder variants, reducing the pass rate to 0.25.

The key advantage: the entire adaptation process relies purely on in-context learning, requiring no gradient updates to the teacher model. Compared to independent single-round sampling, multi-round generation increases the retention rate of valid problems by approximately 4 times, while naturally producing "stepping stone" variants of varying difficulties.

Design of Four RL Environments

The paper defines four RL environments: Induction (program synthesis: synthesize functions given input-output pairs), Abduction (input prediction: infer inputs given functions and outputs), Deduction (output prediction: infer outputs given functions and inputs), and Fuzzing (fuzz testing: find inputs that cause test functions to fail). Each environment has clear teacher generation rules and binary reward functions.
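The binary rewards for the four environments can be sketched as simple pass/fail checks. This is a minimal illustration under the assumption of an exec-based Python harness; the function names and error handling are mine, not the paper's actual grading infrastructure.

```python
def run(fn_src, fn_name, arg):
    """Execute a function definition string and call it on `arg`."""
    ns = {}
    exec(fn_src, ns)
    return ns[fn_name](arg)

def reward_induction(candidate_src, fn_name, io_pairs):
    # Program synthesis: reward 1 iff the synthesized function
    # reproduces every given input-output pair.
    try:
        return int(all(run(candidate_src, fn_name, x) == y for x, y in io_pairs))
    except Exception:
        return 0

def reward_abduction(fn_src, fn_name, predicted_input, target_output):
    # Input prediction: reward 1 iff the predicted input maps to the output.
    try:
        return int(run(fn_src, fn_name, predicted_input) == target_output)
    except Exception:
        return 0

def reward_deduction(fn_src, fn_name, given_input, predicted_output):
    # Output prediction: reward 1 iff the predicted output is correct.
    try:
        return int(run(fn_src, fn_name, given_input) == predicted_output)
    except Exception:
        return 0

def reward_fuzzing(test_fn_src, fn_name, candidate_input):
    # Fuzzing: reward 1 iff the candidate input makes the test function fail.
    try:
        run(test_fn_src, fn_name, candidate_input)
        return 0
    except Exception:
        return 1
```

Note how fuzzing inverts the usual reward: the student is paid for breaking the test function, not satisfying it.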

[Table 1: Overview of RL environments] Lists teacher generation methods, student solving tasks, and reward function definitions for all four environments.

Experiments: Effects of Synthetic Data Across Models and Scales

The paper conducted systematic experiments on three models: Llama3.1-8B Instruct, Qwen3-8B Base, and Qwen2.5-32B Base. Evaluation benchmarks included in-domain LiveCodeBench (LCB, 454 problems) and out-of-domain Math500 and AIME2024.

Synthetic data augmentation shows significant effects. Keeping total training budgets unchanged, augmenting 25K real problems with 20K synthetic ones led to faster and more stable convergence across all three models on in-domain code benchmarks. Llama3.1-8B Instruct and Qwen2.5-32B Base also showed improvements on out-of-domain math benchmarks. Synthetic augmentation even surpassed the 81K real data baseline on most LCB metrics.

[Figure 4: Synthetic data augmentation experiments on Llama3.1-8B Instruct] Synthetic augmentation outperformed the baseline using only 25K real data on both in-domain (LCB) and out-of-domain (Math500, AIME2024) benchmarks.

Pure synthetic data is also competitive. Models trained solely on synthetic problems matched real data performance on LCB, but attention must be paid to difficulty distribution—without explicit regulation, datasets tend toward simple problems, leading to overfitting on easy tasks.

Difficulty, Curriculum, and Environmental Diversity: Three Key Dimensions

Regarding difficulty, the paper categorized problems into easy, medium, and hard tiers based on student pass rates. Training exclusively on medium-difficulty problems achieved the best balance between convergence speed and generalization; easy problems caused overfitting, while hard problems converged extremely slowly due to sparse rewards.
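The tiering described above amounts to bucketing problems by student pass rate. A minimal sketch follows; the specific cutoffs (above 0.65 counts as easy, exactly 0 as hard) echo the pipeline's adjustment thresholds and are an assumption, since the paper's exact tier boundaries are not quoted here.

```python
def difficulty_tier(pass_rate):
    """Bucket a problem by the student's empirical pass rate.

    Cutoffs are illustrative assumptions, not the paper's exact values.
    """
    if pass_rate > 0.65:
        return "easy"    # solved too reliably; risks overfitting
    if pass_rate == 0.0:
        return "hard"    # sparse reward; extremely slow convergence
    return "medium"      # best balance of convergence and generalization

def filter_medium(problems_with_rates):
    """Keep only medium-difficulty problems for the training set."""
    return [p for p, r in problems_with_rates if difficulty_tier(r) == "medium"]
```

Under this scheme, training on the output of `filter_medium` corresponds to the medium-only setting that the paper found best.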

[Figure 11: RL training of Qwen3-8B Base across different difficulties] Medium-difficulty training outperformed both easy and hard on LCBv5-medium and LCBv5-hard segments.

Regarding curriculum design, traditional easy-to-hard curricula actually led to overfitting on easy segments. The paper experimented with reverse curricula (starting from medium or hard problems), finding that starting from medium achieved faster convergence and lower cross-seed variance, while starting from hard significantly increased variance. Notably, whether reverse curricula offer any advantage over training only on medium-difficulty data requires further verification.

Regarding environmental diversity, this is one of the paper's most striking findings. Distributing the 20K problem budget equally across four environments (5K each) versus allocating all to a single induction environment: multi-environment settings achieved significant improvements on out-of-domain benchmarks, higher in-domain pass@10, and avoided overfitting on easy segments. This trend was also verified on Llama3.1-8B Instruct—20K multi-environment synthetic problems even outperformed 25K real programming problems.

[Figure 14: Scaling experiments on number of RL environments in Qwen3-8B Base] Distributing data budgets across four environments improved both out-of-domain generalization and in-domain pass@10 compared to single-environment training.

Final Thoughts

The paper acknowledges several limitations: benefits of stepping-stone structures vary inconsistently across different curriculum strategies; gradient interference exists between difficulty levels during mixed-difficulty training; current data generation pipelines are decoupled from RL training, meaning teachers don't learn from actual students' online errors. The authors anticipate further improvements by incorporating teachers into the training loop for real-time adaptation to student weaknesses.

The core insight is clear: the bottleneck in RL post-training lies not in data scale, but in data structure, difficulty distribution, and environmental diversity. Multi-round synthetic pipelines offer a practical path to scalability, while "number of environments" as an independent scaling axis may be an underappreciated performance lever.

Original Title: A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula

Original Link: https://arxiv.org/abs/2603.24202
