How Can a Model Trained on 200M Real Tokens Match One Trained on 360M?

Pre-training data is running dry—the growth rate of computational resources far outpaces the growth rate of available web text. When data becomes the bottleneck, how can we extract more value from limited real data? A study from Stanford University offers a surprising answer: by "stitching" multiple synthetic variants of the same document into a single ultra-long "megadoc," data efficiency can improve from 1.48x to 1.80x.

The Starting Point: Can Synthetic Data Improve Modeling of the Original Distribution?

The paper addresses a core question: when pre-training is limited by data volume rather than compute, can synthetic data augmentation reduce the model's validation loss (i.i.d. loss) on the original web text distribution? The key here is "original distribution"—since synthetic data comes from a completely different distribution, does it actually help in modeling the original data?

The experimental setup is clear: using 200M real tokens (from 164,000 DCLM documents), the researchers trained an over-parameterized 300M parameter autoregressive Transformer with a context length of 4096. The goal was to train the best possible model under infinite training compute. The synthetic data generator used was Llama 3.1 8B Instruct.

Step One: Simple Rephrasing Brings Significant Gains

The paper first tested the simplest synthetic data method: generating multiple English Wikipedia-style rephrasings for each real document, with a temperature of 1 and a maximum generation length of 1024 tokens. On average, each rephrased document was 708 tokens, shorter than the average length of 1243 tokens for the original DCLM documents.


[Figure 2: Scaling Synthetic Generation Quantity] The left graph shows the trend of monotonic decrease in i.i.d. validation loss as the number of rephrasings (G) per document increases, approaching a plateau at 32 rephrasings. The right graph shows that the accuracy improvement in downstream benchmarks is consistent with the loss improvement trend.

During training, the paper divided the data into two streams: a real data stream and a synthetic data stream (containing G×D rephrasings plus D original documents). Training was conducted by carefully adjusting the mixing ratio and the number of epochs. The baseline model achieved an i.i.d. loss of 3.55 on 200M real tokens, while after mixing in 32 rephrasings, the loss dropped to 3.41, corresponding to a 1.48x data efficiency improvement. The average accuracy on downstream tasks (PIQA, SciQ, ARC Easy) improved by 5%.
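The two-stream setup can be sketched as a per-step scheduling decision: each training batch is served from either the synthetic stream or the real stream according to a tunable mixing ratio. This is a minimal illustrative sketch; the function name and the exact scheduling mechanism are assumptions, not the paper's implementation.

```python
import random

def mixed_batch_schedule(num_steps, synthetic_ratio, seed=0):
    """For each training step, decide whether the batch comes from the
    synthetic stream or the real stream.  `synthetic_ratio` is the fraction
    of steps served from the synthetic stream (a hypothetical knob; the
    paper tunes a mixing ratio but does not publish this exact API)."""
    rng = random.Random(seed)
    return ["synthetic" if rng.random() < synthetic_ratio else "real"
            for _ in range(num_steps)]

schedule = mixed_batch_schedule(num_steps=10_000, synthetic_ratio=0.75)
frac = schedule.count("synthetic") / len(schedule)
```

In practice the mixing ratio interacts with the epoch counts on each stream, which is exactly the knob the megadoc methods later exploit.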

However, a problem emerged: the loss curve clearly plateaued around 32 rephrasings, and continuing to increase the generation count yielded minimal returns.

Core Innovation: From "Multiple Short Documents" to "One Megadoc"

The standard rephrasing method treats all synthetic documents as independent samples and shuffles them randomly for training, but this ignores an important structure—multiple synthetic variants behind the same real document are highly correlated. The paper proposes a new perspective: concatenating multiple synthetic generations of the same document into a super-long "megadoc".


[Figure 3: Synthetic Data Flow of Megadocs] Simple rephrasing randomly arranges all generated results and real documents; Stitched Rephrasing concatenates all rephrasings of the same real document, with the real document placed at the beginning or end; Latent Thoughts inserts a reasoning chain connecting the prefix and suffix at fixed segmentation points of the document.

The paper proposes two methods for constructing megadocs:

(1) Stitched Rephrasing: Inspired by In-context Pre-training (ICPT), this method concatenates the G rephrasings of a real document together with the original text, separated by EOS tokens, into one long document. Unlike ICPT, which requires embedding the entire corpus and traversing it for high-similarity neighbors, stitching synthetic data costs almost nothing, because it is known in advance which generations are related to each other. Experiments found that placing the real document at the end of the megadoc works best. The paper speculates this relates to the idea that "reverse tasks are harder but more valuable": reconstructing the more detailed real document from its simplified rephrasings forces the model to learn more transferable structure.
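The stitching construction itself is a few lines of string manipulation. Below is a minimal sketch; the EOS string and the `original_last` flag are assumptions chosen to mirror the paper's finding that the real document belongs at the end.

```python
def stitch_megadoc(rephrasings, original, eos="<|endoftext|>", original_last=True):
    """Concatenate G rephrasings of one real document, plus the original,
    into a single megadoc with EOS separators (Stitched Rephrasing sketch)."""
    if original_last:
        parts = list(rephrasings) + [original]
    else:
        parts = [original] + list(rephrasings)
    return eos.join(parts)

rephrasings = ["Paris is the capital of France.",
               "France's capital city is Paris."]
original = "Paris, the capital of France, is a major European city."
megadoc = stitch_megadoc(rephrasings, original)
```

Because the grouping key (which generations share a source document) is known at generation time, this replaces ICPT's expensive nearest-neighbor search with a free lookup.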

(2) Latent Thoughts: Inspired by the Latent Thoughts method, this divides each document into G+1 equal-length fragments. At each segmentation point, the generator produces a rationale that derives the suffix from the prefix, which is wrapped in tags and inserted into the original text. On average, each thought segment is only 424 tokens, shorter than the average rephrasing of 708 tokens.
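The Latent Thoughts construction can likewise be sketched over a tokenized document: split it into G+1 roughly equal fragments and insert a tagged rationale at each boundary. The tag strings and the `thoughts` argument (precomputed rationales; in the paper these come from the generator) are assumptions for illustration.

```python
def insert_latent_thoughts(tokens, thoughts, open_tag="<thought>", close_tag="</thought>"):
    """Split `tokens` into len(thoughts)+1 equal-length fragments and insert
    a tagged rationale at each segmentation point (Latent Thoughts sketch)."""
    g = len(thoughts)                       # number of segmentation points
    size = len(tokens) // (g + 1)
    out = []
    for i in range(g + 1):
        start = i * size
        end = (i + 1) * size if i < g else len(tokens)  # last fragment keeps the remainder
        out.extend(tokens[start:end])
        if i < g:
            out.append(open_tag)
            out.extend(thoughts[i])
            out.append(close_tag)
    return out

doc = [f"tok{i}" for i in range(9)]
augmented = insert_latent_thoughts(doc, thoughts=[["why", "a"], ["why", "b"]])
# Dropping the tagged spans recovers the original document unchanged.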


[Figure 1: Enhancing Synthetic Data by Scaling Generation Quantity and Using Megadocs] The baseline model reaches a loss of 3.55. Simple rephrasing (orange) improves monotonically with generation count but plateaus. Stitched Rephrasing (blue) and Latent Thoughts (gray) not only achieve lower loss but also exhibit a weaker plateau effect as generation count increases.

Results: Data Efficiency Jumps from 1.48x to 1.80x


[Figure 5: Generation Scaling for Stitching and Latent Thoughts] Stitched Rephrasing and Latent Thoughts outperform simple rephrasing on i.i.d. loss, long-context loss, and downstream benchmark accuracy, with the magnitude of improvement expanding as generation count increases.

At 32 generations, Stitched Rephrasing achieved 1.64x data efficiency, and Latent Thoughts reached 1.80x data efficiency, both significantly better than the 1.48x of simple rephrasing. More importantly, the gains from the megadoc method continue to expand as generation count increases: the best improvement was 0.02 at 4 generations, growing to 0.07 at 32 generations. The average accuracy on downstream benchmarks improved by 6% and 9%, respectively.
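The data-efficiency multipliers translate directly into the "equivalent real tokens" framing of the title: a multiplier of k means the augmented model trained on D real tokens matches a baseline trained on roughly k×D real tokens. With D = 200M:

```python
real_tokens = 200_000_000

equivalents = {
    "simple rephrasing": round(real_tokens * 1.48),
    "stitched rephrasing": round(real_tokens * 1.64),
    "latent thoughts": round(real_tokens * 1.80),
}
# Latent Thoughts at 1.80x matches ~360M real tokens, the figure in the title.
```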

The effect is even more pronounced on long-context tasks. Tested on arXiv computer science papers, at 32 generations, loss improvements of 0.14 and 0.19 were achieved, respectively. The paper also verified that the megadoc method scales better than simple rephrasing even on short documents (under 600 tokens), indicating that the benefit is not limited to long contexts.

Why Does Megadoc Scale Better?

The paper attributes the advantage of megadocs to the combination of two effects: a constant-level loss improvement from the megadoc structure itself, and the ability to train for longer without overfitting. Specifically, Stitched Rephrasing allows increasing the number of epochs on real data from 16 to 32 and the mixing ratio from 0.75 to 0.9, increasing total training steps by approximately 5x without overfitting.


[Figure 6: Megadocs Benefit from More Training Steps on Real and Synthetic Data] When controlling for training steps, megadoc still shows improvement but no longer scales with generation count, indicating that longer training is the key source of its scaling advantage.

Combination with Ensembling Methods

The paper further tested whether synthetic data methods stack with ensembling, itself a potent data-efficiency method. Self-distillation does not combine with ensembling: the asymptotic loss of the ensembled self-distilled model (3.32) is nearly identical to that of standard ensembling (3.31). In contrast, simple rephrasing, Stitched Rephrasing, and Latent Thoughts can all be combined with ensembling, each improving the ensembling asymptotic loss by at least 0.12.


[Figure 7: Combination of Rephrasing and Ensembling] Self-distillation cannot improve the ensembling asymptote, whereas all three synthetic data augmentation methods can, indicating their gains are fundamentally different from ensembling and self-distillation.

Discussion

The paper used an external generator stronger than the student model. However, three pieces of evidence suggest the synthetic data gains are not purely a product of distillation: (1) When scaling the student model from 300M to 1.5B parameters, the loss improvement for all three methods grew larger, not smaller; (2) Prior work has shown that Llama 3.1 8B Instruct can self-improve through self-generated rephrasings; (3) Multiple studies have found that generator capability stops helping beyond a certain scale, suggesting rephrasing acts more as an augmentation method than as distillation.

The value of synthetic data lies not only in "creating more data" but also in constructing better learning tasks. From the simple perspective shift of stitching multiple independent documents into one megadoc, the paper demonstrates how, when data is limited, we can design synthetic data algorithms that continue to benefit as computational resources grow.

Original Title: Data-efficient pre-training by scaling synthetic megadocs

Original Link: https://arxiv.org/abs/2603.18534

