This paper offers a data-centric perspective: larger models are stronger not only because they can represent more, but also because they better retain long-tail tasks.
Why are large models stronger than small models?
More parameters, more data, more compute: model capabilities rise accordingly. This has been the consistent experience of large-model development over the past few years.
The harder question is: what exactly do large models learn more of? Are they tasks that small models simply cannot represent, or tasks that small models can represent but struggle to stably learn during pretraining? If given more data and longer training, would small models eventually catch up?
A new paper from Stanford, Harvard, MIT, Anthropic, and other institutions provides a more concrete explanation:
Large models' advantage is not just stronger expressivity, nor just higher sample efficiency.
Often, small models don't completely fail to learn; rather, they cannot retain those low-frequency, complex task signals during mixed-data training.
Paper Title:
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Paper URL:
https://arxiv.org/abs/2605.29548
The extra capacity of large models reduces the overwriting of low-frequency tasks by high-frequency tasks, allowing weak signals from rare tasks to survive subsequent training instead of being immediately washed away.
Data Scaling Can Partially Close the Gap, But Another Part Requires Model Scaling
What Extra Capabilities Do Large Models Gain?
The paper first breaks "large models are stronger" into two scenarios.
One type of gap can be closed by data scaling. Small models underperform large models under limited compute, but if data or training resources continue to increase, they theoretically still have a chance to catch up. Here, large models are essentially learning faster and more sample-efficiently.
The other type of gap can only be closed by model scaling. Even in the infinite-data limit, small models cannot reach the loss that large models achieve under finite compute. This means there is a portion of the training distribution that small models struggle to learn under the same training conditions.
In the same mixed training data, which tasks get learned first, and which get squeezed out?
Tasks in real pretraining corpora are not uniform. Behind the language modeling objective lie numerous subtasks: some high-frequency, some low-frequency; some simple, some requiring more structure to generalize. When model capacity is limited, these tasks compete for the same representation resources.
What Does the Model Learn First?
The authors constructed a synthetic multi-task regression experiment. Each task has two key attributes: occurrence frequency and complexity.
The more frequently a task appears in the data, the greater its impact on overall loss. Complexity is characterized through the task's feature spectrum. Slower spectral decay means the task requires more feature directions to learn well, making it harder for a limited-width model to fully retain.
The core ranking rule is:
$$\text{utility} = f \cdot \lambda$$
where f is the task's frequency and λ is the importance of a feature direction within that task. Their product is the utility of that feature.
When model width is d, the model prioritizes retaining the top d features with highest utility. The model does not allocate capacity uniformly; it prioritizes features that most reduce overall loss.
Therefore, high-frequency and low-complexity tasks are easier to learn. Low-frequency, complex tasks are not necessarily inexpressible, but they rank lower in resource competition. Once a small model's capacity is filled with high-utility features, long-tail tasks struggle to enter the model's representation.
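The ranking rule can be illustrated with a minimal sketch. It assumes a hypothetical three-task mixture whose frequencies and spectra are invented for illustration and do not come from the paper; the function names are likewise made up here. Each feature gets utility f times λ, and a model of width d keeps the d highest-utility features.

```python
from collections import Counter


def retained_features(tasks, width):
    """Return the `width` highest-utility (task, feature_index, utility) triples.

    tasks maps a task name to (frequency, spectrum), where spectrum is a list
    of per-feature importances (the lambda values). Utility = frequency * lambda.
    """
    scored = [
        (name, i, freq * lam)
        for name, (freq, spectrum) in tasks.items()
        for i, lam in enumerate(spectrum)
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:width]


def features_per_task(tasks, width):
    """Count how many features of each task survive at a given width."""
    counts = Counter(name for name, _, _ in retained_features(tasks, width))
    return {name: counts.get(name, 0) for name in tasks}


# Hypothetical mixture (illustrative numbers, not from the paper).
tasks = {
    "common_simple": (0.70, [1.0, 0.1, 0.01]),   # frequent, fast spectral decay
    "common_complex": (0.25, [1.0, 0.8, 0.6]),   # frequent, slow spectral decay
    "rare_complex": (0.05, [1.0, 0.8, 0.6]),     # rare, slow spectral decay
}

for width in (2, 4, 6, 9):
    print(width, features_per_task(tasks, width))
```

In this toy setup the rare task has the same spectrum as a common one, yet none of its features is retained until the width is large enough for the higher-utility features of the common tasks to be covered first.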
As Model Width Increases, Low-Frequency Tasks Begin to Be Learned Step by Step
In the figure above, the authors trained models of different widths on a mixture of 32 regression tasks with varying frequencies. Results show that as model width increases, the model begins to retain lower-utility features and better learns low-frequency tasks. The experimental trend aligns with Theorem 3's utility ranking.
The extra parameters of large models give originally lower-ranked tasks a chance to enter the representation space.
Gradient Interference and Signal Retention
Low-frequency tasks appear rarely; for a model to learn them, it must retain existing signals across multiple rare sample occurrences.
When rare task samples appear, small model parameters do update toward this task. But before the next rare task appears, large amounts of high-frequency task samples continue updating the same parameters, quickly overwriting the just-written rare task signals.
The paper summarizes this dynamic as an update-forget cycle: a rare task appears once, the small model briefly writes relevant signals; high-frequency tasks continue training, signals gradually decay; by the time the rare task appears again, the model has nearly returned to square one.
When model width is sufficiently large, common tasks can be explained more thoroughly first. As common tasks' residual signals decrease, their gradient pull on parameters also weakens. Rare task updates are no longer so easily washed out, and the model can accumulate multiple low-frequency observations.
Theorem 4 provides the intuitive conclusion: the overall gradient of common tasks is controlled by residual signals. When common tasks are not yet well-learned, they continuously occupy update directions; the more thoroughly they are explained, the weaker the interference, and the more remaining capacity becomes available for rare tasks.
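The update-forget cycle can be caricatured with a toy simulation. This is not the paper's model: the scalar "signal", the function `simulate_rare_signal`, and all parameter values are assumptions made for illustration. The `forget_rate` parameter stands in for the pull of common-task gradients, which the text above says weakens as common-task residuals shrink.

```python
def simulate_rare_signal(forget_rate, write_strength, interval, steps):
    """Track a scalar rare-task signal in [0, 1] over training steps.

    Every `interval` steps a rare sample arrives and moves the signal a
    fraction `write_strength` of the way toward 1. On every other step a
    common-task update shrinks the signal by the factor (1 - forget_rate).
    """
    signal = 0.0
    history = []
    for step in range(steps):
        if step % interval == 0:
            signal += write_strength * (1.0 - signal)  # rare-task write
        else:
            signal *= 1.0 - forget_rate  # overwriting by common tasks
        history.append(signal)
    return history


# Assumed settings: the "small" model has large common-task residuals
# (strong forgetting), the "large" model has small ones (weak forgetting).
small = simulate_rare_signal(forget_rate=0.05, write_strength=0.2, interval=100, steps=1000)
large = simulate_rare_signal(forget_rate=0.001, write_strength=0.2, interval=100, steps=1000)

print("small model, signal at end:", round(small[-1], 4))
print("large model, signal at end:", round(large[-1], 4))
```

With strong forgetting the signal is back near zero before each new rare sample arrives, while with weak forgetting each write builds on what survived from the previous one.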
Only After Common Task Residuals Decline Do Rare Task Signals Stably Enter Representations
In the figure above, the small model still has large amounts of common task residual signals to explain, so rare task signals remain near random; when model width crosses the paper's predicted threshold, common task residuals decline, and rare tasks begin to be stably encoded.
Small Models Briefly Encode Rare Tasks Then Quickly Decay; Large Models Retain and Accumulate Signals
In the figure above, the authors kept the rare task's overall frequency constant but varied the interval between adjacent injections. Small models briefly encode the rare task after each injection and then rapidly lose it; large models retain more of the signal between injections and accumulate it gradually over training.
This means large models' advantage comes not only from being able to represent more content, but also from stronger ability to retain low-frequency task signals.
OLMo Pretraining Verification
The paper also validates this mechanism in the OLMo pretraining pipeline. The experiments train OLMo models at five scales (4M, 20M, 300M, 1B, and 4B parameters) on up to 210B tokens.
The pretraining corpus is Dolma v1.7. To control task frequency, the authors inject two tasks that are unlikely to appear naturally in standard pretraining data: the comparison task TCMP and the modular addition task TADD.
These are not simple memorization tasks. Each has 10K instances, split half train/half test. The comparison task requires learning global token ordering structure; the modular addition task requires capturing Fourier patterns. Test accuracy measures whether the model learned generalizable structure, not just memorized training samples.
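A minimal sketch of how a held-out split for a modular addition task might be built follows. The modulus of 100 is an assumption chosen only because it yields 10,000 (a, b) pairs; the article does not state the paper's modulus, instance format, or split procedure, and `make_modular_addition_split` is a hypothetical helper.

```python
import random


def make_modular_addition_split(modulus, seed=0):
    """Enumerate all (a, b, (a + b) % modulus) triples and split them 50/50.

    Test triples never appear in training, so test accuracy can only come
    from learning the addition structure rather than recalling seen pairs.
    """
    instances = [
        (a, b, (a + b) % modulus)
        for a in range(modulus)
        for b in range(modulus)
    ]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(instances)
    half = len(instances) // 2
    return instances[:half], instances[half:]


# Assumed modulus: 100 * 100 = 10,000 instances in total.
train, test = make_modular_addition_split(modulus=100)
print(len(train), len(test))
```

Because the two halves are disjoint, a model that only memorizes training triples would score at chance on the test half.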
In OLMo Pretraining, Larger Models Better Learn Low-Frequency Injected Tasks
Behavioral results match synthetic experiments: larger models better learn lower-frequency injected tasks; small models show higher training loss and lower test accuracy on low-frequency tasks.
The authors didn't stop at loss; they traced the effect further, down to the representation and gradient levels.
At the representation level, as model scale and task frequency increase, TCMP's global ordering features and TADD's Fourier features appear more clearly in the model's internal representations.
When Models Are Larger and Task Frequencies Higher, Task-Relevant Features More Clearly Enter Representation Space
At the gradient level, the authors focused on a group of task-relevant neurons during TCMP training runs, analyzing cosine similarity between batch gradients and task reference directions.
They then decomposed batch gradients into task-token gradients and non-task-token gradients.
Large Models' Non-Task Gradients Interfere Less with Task Directions
Results show that large models carry clearer task signals during task injection, with non-task token gradients barely interfering with task directions; small models are more prone to random collisions and interference.
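The gradient-level analysis can be sketched as follows, assuming per-token gradients (restricted to the neurons of interest) are available as plain vectors and that the batch gradient is their sum. The vectors, the reference direction, and the helper names are hypothetical; the article does not describe how the paper extracts these quantities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def decompose_batch_gradient(token_grads, is_task_token):
    """Sum per-token gradients into a task-token part and a non-task-token part."""
    dim = len(token_grads[0])
    task_grad = [0.0] * dim
    other_grad = [0.0] * dim
    for grad, is_task in zip(token_grads, is_task_token):
        target = task_grad if is_task else other_grad
        for i, g in enumerate(grad):
            target[i] += g
    return task_grad, other_grad


# Hypothetical values: one task token and two ordinary corpus tokens.
reference = [1.0, 0.0, 0.0]  # assumed task reference direction
token_grads = [[0.9, 0.1, 0.0], [0.0, 0.5, -0.5], [0.0, -0.4, 0.6]]
is_task_token = [True, False, False]

task_grad, other_grad = decompose_batch_gradient(token_grads, is_task_token)
batch_grad = [t + o for t, o in zip(task_grad, other_grad)]

print("task tokens vs reference:    ", cosine(task_grad, reference))
print("non-task tokens vs reference:", cosine(other_grad, reference))
print("full batch vs reference:     ", cosine(batch_grad, reference))
```

In this example the non-task part is orthogonal to the reference direction, which corresponds to the low-interference case the article attributes to large models; a non-task part with a sizable component along the reference would correspond to the interference it attributes to small models.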
Three layers of evidence point to the same conclusion: the larger the model, the less mutual overwriting between tasks.
Implications Beyond Scaling
This paper does not attribute scaling's advantages to a single cause. Large models certainly have stronger expressivity and often better sample efficiency.
The discussion section also emphasizes that this explanation is not a complete theory of scaling, but complements expressivity and sample efficiency.
What this paper truly supplements is another layer of the problem. In mixed-data training, capability is determined not only by whether the model can represent something, but also by whether gradient optimization can stably learn it from the current data distribution.
If the target capability is itself a low-frequency, complex task, scaling the model is not the only option. Adjusting data mixture ratios and increasing target task frequency may be more efficient than simply scaling the model. How to systematically reduce inter-task gradient interference remains for future research.
The paper also hints that memorization is not always a side effect in training. On rare tasks, it may be a prerequisite for the model to accumulate signals across batches and eventually learn abstract structure.
Large models are stronger than small models not just because they have more parameters and larger capacity. More specifically, they reduce head-to-head competition between common and rare tasks.
Those rare task signals that get briefly written then quickly washed out in small models may be exactly what large models truly learn more of.