Hardcore: Google's Jeff Dean Says the Bottleneck for Million-Chip LLM Pre-training Has Been Completely Broken!

Imagine you have a million-chip training cluster in which each chip fails about once a year. Sounds reliable? Across a million chips, though, the cluster as a whole sees a failure more than once a minute on average. Today's training method is: if one machine fails, everyone stops and waits. At a large enough scale, this is simply unsustainable.

Google's latest paper, Decoupled DiLoCo, proposes a different approach: Don't wait. Let everyone train on their own and merge asynchronously.

Jeff Dean, a key contributor to this work, is a Chief Scientist at Google and a core technical leader at Google DeepMind. He was instrumental in the development of BigTable, MapReduce, and TensorFlow.

In Dean's own words: "I was delighted to advise and guide the team that built the decoupled DiLoCo training system. It elegantly handles failures at large scale, allowing (N-1)/N units to continue when one fails."

The Fatal Flaw in Large Model Pre-training

Today's pre-training for large models relies on the SPMD (Single Program, Multiple Data) paradigm: all chips must stay strictly synchronized, and every step waits for everyone to arrive. The authors frame the problem with an analogy to the CAP theorem from distributed systems:

Consistency (C): All chips maintain perfectly synchronized model weights
Availability (A): Training continues even if hardware fails
Partition Tolerance (P): Training continues even if the network is unstable

The current approach is "consistency-first"—sacrificing availability and partition tolerance to keep all chips in lockstep. The result: A single machine failure stops the entire cluster in its tracks.

Figure: Comparison between resilient data parallelism and the decoupled approach

How often a cluster fails follows a simple formula: MTBF_cluster = MTBF_chip / N_chip. The more chips, the more fragile the cluster as a whole. With 1.5 million chips, each failing once a year, the formula gives a cluster-level failure roughly every 21 seconds on average.
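To get a feel for the formula, here is a minimal Python sketch that plugs in numbers. It assumes a 365-day year and that chip failures are independent with a constant rate (which is what makes the simple division valid); the function name `cluster_mtbf_seconds` is made up for illustration and does not come from the paper.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def cluster_mtbf_seconds(chip_mtbf_years: float, n_chips: int) -> float:
    """Apply MTBF_cluster = MTBF_chip / N_chip and return the result in seconds."""
    if n_chips <= 0:
        raise ValueError("n_chips must be positive")
    return chip_mtbf_years * SECONDS_PER_YEAR / n_chips


if __name__ == "__main__":
    # Each chip is assumed to fail once a year, as in the article's example.
    for n in (1_000, 100_000, 1_000_000):
        seconds = cluster_mtbf_seconds(1.0, n)
        print(f"{n:>9,} chips: one failure every {seconds:,.1f} s on average")
```

For a million chips this works out to about 31.5 seconds between failures, which matches the "more than once a minute" intuition from the opening example.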

Core Method: Splitting a Large Cluster into Independent "Learners"

Decoupled DiLoCo splits the entire training cluster into M independent learners. Each learner runs the AdamW optimizer locally on its own shard of data. Learners are completely isolated from one another and do not communicate directly.

A central syncer handles asynchronous aggregation: it doesn't wait for everyone to arrive. As soon as K learners have checked in (K can be set as low as 1), it begins merging the updates. The merging uses a token-weighted average—learners that have processed more data but taken fewer optimizer steps receive a higher weight (quality × quantity). There's also RDA (Radial-Direction Averaging), which averages the direction and magnitude of gradients separately, ensuring the gradient norm doesn't fluctuate wildly when merging different numbers of learners.
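The two merging ideas can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the weights here depend only on token counts (the step-count factor is left out), and "averaging direction and magnitude separately" is read as averaging unit vectors and norms independently and then recombining them. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def token_weighted_average(updates: Sequence[Vector], tokens: Sequence[int]) -> Vector:
    """Average learner updates, weighting each by the tokens it processed."""
    total = sum(tokens)
    dim = len(updates[0])
    return [sum(u[i] * t for u, t in zip(updates, tokens)) / total for i in range(dim)]


def radial_direction_average(updates: Sequence[Vector], tokens: Sequence[int]) -> Vector:
    """Assumed reading of RDA: average norms and unit directions separately."""
    total = sum(tokens)
    weights = [t / total for t in tokens]
    norms = [l2_norm(u) for u in updates]
    dim = len(updates[0])
    # Magnitude: token-weighted mean of the individual update norms.
    mean_norm = sum(w * n for w, n in zip(weights, norms))
    # Direction: token-weighted mean of unit vectors (zero updates are skipped).
    mean_dir = [
        sum(w * u[i] / n for w, u, n in zip(weights, updates, norms) if n > 0)
        for i in range(dim)
    ]
    dir_norm = l2_norm(mean_dir)
    if dir_norm == 0:
        return [0.0] * dim
    # Recombine: rescale the averaged direction to the averaged magnitude.
    return [mean_norm * x / dir_norm for x in mean_dir]


if __name__ == "__main__":
    updates = [[3.0, 0.0], [0.0, 3.0]]
    tokens = [1, 1]
    print(l2_norm(token_weighted_average(updates, tokens)))    # shrinks below 3
    print(l2_norm(radial_direction_average(updates, tokens)))  # stays at about 3
```

With two orthogonal updates of norm 3, the plain average shrinks to a norm of about 2.12, while the direction/magnitude split keeps the merged norm at 3. That is the kind of stability the article attributes to RDA when the number of merged learners varies.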

There is another clever design: a grace period. If the network has idle time, the syncer waits a bit longer for more learners to catch up, essentially trading spare bandwidth for better sample efficiency, without slowing down the overall pace.
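The "K learners, plus a grace period" rule can be modeled with a toy function. The sketch below assumes a fixed grace window in seconds (in the article the extra wait depends on how much idle network time is available) and uses hypothetical names such as `select_for_merge`; it only decides which learners make it into a merge, given when each one checked in.

```python
from typing import Dict, List, Tuple


def select_for_merge(
    arrivals: Dict[str, float], k: int, grace: float = 0.0
) -> Tuple[float, List[str]]:
    """Return (merge_time, learners_included) for one merge round.

    arrivals maps a learner name to the time it checked in.
    The quorum is reached when the k-th learner arrives; the syncer then
    waits `grace` more seconds and includes everyone who arrived by then.
    """
    if not 1 <= k <= len(arrivals):
        raise ValueError("k must be between 1 and the number of learners")
    ordered = sorted(arrivals.items(), key=lambda item: item[1])
    quorum_time = ordered[k - 1][1]  # time at which the k-th learner arrived
    deadline = quorum_time + grace
    included = [name for name, t in ordered if t <= deadline]
    return deadline, included


if __name__ == "__main__":
    arrivals = {"a": 1.0, "b": 1.2, "c": 5.0, "d": 1.4}
    print(select_for_merge(arrivals, k=1))             # only the fastest learner
    print(select_for_merge(arrivals, k=1, grace=0.5))  # nearby learners join too
    print(select_for_merge(arrivals, k=4))             # fully synchronous: wait for all
```

With K=1 and no grace window only the fastest learner is merged. A half-second grace window pulls in the two learners that were close behind, while the straggler "c" is left for a later round instead of holding everyone up.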

In the system architecture, the syncer runs on CPU machines (no GPU/TPU required), and each learner is a separate TPU partition. Learners share no accelerator resources, so a failure in one doesn't cascade to others. The entire system is orchestrated by Google's Pathways.

Key Results: 88% Effective Compute vs. 58%, Model Quality Remains Identical

Under an extreme simulated failure scenario of 1.2 million chips, with MTBF=1 year per chip:

Decoupled DiLoCo (M=8) achieves 88% effective compute
Resilient data parallelism only achieves 58%
With a larger M, it can reach 100% uptime.

In terms of model quality, on Dense 2B/5B/9B and MoE 2.8B/3.8B models built on the Gemma 4 architecture, downstream performance on text and vision benchmarks is comparable to that of synchronous training. Even after post-training (SFT + RLHF), the results from all three pre-training methods are virtually identical.

Figure: Resilience comparison under hardware failures

Figure: Scalability of models at different sizes

Three Additional Capabilities: Heterogeneous Co-training, Dynamic Scaling, and Cross-Region Training

Heterogeneous Chip Co-training: TPUv5e and TPUv5p are mixed, with an 18% native speed difference and an additional 10% random variance injected. With K=1 and a grace window, ML performance matches the fully synchronous K=8 setup, so training is no longer held back by the slowest chip.

Dynamic Scaling (Scavenging): Training starts with a base of M=4 learners and temporarily scales up to M=8 or M=16. Under iso-FLOPs conditions, this accelerates training without degrading model quality. It's like getting a "free lunch" from temporarily idle compute to speed things up.

Cross-Region Training: 8 learners are distributed across the globe. Standard data parallelism becomes unusable (10–20 times slower), while Decoupled DiLoCo is almost unaffected. The bandwidth requirement is two orders of magnitude lower than that of data parallelism.

In Conclusion

The larger the scale, the more appealing asynchronous training becomes. The authors explicitly state that Decoupled DiLoCo’s model quality, relative to data parallelism, gets better with scale, and its systemic advantages (fault tolerance, bandwidth, heterogeneity) are also amplified at larger scales. This is a "bitter lesson"-style conclusion—the simpler, more scalable method ultimately wins.

Current experiments go up to 9B parameters, with a slight drop in ML performance at M=16, suggesting there's an upper limit to the number of learners. But as training moves toward cross-region and cross-generational chip deployments, prioritizing availability will shift from an "advantage" to a "must".

Paper Title: Decoupled DiLoCo for Resilient Distributed Pre-training
Paper Link: https://arxiv.org/abs/2604.21428v1