DWDP: NVIDIA's Secret Weapon—Removing Synchronization Locks in MoE Inference Boosts NVL72 Throughput by 8.8%

Editor's Note: The biggest hidden killer in large MoE model inference isn't compute power; it's synchronous waiting. NVIDIA's latest paper, DWDP (arXiv:2604.01621), proposes "Distributed Weight Data Parallelism." This approach allows 72 GB200 GPUs to run independently with asynchronous prefetching of expert weights, completely eliminating collective communication barriers. In real-world tests on DeepSeek-R1, output throughput increased by 8.8%, and iteration latency decreased by 14.3%. The SGLang community has already followed suit; this technical route is rewriting the rules of MoE inference.

Root of the Problem: The MoE Synchronization Trap

Current mainstream MoE inference frameworks (TensorRT-LLM, SGLang, vLLM) typically adopt a combined Expert Parallelism (EP) + Tensor Parallelism (TP) strategy. The core problems are:

  • At the end of every layer, all GPU ranks must gather and wait at the All-to-All / All-Gather synchronization barrier.

  • Expert routing is naturally unbalanced—some GPUs become overloaded while others sit idle, waiting.

  • The ultra-high bandwidth of NVLink is wasted most of the time, only spiking momentarily during collective communication.

Result: if even a few of the 72 B200 GPUs lag slightly, the entire inference iteration is dragged down to the pace of the slowest rank.

DWDP's Breakthrough Approach

DWDP (Distributed Weight Data Parallelism) adopts a different philosophy:

Don't move the data; move the weights.

Each GPU stores only a subset of the expert weights (its local experts), while attention-layer weights are fully replicated. When computation requires a "remote expert," the system uses the CUDA copy engine to asynchronously prefetch those weights over NVLink before they are needed.
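The prefetch pipeline can be illustrated with a CPU-side sketch. The paper's implementation uses a dedicated CUDA copy-engine stream; here a single-worker thread pool stands in for the copy engine, and `fetch_weights` / `run_expert` are hypothetical placeholders, not real DWDP APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_weights(expert_id):
    # Stand-in for an async NVLink copy of a remote expert's weights.
    return f"weights[{expert_id}]"

def run_expert(weights):
    # Stand-in for the local GEMM that consumes the weights.
    return f"out({weights})"

def pipeline(expert_ids):
    """Double-buffered loop: while expert i is computing, expert i+1's
    weights are already in flight on the separate 'copy stream'."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as copy_engine:
        inflight = copy_engine.submit(fetch_weights, expert_ids[0])
        for nxt in expert_ids[1:] + [None]:
            weights = inflight.result()          # wait for the prefetch to land
            if nxt is not None:                  # immediately kick off the next transfer
                inflight = copy_engine.submit(fetch_weights, nxt)
            outputs.append(run_expert(weights))  # compute overlaps with the copy
    return outputs

print(pipeline([0, 1, 2]))
```

The key property is that the blocking `result()` call only waits for a transfer that was launched one step earlier, so in steady state the copy latency is hidden behind compute.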

Three key mechanisms:

1. No Collective Communication: Completely removes AllReduce/AllGather from the critical inference path. All 72 ranks execute completely asynchronously and independently, with no waiting.

2. Double-Buffer Prefetching: A dedicated CUDA stream concurrently prefetches the next batch of expert weights, fully overlapping with current GEMM computations to hide NVLink transmission latency.

3. Grouped GEMM Direct Consumption: Modifies the underlying GroupedGEMM operator to accept a TensorList, so weights no longer need to be concatenated into a contiguous buffer, eliminating one memory copy.
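The third mechanism can be sketched numerically: a grouped GEMM that walks a list of per-expert weight matrices produces the same result as one that first packs them into a contiguous buffer, so the packing copy is pure overhead. This NumPy loop only illustrates the math; the real TensorRT-LLM kernel performs it inside a single fused GroupedGEMM launch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tokens already routed to 3 experts; each expert's weight matrix lives
# wherever the prefetcher dropped it (no contiguous layout assumed).
tokens_per_expert = [rng.standard_normal((4, 8)) for _ in range(3)]
expert_weights = [rng.standard_normal((8, 16)) for _ in range(3)]

# "TensorList" path: consume each weight tensor in place.
out_list = [x @ w for x, w in zip(tokens_per_expert, expert_weights)]

# Baseline path: pack weights into one contiguous buffer first
# (this np.stack is the memory copy DWDP eliminates).
packed = np.stack(expert_weights)
out_packed = [tokens_per_expert[i] @ packed[i] for i in range(3)]

assert all(np.allclose(a, b) for a, b in zip(out_list, out_packed))
print("tensor-list grouped GEMM matches the packed-buffer result")
```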

Real-World Data: DeepSeek-R1 × GB200 NVL72

Test Scenario: 8K input / 1K output, with a service load of 20–100 TPS per user.

| Metric | Baseline (EP/TP) | DWDP | Change |
| --- | --- | --- | --- |
| Output TPS/GPU | 1.00× | 1.088× | +8.8% |
| Iteration latency (context layer) | 1.00× | 0.857× | −14.3% |

An 8.8% throughput boost is significant at cluster scale: squeezing an extra 8.8% out of each of the 72 GPUs is equivalent to gaining roughly 6 additional GPUs' worth of compute out of thin air, zero-cost expansion.
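The back-of-the-envelope arithmetic behind that claim:

```python
gpus = 72
per_gpu_speedup = 0.088  # +8.8% output TPS per GPU

extra_gpu_equivalents = gpus * per_gpu_speedup
print(f"{extra_gpu_equivalents:.1f} GPU-equivalents of extra throughput")
```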

Industrial Adoption: SGLang Follows Suit as the Ecosystem Converges

DWDP has already been implemented in TensorRT-LLM (PR #12136), immediately sparking attention in the SGLang community. On April 4, SGLang developers raised Issue #22084, planning to port DWDP to the SGLang framework.

This means that both major open-source inference frameworks will natively support DWDP. The parallel paradigm for MoE inference is shifting from "synchronous collective communication" to "asynchronous distributed weights."

Current Limitation: It currently only supports single-node NVLink direct-connect environments; the cross-node RDMA version is under development.

A Deeper Look

The essence of DWDP is transforming a communication problem into a storage problem: trading NVLink bandwidth for the complete elimination of All-to-All synchronization. This shares the same lineage of thought as Mooncake using RDMA to blast through inference bottlenecks—the true battlefield for inference acceleration always lies on the side of the communication wall.

As the GB200 NVL72 becomes the new standard inference node, technologies like DWDP, which are deeply optimized for NVLink topology, will become mandatory for MoE inference.


Source: arXiv:2604.01621 | SGLang Issue #22084 | TensorRT-LLM PR #12136
