Exclusive! DeepSeek Partners with Tsinghua and Peking University to Release DualPath System, Boosting Average AI Inference Infrastructure Throughput 1.96x


Just moments after the leak of DeepSeek V4 Lite, the DeepSeek team has made headlines again, publishing a major AI research paper with Tsinghua University and Peking University that breaks through the storage bandwidth bottleneck in agentic LLM inference.

On February 26, it was announced that the DeepSeek team, jointly with the School of Computer Science at Tsinghua University and Peking University, has developed DualPath, an inference framework for agentic AI large language models (LLMs).

DualPath overcomes this bottleneck by introducing a dual-path KV-cache loading mechanism. It adds a novel storage-to-decode path, in which the KV cache is loaded through the decode engine and then efficiently forwarded to the prefill engine via RDMA over the compute network. DualPath combines this optimized data path, which avoids network congestion and keeps KV-cache traffic from interfering with latency-sensitive model-execution communications, with a global scheduler that dynamically balances load between prefill and decode engines.

Across three agent workloads, DualPath improved offline AI inference throughput by up to 1.87x; in online scenarios, it raised average serving throughput 1.96x.

The related paper has been published on arXiv. The first author is Yongtong Wu, a PhD student at the School of Computer Science, Peking University, and a member of the DeepSeek-AI Systems Group.

This is the first joint research result DeepSeek has released with these two universities, and it is also DeepSeek's first release specifically targeting the AI inference storage bottleneck.

DualPath System Architecture Diagram

arXiv: https://arxiv.org/abs/2602.21548

In practice, the performance of multi-turn agentic LLM inference is increasingly limited by KV-cache storage I/O rather than by computation. In the now-popular prefill/decode (PD) disaggregated architecture, loading massive amounts of KV cache from external storage creates a fundamental imbalance: the storage NIC on the prefill engine saturates while the storage NIC on the decode engine sits idle. This asymmetry severely limits overall system throughput.

Storage Bandwidth Bottleneck Illustration

The paper notes that AI data centers are logical supercomputers designed specifically to handle large-scale generative AI training and inference workloads. For instance, in a standard NVIDIA DGX SuperPOD, each node is equipped with 8 Hopper GPUs interconnected via high-speed NVLink. Each GPU is paired with a dedicated 400 Gbps compute NIC to maximize inter-node communication bandwidth. In addition to the compute interconnect architecture, each node is also equipped with a Storage NIC (SNIC, also known as a North-South NIC) with speeds up to 400 Gbps, enabling rapid access to datasets, model checkpoints, and KV caches on disk.

However, the team observed that GPU utilization in agent-based AI inference tasks is severely low. Their research indicates that KV-cache loading has become the bottleneck, caused by the limited bandwidth of the single storage NIC on each node.

Their analysis identifies three factors that jointly create this bottleneck:

First, agent workloads exhibit a high KV-cache hit rate, which shifts the balance toward more I/O and less computation, producing severe I/O bottlenecks. Agent-style workloads naturally feature long contexts, short appends, and multi-turn interactions: in each turn, the GPU must read the entire context's KV cache from persistent storage and run prefill computation only on the appended tokens. Trajectory data collected from representative coding tasks shows an average of 157 turns per trajectory, confirming that LLM agents interact over many turns. The average context length is 32.7k tokens, while the average append length is only 429 tokens, which corresponds to a KV-cache hit rate of 98.7%.
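These averages are internally consistent: with a 32.7k-token context and only 429 appended tokens per turn, nearly the entire context can be served from cached KV entries. A quick back-of-the-envelope check (using the quoted averages, not per-trajectory data):

```python
# Average figures quoted for the coding-agent trajectories.
avg_context_tokens = 32_700   # full context per turn
avg_append_tokens = 429       # newly appended tokens per turn

# Tokens whose KV entries can be reused from persistent storage
# instead of being recomputed.
cached_tokens = avg_context_tokens - avg_append_tokens
hit_rate = cached_tokens / avg_context_tokens

print(f"KV cache hit rate ≈ {hit_rate:.1%}")  # ≈ 98.7%
```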

In this regime, the cache-to-compute ratio, defined as the amount of KV cache that must be loaded per unit of computation performed, is approximately 22 GB/PFLOP for DeepSeek-V3.2, a heavy load for storage bandwidth. Notably, the KV-cache size of DeepSeek's MLA models is already highly optimized; for models with larger KV caches the situation would be even worse. The ratio for DeepSeek-V3.2 is actually higher than for DeepSeek-V3, because its sparse attention design shrinks the computation side of the ratio.
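The definition can be made concrete with a sketch for a single agent turn. The per-token constants below are made-up placeholders for illustration only; they are not the DeepSeek-V3.2 internals behind the paper's ~22 GB/PFLOP figure.

```python
# Illustrative cache-to-compute ratio for one agent turn.
# kv_bytes_per_token and flops_per_token are hypothetical placeholders,
# NOT the actual DeepSeek-V3.2 model constants.

context_tokens = 32_700        # cached context whose KV must be loaded
append_tokens = 429            # tokens that actually need prefill compute

kv_bytes_per_token = 70e3      # hypothetical KV-cache footprint per token
flops_per_token = 80e9         # hypothetical prefill FLOPs per token

bytes_loaded = context_tokens * kv_bytes_per_token        # storage I/O
pflops_computed = append_tokens * flops_per_token / 1e15  # compute

ratio_gb_per_pflop = (bytes_loaded / 1e9) / pflops_computed
print(f"cache-to-compute ratio ≈ {ratio_gb_per_pflop:.0f} GB/PFLOP")
```

The key takeaway is structural: because only the short append is computed while the whole context is loaded, the ratio scales roughly with context length divided by append length, so long-context, short-append agent turns are inherently I/O-heavy.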

Second, hardware evolution trends fit agent inference workloads poorly. In recent years, network bandwidth and HBM capacity have grown more slowly than GPU floating-point throughput, so agent workloads frequently hit memory and communication bottlenecks. As shown in Figure 3, from NVIDIA Ampere to Blackwell the I/O-to-compute ratio has dropped by 14.4x. Low NIC bandwidth caps KV-cache loading speed and leaves GPUs idle, while limited HBM capacity caps the token batch size of GPU kernels, leaving too little parallel work to fully utilize compute units such as tensor cores.
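To make the I/O-to-compute ratio concrete, here is a rough calculation from publicly quoted peak numbers. The spec figures and precision formats chosen here are illustrative assumptions on my part, not necessarily those behind the paper's Figure 3, though they happen to land near the same order-of-magnitude decline.

```python
# Rough, publicly quoted peak numbers (illustrative assumptions; note
# that the headline precision format also changes across generations,
# which is part of why the ratio collapses so fast).
gpus = {
    # name: (peak compute in PFLOPS, per-GPU NIC bandwidth in Gbps)
    "Ampere A100 (FP16)":   (0.312, 400),
    "Hopper H100 (FP8)":    (1.979, 400),
    "Blackwell B200 (FP4)": (9.0,   800),
}

# Gbps of NIC bandwidth available per PFLOPS of compute.
ratios = {name: nic / pflops for name, (pflops, nic) in gpus.items()}
for name, r in ratios.items():
    print(f"{name:>22}: {r:7.1f} Gbps/PFLOPS")

drop = ratios["Ampere A100 (FP16)"] / ratios["Blackwell B200 (FP4)"]
print(f"Ampere -> Blackwell ratio drop: {drop:.1f}x")
```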

Third, existing LLM inference systems suffer from a severe imbalance in storage-network utilization across engine types. In common data-distributed systems, the KV cache for hit tokens is loaded from remote storage entirely by the prefill engine. This design concentrates all storage I/O pressure on the prefill-side SNIC while the decode-side SNIC sits largely idle, so the total bandwidth of the storage network cannot be fully utilized.

The analysis above indicates that the fundamental performance problem for agent inference on a PD-disaggregated architecture is the combination of high I/O demand for KV-cache retrieval and uneven use of storage-network bandwidth across inference engines. The team also observed that traffic on the compute network, whose aggregate bandwidth far exceeds that of the storage network, is intermittent: the collective operations used in model inference arrive in sub-millisecond bursts.

This suggests a natural opportunity: the SNIC bandwidth of decode nodes can be used to load KV caches from storage and relay them to prefill nodes over the largely idle, much faster compute network.
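A toy model shows why this roughly doubles effective loading bandwidth. The per-node 400 Gbps SNIC matches the SuperPOD description above, but the cache size and the assumption of a perfectly idle compute network are illustrative simplifications, not the paper's measurements.

```python
# Toy model of the dual-path idea: each node has one 400 Gbps storage
# NIC (SNIC); between bursty collectives, the compute network is mostly
# idle, so KV cache pulled through the decode node's SNIC can be
# forwarded to the prefill node over RDMA at little extra cost.

SNIC_GBPS = 400        # per-node storage NIC bandwidth
kv_cache_gb = 2.3      # illustrative KV cache for one long-context request

def load_time_s(kv_gb: float, paths: int) -> float:
    """Seconds to pull kv_gb from storage over `paths` parallel SNICs."""
    aggregate_gbps = SNIC_GBPS * paths
    return kv_gb * 8 / aggregate_gbps  # GB -> Gb, divide by Gbps

single = load_time_s(kv_cache_gb, paths=1)  # prefill SNIC only
dual = load_time_s(kv_cache_gb, paths=2)    # prefill + decode SNICs

print(f"single-path: {single*1e3:.0f} ms, dual-path: {dual*1e3:.0f} ms")
```

The idealized bound is a 2x speedup; the 1.96x average online gain reported for DualPath suggests the real system gets close to that bound.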

DualPath Data Flow Optimization

Performance Comparison Chart

Consequently, DeepSeek, in collaboration with teams from Peking University and Tsinghua University, developed DualPath, which features three core innovations:

  1. An optimized dual-path data-loading design that introduces no inherent congestion under common P/D ratios.
  2. A NIC-centric traffic-management approach that isolates KV-cache traffic from latency-sensitive model-inference communications.
  3. A dynamic scheduling strategy that cooperatively balances compute and network utilization between prefill and decode engines.
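The third point can be sketched as a load-balancing problem. The snippet below is a minimal join-shortest-queue heuristic of my own, assuming the scheduler routes each KV-cache load to whichever SNIC (prefill side, or decode side with a compute-network forward) has the least queued work; it is not DualPath's actual policy.

```python
# Hypothetical global-scheduler sketch: route each KV-cache load to the
# least-loaded storage NIC across both engine types. Illustrative only.
import heapq

def schedule_loads(load_sizes_gb, n_prefill=1, n_decode=3):
    """Assign each load to the least-loaded SNIC; return the mapping
    and the final queued gigabytes per NIC."""
    nics = [(0.0, f"prefill-{i}") for i in range(n_prefill)]
    nics += [(0.0, f"decode-{i}") for i in range(n_decode)]
    heapq.heapify(nics)  # min-heap keyed on queued gigabytes

    assignment = {}
    for req, size in enumerate(load_sizes_gb):
        queued, name = heapq.heappop(nics)     # least-loaded NIC
        assignment[req] = name
        heapq.heappush(nics, (queued + size, name))
    return assignment, {name: q for q, name in nics}

assignment, totals = schedule_loads([2.3, 1.1, 3.0, 0.8, 2.3])
print(assignment)
print(totals)
```

With three decode nodes per prefill node, most loads land on otherwise-idle decode SNICs, which is exactly the imbalance the paper says goes unexploited today.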

Ultimately, the team implemented DualPath on top of their AI inference stack and evaluated it on representative agent workloads with long contexts and high cache reuse. Experiments show that DualPath significantly improves system throughput and time-to-first-token latency while leaving token-to-token latency unchanged. In agent inference scenarios, DualPath raises end-to-end offline inference throughput by up to 1.87x and average online serving throughput by 1.96x.

It is worth mentioning that over the past 48 hours, DeepSeek's unreleased V4 model has sparked heated discussion in the AI community. Multiple independent sources claim that DeepSeek V4 Lite test results show a significant improvement over V3.2, with support for 1M-token context and native multimodal capabilities; early SVG-generation samples have leaked and are circulating widely. The model is reportedly being tested by chip manufacturers such as Huawei.

According to multiple reporting sources, the DeepSeek-V4 version model, boasting over 660 billion parameters, is expected to be released as early as next week.

The paper's first author, Yongtong Wu, is a PhD student at Peking University (PKU) (speculated to have been born after 2000), supervised by Professor Xin Jin and working on systems software, particularly LLM infrastructure. He received his Bachelor's degree in Information and Computer Science from PKU in 2025, where he worked on RDMA middleware under Assistant Professor Qun Huang of the Department of Computer Science and Technology.

In July 2025, Yongtong Wu joined the DeepSeek Systems Group, where he focuses on building inference infrastructure for next-generation DeepSeek models. One of his key tasks is optimizing large-scale internal software systems to ensure strong performance across a range of hardware platforms.

https://jokerwyt.github.io/

Photo of Yongtong Wu

Another paper author: Xin Jin

PhD supervisor and Tenure-Track Assistant Professor at Peking University. He graduated from the Department of Computer Science at Peking University in 2011 and received his PhD from the Chinese University of Hong Kong in 2015. He previously worked at Huawei's Future Network Theory Lab (2015-2017) and the Institute of Computing Technology, Chinese Academy of Sciences (2017-2020), before joining Peking University in May 2020.

Xin Jin's main research directions include distributed stream processing and network measurement. He has published numerous papers in top-tier conferences in the field of networks and systems, including SIGCOMM, INFOCOM, VLDB, and USENIX ATC, and has led sub-projects of the National Key R&D Program and projects funded by the National Natural Science Foundation of China.

As of 2025, the team he supervises has achieved multiple results in big data system design and algorithm optimization, including two papers at the ICDE 2023 conference and first prize in the National College Student Information Storage Technology Competition.

Photo of Xin Jin


AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.