Ordinary Ethernet Cables Can Run Trillion-Parameter Models! Moonshot AI Unveils Breakthrough Architecture: No Need to Buy All H100s! 1T Model Test Shows 64% Latency Drop! The 'Siege' of Large Model Inference is Broken!

In the world of AI engineering, long-context inference has always been a "rich man's disease."

To make large models respond faster, vendors have been forced to cram thousands of expensive GPUs into the same data center and connect them with exorbitantly priced InfiniBand switches.

There is only one reason: KVCache (Key-Value Cache) is too heavy. As soon as the data leaves the data center or travels over ordinary network cables, transmission latency cripples the system, turning inference into "slow motion."

Must computing power only dance on expensive "isolated islands"?

Recently, Moonshot AI published a paper proposing the PrfaaS (Prefill-as-a-Service) architecture. With a striking set of results, the team shows that cross-datacenter scheduling for trillion-parameter models is achievable over ordinary Ethernet, without exorbitant network costs.

Targeting the Large-Scale LLM Inference Challenge: The KVCache Bandwidth Bottleneck

Readers who follow AI should by now be familiar with the concept of PD separation, which runs Prefill and Decode on separate nodes.

Moonshot's paper, simply put, takes aim at a very practical problem in large-scale LLM services:

How to efficiently separate Prefill and Decode across different data centers and heterogeneous hardware environments without being limited by KVCache transmission bandwidth.

In the past, cross-datacenter inference was considered "engineering suicide" because the KVCache of traditional models acts like a tsunami, instantly overwhelming the available bandwidth.

The paper points out the reason: although traditional PD separation architectures separate the compute-intensive Prefill from the memory-bandwidth-intensive Decode, the massive KVCache generated during the Prefill phase must reach the Decode nodes quickly over high-bandwidth networks (such as RDMA), or it will block inference. This leads to:

Prefill and Decode must be deployed within the same high-bandwidth network domain (e.g., a single data center).
Heterogeneous hardware (e.g., H100 for Prefill, H20 for Decode) is difficult to scale independently because KVCache cannot be efficiently transmitted over low-bandwidth networks.
Poor resource elasticity: Once the hardware ratio is fixed, it is difficult to adapt to changes in request length and cache hit rates.
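To make the bandwidth constraint concrete, here is a minimal back-of-the-envelope sketch. The model shape (64 layers, 8 KV heads, head dimension 128, 16-bit values) and the two link speeds are hypothetical numbers chosen for illustration; they are not taken from the paper.

```python
def kvcache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of one request's KVCache in a dense-attention model.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


def transfer_seconds(size_bytes, link_gbps):
    """Time to move size_bytes over a link of link_gbps gigabits per second."""
    return size_bytes * 8 / (link_gbps * 1e9)


# Hypothetical dense model and a 128K-token prompt (assumptions, not paper data).
size = kvcache_bytes(num_tokens=128_000, num_layers=64, num_kv_heads=8, head_dim=128)

for name, gbps in [("400 Gbps RDMA fabric", 400), ("10 Gbps Ethernet share", 10)]:
    print(f"{name}: {size / 1e9:.1f} GB in {transfer_seconds(size, gbps):.2f} s")
```

Under these assumptions a single long request carries roughly 33.6 GB of KVCache: well under a second on the fast fabric, but about 27 seconds on the slow link, which is why Prefill and Decode have traditionally been kept inside one high-bandwidth domain.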

Single Data Center Designs Have Major Flaws, While Cross-Datacenter Solutions Face Bottlenecks Like Bandwidth. How to Solve This?

Key Observation: Hybrid Attention Models Can Drastically Reduce KVCache

The paper points out that in new hybrid attention models (such as Kimi Linear, or designs that combine sliding-window attention (SWA) with grouped-query attention (GQA)), only a few full attention layers produce KVCache that grows with sequence length, while most linear-complexity layers only produce fixed-size states.

Through modeling analysis, the team discovered:

KV throughput (the amount of KVCache generated per unit of time) is only 1/4 that of dense models, and can fall as low as 1/36.
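A rough sketch of where a reduction like this can come from: if only a fraction of the layers are full attention, only that fraction contributes KVCache that grows with every token. The layer count, head shape, and the one-in-four layout below are assumptions for illustration, not the configuration measured in the paper.

```python
def kv_growth_bytes_per_token(full_attention_layers, num_kv_heads, head_dim,
                              bytes_per_value=2):
    """Bytes of KVCache added per input token.

    Only full-attention layers store a key and a value per token, hence the
    leading factor of 2. Layers that keep a fixed-size state add nothing here.
    """
    return 2 * full_attention_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, chosen only to make the arithmetic visible.
TOTAL_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128

dense = kv_growth_bytes_per_token(TOTAL_LAYERS, NUM_KV_HEADS, HEAD_DIM)
# Assumed hybrid layout: one full-attention layer in every four.
hybrid = kv_growth_bytes_per_token(TOTAL_LAYERS // 4, NUM_KV_HEADS, HEAD_DIM)

print(f"dense:  {dense} bytes per token")
print(f"hybrid: {hybrid} bytes per token ({hybrid / dense:.2f} of dense)")
```

This only covers cache size per token. KV throughput as defined above also depends on how fast prefill runs, so the sketch explains the direction of the effect rather than reproducing the reported figures.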

To put this order-of-magnitude reduction in perspective: transmitting KVCache used to be like moving an entire mountain; now it is like shipping a single CD. In effect, the algorithm puts KVCache on a diet.

This makes transmitting KVCache across data centers via ordinary Ethernet possible.

Beyond this observation, the Kimi team also proposed a core strategy for serving trillion-parameter models across data centers:

The core idea for cross-datacenter KVCache is not to outsource all prefill tasks, but to selectively extend the decoupled LLM service beyond a single cluster when "the acceleration benefit of remote prefill outweighs the transmission cost."
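The "benefit outweighs cost" test can be sketched as a simple time comparison. Everything here is an assumption made for illustration: the function name, the constant prefill rates, the fixed per-request overhead, and the choice to count the transfer as fully serialized rather than overlapped with computation. The paper's actual cost model may differ.

```python
def remote_prefill_pays_off(uncached_tokens, local_tokens_per_s, remote_tokens_per_s,
                            kv_bytes_per_token, link_gbps, fixed_overhead_s=0.5):
    """Return True if remote prefill is expected to finish sooner than local prefill.

    Hypothetical cost model:
      benefit = time saved by prefilling on the faster remote cluster
      cost    = KVCache transfer time + a fixed per-request overhead
    """
    local_s = uncached_tokens / local_tokens_per_s
    remote_s = uncached_tokens / remote_tokens_per_s
    transfer_s = uncached_tokens * kv_bytes_per_token * 8 / (link_gbps * 1e9)
    return (local_s - remote_s) > (transfer_s + fixed_overhead_s)


# Illustrative numbers only.
params = dict(local_tokens_per_s=5_000, remote_tokens_per_s=20_000,
              kv_bytes_per_token=20_000, link_gbps=10)

print(remote_prefill_pays_off(1_000, **params))   # short request
print(remote_prefill_pays_off(50_000, **params))  # long request
```

With these numbers the short request stays local because the fixed overhead dominates, while the long request is worth sending out. That is the intuition behind offloading selectively instead of outsourcing every prefill.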

Core Logic of PrfaaS:

How to turn the inference "tsunami" into a "stream"?

So, beyond theoretical feasibility, how is PrfaaS actually engineered? The team has innovated on both fronts: algorithm and system.

The overall logic of the PrfaaS-PD architecture is clear. It divides responsibilities between the local PD cluster and the PrfaaS cluster:

Dedicated PrfaaS Cluster: Executes compute-intensive long-context prefill (for uncached prefixes) on accelerators with high throughput and better cost-efficiency, and streams the generated KVCache to the local PD cluster via ordinary Ethernet;

Local PD Cluster: Handles the requests that are not worth sending across the link, namely short requests and requests with cache hits, and is responsible for Decode.

Note: The two clusters exchange KVCache over ordinary Ethernet (e.g., VPC, leased lines).

The heart of the dedicated PrfaaS cluster's design is the Hybrid Prefix Cache Pool.

Although the KVCache of hybrid attention models has become smaller, its types have also become more diverse.

In hybrid models, the recurrent states of linear attention or SWA layers are request-level: their size is independent of input length, and they can only be reused when the cache length matches exactly.
In contrast, the KVCache of full attention layers is block-level: they grow linearly with input length and support partial prefix matching.
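The difference between the two reuse rules can be shown with a toy example. The function names and the tiny block size are hypothetical, and token sequences are plain lists of integers; this is a sketch of the matching rules as described above, not the pool's real lookup logic.

```python
BLOCK_TOKENS = 4  # deliberately tiny block size for illustration


def reusable_full_attention_tokens(cached_tokens, request_tokens,
                                   block_tokens=BLOCK_TOKENS):
    """Block-level reuse: longest shared prefix, rounded down to whole blocks."""
    shared = 0
    for cached, requested in zip(cached_tokens, request_tokens):
        if cached != requested:
            break
        shared += 1
    return (shared // block_tokens) * block_tokens


def state_is_reusable(snapshot_tokens, request_tokens):
    """Request-level reuse: a state snapshot summarizes exactly the tokens it
    was computed over, so it only helps if the request starts with all of them.
    It cannot be truncated to a shorter shared prefix.
    """
    n = len(snapshot_tokens)
    return len(request_tokens) >= n and request_tokens[:n] == snapshot_tokens


cached = list(range(1, 11))             # cache covers tokens 1..10
diverging = [1, 2, 3, 4, 5, 6, 99, 100]  # shares only the first 6 tokens
extending = cached + [11, 12]            # continues the cached prompt

print(reusable_full_attention_tokens(cached, diverging), state_is_reusable(cached, diverging))
print(reusable_full_attention_tokens(cached, extending), state_is_reusable(cached, extending))
```

For the diverging request, one whole block of full attention KVCache is still reusable but the state snapshot is useless. For the extending request, both can be reused.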

This heterogeneity poses a challenge to the traditional paradigm of uniform KVCache storage across all layers.

The Hybrid Prefix Cache Pool is designed to solve exactly this problem, while also enabling efficient cross-cluster and cross-datacenter KVCache transmission and reuse.

In brief, the approach is separate management, unified memory. The cache pool manages linear states and full attention KVCache as separate groups, but the groups use aligned block sizes, so all of them can allocate and release blocks from a shared KVCache pool.

As a side note, the Kimi team built this cache pool on top of vLLM's hybrid KVCache manager. Interested readers can refer to the relevant papers.
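A minimal sketch of "separate management, unified memory" might look like the following. The class and method names are invented for illustration and do not correspond to vLLM's or the paper's API; the only idea carried over is that each group keeps its own bookkeeping while drawing same-sized blocks from one shared free list.

```python
class SharedBlockPool:
    """One free list of equally sized blocks shared by every cache group."""

    def __init__(self, num_blocks):
        self._free = list(range(num_blocks))

    @property
    def free_blocks(self):
        return len(self._free)

    def allocate(self, count):
        if count > len(self._free):
            raise MemoryError("KVCache pool exhausted")
        taken, self._free = self._free[:count], self._free[count:]
        return taken

    def release(self, block_ids):
        self._free.extend(block_ids)


class CacheGroup:
    """Per-group bookkeeping (e.g. linear states vs. full attention KVCache)."""

    def __init__(self, name, pool):
        self.name = name
        self._pool = pool
        self._owned = {}  # request id -> block ids held by this group

    def reserve(self, request_id, count):
        blocks = self._pool.allocate(count)
        self._owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def free(self, request_id):
        self._pool.release(self._owned.pop(request_id, []))


pool = SharedBlockPool(num_blocks=8)
full_attention = CacheGroup("full_attention", pool)
linear_state = CacheGroup("linear_state", pool)

full_attention.reserve("req-1", 5)  # grows with prompt length
linear_state.reserve("req-1", 1)    # fixed size per request
print(pool.free_blocks)
```

Because block sizes are aligned, a block released by one group can be handed to the other without fragmentation, which is the practical benefit of sharing one pool.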

Built-in Dual-Timescale Scheduling to Avoid Inference Stuttering

With that problem solved, the scheduling issue remains. PrfaaS does not naively offload all tasks. Instead, the research team built in smart traffic-splitting logic:

Selective Offloading: Only requests whose incremental length exceeds a threshold are sent to PrfaaS, so cross-cluster bandwidth is not wasted on short requests. Short requests are handled locally; long requests are processed remotely. The system makes this call automatically: a request is dispatched to the remote high-compute center only when its text is long enough (e.g., exceeding 19.4K tokens).
Bandwidth Awareness: Real-time monitoring of egress bandwidth and queue depth to dynamically adjust routing. Cache affinity is considered: if a cluster already has partial prefix caches, they are prioritized, with cross-cluster cache transmission occurring only when necessary.
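The two rules above can be combined into a small routing function. This is a sketch under stated assumptions: the names, the utilization and queue-depth limits, and the way cache affinity is reduced to "a local cache hit shortens the incremental length" are all hypothetical. Only the 19.4K-token figure comes from the article, where it is given as an example value.

```python
from dataclasses import dataclass


@dataclass
class LinkStatus:
    egress_utilization: float  # fraction of cross-cluster bandwidth in use, 0..1
    queue_depth: int           # KVCache transfers waiting to be sent


def route_prefill(total_tokens, local_cached_tokens, link,
                  length_threshold=19_400,  # example value quoted in the article
                  max_utilization=0.8,      # assumed congestion limit
                  max_queue_depth=32):      # assumed congestion limit
    """Return "local" or "prfaas" for one request."""
    incremental = total_tokens - local_cached_tokens
    if incremental <= length_threshold:
        return "local"   # short, or mostly covered by the local prefix cache
    if (link.egress_utilization >= max_utilization
            or link.queue_depth >= max_queue_depth):
        return "local"   # link is congested, keep the work at home
    return "prfaas"      # long uncached prefix and a healthy link


healthy = LinkStatus(egress_utilization=0.3, queue_depth=2)
congested = LinkStatus(egress_utilization=0.95, queue_depth=2)

print(route_prefill(8_000, 0, healthy))
print(route_prefill(60_000, 0, healthy))
print(route_prefill(60_000, 50_000, healthy))
print(route_prefill(60_000, 0, congested))
```

Only the second call is routed to PrfaaS: the first request is too short, the third is mostly cached locally, and the fourth arrives while the link is saturated.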

Just as a navigation app steers around congested roads, the scheduler monitors network speed. If the link between two locations is "congested," it automatically adjusts the route to prioritize smooth local inference.

The team also describes a dual-timescale scheduling strategy:

Short-term: Dynamically route requests based on bandwidth and cache distribution.
Long-term: Adjust the ratio of Prefill/Decode instances within the PD cluster based on traffic changes, and re-optimize the offloading threshold t.
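The article does not say how the threshold t is re-optimized, so the following is only one possible heuristic, written as a sketch: look at the incremental lengths seen in a recent window and pick the lowest cutoff whose offloaded KVCache traffic still fits a bandwidth budget. The function name, the budget, and the per-token cache size are assumptions; the adjustment of the Prefill/Decode instance ratio is not modeled.

```python
from collections import Counter


def pick_threshold(incremental_lengths, kv_bytes_per_token, window_s, budget_gbps):
    """Lowest cutoff t such that offloading every request longer than t
    keeps KVCache traffic within the bandwidth budget for the window."""
    budget_bytes = budget_gbps * 1e9 / 8 * window_s
    counts = Counter(incremental_lengths)
    # Start by offloading nothing: no request is longer than the maximum.
    threshold = max(counts, default=0)
    offloaded_bytes = 0
    for length in sorted(counts, reverse=True):  # admit the longest requests first
        cost = length * counts[length] * kv_bytes_per_token
        if offloaded_bytes + cost > budget_bytes:
            break
        offloaded_bytes += cost
        threshold = length - 1  # requests strictly longer than t are offloaded
    return threshold


recent = [1_000, 2_000, 30_000, 40_000, 40_000]  # illustrative window of requests
print(pick_threshold(recent, kv_bytes_per_token=10_000, window_s=1, budget_gbps=8))
print(pick_threshold(recent, kv_bytes_per_token=10_000, window_s=1, budget_gbps=10))
```

A tighter budget pushes the cutoff up so that only the longest requests travel; a looser one lets it drop. A production scheduler would presumably also weigh throughput and latency, which this sketch ignores.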

Hardware Deconstruction: Letting H200 Focus on the "Sprint"

Moreover, in actual testing, the Kimi team used H200s to form the PrfaaS cluster (specializing in computation, responsible for Prefill), while using an H20 cluster for Decode.

This "cross-datacenter combination" allows every chip to run in its comfort zone, solving the industry's awkward problem of "sufficient compute power, insufficient bandwidth."

Test Data: A "Dimensionality-Reducing Strike" for 1T Models

In a real test on a hybrid-architecture model at the 1-trillion-parameter scale (similar to Kimi Linear), PrfaaS delivered a report card capable of reshaping industry logic!

Specific deployment is as follows:
PrfaaS Cluster: 32 × H200 (High Compute)
Local PD Cluster: 64 × H20 (Bandwidth Optimized)
Cross-cluster Bandwidth: 100 Gbps Ethernet

First, compared with a traditional deployment, PrfaaS raised system throughput by 54%. Compared with heterogeneous PD without scheduling, the gain was 32%.

That's not all: results show that even at equal cost, PrfaaS system throughput still increases by approximately 15%.

Second, latency was significantly reduced: the P90 Time-To-First-Token (TTFT), a direct measure of user experience, dropped by 64%.

More importantly, it successfully achieved cross-city top-tier compute scheduling. It is reported that when PrfaaS processes trillion-parameter models, cross-center bandwidth occupancy is only 13 Gbps (13% of 100 Gbps), far lower than the requirements of dense models.
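The bandwidth figures above leave a lot of headroom, which a few lines of arithmetic make explicit. The helper name is hypothetical; the two inputs are the numbers quoted in the article.

```python
def link_headroom(observed_gbps, link_gbps):
    """Return (utilization, growth factor before the link saturates)."""
    return observed_gbps / link_gbps, link_gbps / observed_gbps


utilization, growth = link_headroom(observed_gbps=13, link_gbps=100)
print(f"utilization: {utilization:.0%}, traffic could grow about {growth:.1f}x")
```

At 13% utilization, cross-cluster KVCache traffic could grow roughly 7.7 times before a 100 Gbps link becomes the bottleneck.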

This means you can schedule top-tier compute power between two cities using nothing more than an ordinary 100G network cable.

Ending the "GPU-Only" Doctrine: Ordinary Cables Can Schedule Global Compute

Large Models Enter the Era of "Eastern Data, Western Computing"

This is the fourth year of the large model sprint. With inference compute increasingly scarce, the PrfaaS architecture from a team at Kimi arrives at the right time.

This paper not only proposes a distributed AI compute framework spanning cities and data centers, but also leaves plenty of room for imagination about the future of AI inference.

In the editor's view, several points are worth discussing:

First, Kimi's research brings practical "remote inference" a step closer. For large models, the concept of "Eastern Data, Western Computing" has been shown to be feasible in engineering terms: in the future, Prefill can be placed in the northwest where electricity is cheap, while Decode is placed in Beijing, Shanghai, Guangzhou, and Shenzhen, close to users. This point alone is remarkable.

Second, heterogeneous chips finally have hope for large-scale adoption. Must inference use only H100s? Of course not.

Operators could also use domestic high-compute chips for Prefill centers and high-bandwidth chips for Decode centers. PrfaaS acts like "glue," allowing chips of different brands and in different regions to collaborate effectively.

Finally, there are the second-order effects. Terms like "throughput increase" or "latency reduction" may feel abstract, but they translate directly into what everyone pays.

On the model side, improvements in these indicators mean markedly higher processing efficiency for 1T models (54% more throughput in the test above), implying a significant drop in the cost of handling millions of context tokens. On the user side, they mean a real drop in API prices!

In short, it is easy to predict that the industry will soon shift from the "single data center" to the "distributed compute cloud."

With actual results, Moonshot AI's PrfaaS shows once again that through the co-evolution of algorithms and engineering, ordinary network cables can also schedule global compute power! And the hope for lower model subscription prices is greater than ever!

From this perspective, the popularization of AI has truly just begun.

Paper Address:

https://arxiv.org/pdf/2604.15039v1
