Hello everyone, I'm Lao Zhang from AI Learning.
Regarding vLLM, I've written quite a bit before:
Gossip Time: Large Model Inference Engines, vLLM and SGLang Are Clashing
Omni-Modal Large Model Deployment: vLLM-Omni Arrives, 100% Open Source
Today, let's talk about four major updates vLLM released in quick succession in March 2026—Semantic Router v0.2 Athena, the NVIDIA Nemotron 3 Super launch, P-EAGLE parallel speculative decoding, and the Model Runner V2 architecture overhaul. This wave spans everything from the underlying engine to upper-layer orchestration, solidifying vLLM's position as the foundation for large model inference in 2026.
I. Semantic Router v0.2 Athena: From Router to System Brain
The first to appear is vLLM Semantic Router v0.2 Athena.
If you're not familiar with Semantic Router, simply put—it's not a model; it's the intelligent routing layer that helps you decide "which model should handle this request."
From v0.1 Iris to v0.2 Athena, this upgrade is quite significant.
The diagram below provides an overview of Athena's overall architecture, showing the complete process from signal extraction to decision routing and model selection:
1. Complete Overhaul of the Model Stack
Athena swaps in a new multilingual long-context foundation, mmbert-embed-32k-2d-matryoshka, supporting over 1,800 languages with a 32K context window. On top of it sits a whole family of classifiers, mom-multilingual-class, covering intent classification, jailbreak detection, PII detection, fact-checking, and feedback detection.
The figure below shows the new cross-modal embedding model multi-modal-embed-small, which can map text, images, and audio uniformly into the same 384-dimensional semantic space:
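To make the routing idea concrete, here is a minimal sketch of semantic routing by embedding similarity. Everything in it (route names, 4-dimensional toy vectors) is our own illustration, not Semantic Router's actual API; a real deployment would embed requests with a model like multi-modal-embed-small and compare in its 384-dimensional space:

```python
import numpy as np

# Hypothetical illustration of Semantic Router-style route selection:
# pick the model whose prototype embedding is closest to the request.
# Toy 4-dim vectors stand in for real 384-dim embeddings.
ROUTES = {
    "code-model":   np.array([0.9, 0.1, 0.0, 0.1]),
    "chat-model":   np.array([0.1, 0.9, 0.1, 0.0]),
    "vision-model": np.array([0.0, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_route(request_embedding):
    # Highest cosine similarity wins the request.
    return max(ROUTES, key=lambda name: cosine(request_embedding, ROUTES[name]))

print(pick_route(np.array([0.8, 0.2, 0.1, 0.0])))  # → code-model
```

The real router layers classifiers (intent, jailbreak, PII) on top of this similarity signal before committing to a backend.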
The performance improvement is immediately visible. A set of end-to-end tests was run on an AMD MI300X:
| Request Size | ONNX+GPU Avg Latency | ONNX+CPU Avg Latency | Candle+CPU Avg Latency |
|---|---|---|---|
| ~500 tokens | 22 ms | 853 ms | 1053 ms |
| ~2000 tokens | 31 ms | 1814 ms | 1805 ms |
| ~8000 tokens | 128 ms | 4796 ms | 1830 ms |
ONNX+GPU is roughly 40 times faster than the CPU solutions. And this isn't a synthetic benchmark; it measures the real routing path through Envoy→ext_proc→SR.
The figure below shows the full view of the Athena v0.2 model stack, where you can intuitively see the replacement of old and new foundations:
2. ClawOS: Turning the Router into an AI Operating System
This is Athena's boldest attempt. ClawOS transforms Semantic Router into an orchestration layer capable of managing multiple OpenClaw agent teams. You can create teams, assign Workers, and coordinate in real-time through natural language conversation—somewhat like building an "operating system" for AI agents.
The figure below shows the ClawOS Dashboard's multi-agent orchestration interface—where you can see the complete interface for team management, Worker assignment, and real-time chat collaboration:
Although still experimental, the direction is clear: future AI inference isn't just about "selecting models," but "managing teams."
3. Zero-Configuration Setup + Dashboard Driven
Previously, setting up Semantic Router required writing a bunch of YAML configurations. Now, it's done with a single command:
```shell
curl -fsSL https://vllm-semantic-router.com/install.sh | bash
```
After installation, the Dashboard starts automatically; just open it, configure a model, and you're ready. The figure below shows the new Dashboard's first-run setup guide:
The Dashboard can now not only configure routing but also visualize topology, replay routing decisions, and perform evaluation tests. It has truly become the "system brain":
4. AMD ROCm Yes!
AMD users are finally not second-class citizens anymore.
Athena has made ROCm a formally supported deployment path:
```shell
vllm-sr serve --platform amd
```
The figure below shows the AMD ROCm end-to-end deployment path, including GPU passthrough, ONNX acceleration, and CK Flash Attention support:
Lao Zhang Says: Semantic Router's ambitions are growing larger. From v0.1's "request routing" to v0.2's "system brain," vLLM is no longer just doing an inference engine but is moving into upper-layer orchestration. For production environments that need to run multiple models, this is something worth paying attention to.
II. NVIDIA Nemotron 3 Super: An MoE Model Born for Multi-Agent Systems
NVIDIA and vLLM have jointly launched official support for Nemotron 3 Super. Let's first look at a set of astonishing numbers:
Total Parameters: 120 Billion
Active Parameters: Only 12 Billion (MoE architecture, Latent MoE makes the inference cost of 4 experts equal to 1)
Context Window: 1 Million tokens
Supported GPUs: B200, H100, DGX Spark, RTX 6000
The figure below shows the Artificial Analysis evaluation comparison, where Nemotron 3 Super leads in both intelligence level and openness among open-source models of the same tier:
Why is it said to be "born for multi-agent systems"?
Multi-agent systems have two persistent problems:
Context Explosion: Multiple agents constantly pass around history, tool outputs, and reasoning steps, causing token counts to snowball. Nemotron 3 Super solves this by brute force with a 1-million-token context window—it can hold the entire history, significantly reducing goal drift.
Inference Tax: Using large models for every sub-task is slow and expensive. The MoE architecture activates only 12 billion parameters, increasing throughput by up to 5 times compared to the previous generation. NVFP4 precision on Blackwell is 4 times faster than H100's FP8, with almost no loss in accuracy.
The figure below shows Nemotron 3 Super's leading position in both efficiency and accuracy dimensions:
Quick Start
After installing vLLM, you can deploy it with a single command:
```shell
pip install vllm==0.17.1

# BF16 precision, 4-card H100 configuration
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --served-model-name nemotron \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3
```
Then you can call it using the standard OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="null")
resp = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me 3 bullet points about vLLM"}
    ],
    temperature=0.7,
    max_tokens=256,
)
print("Reasoning:", resp.choices[0].message.reasoning_content,
      "\nContent:", resp.choices[0].message.content)
```
It is worth noting that Nemotron 3 Super also supports a Thinking Budget, allowing fine-grained control over token overhead during inference—not every task requires deep reasoning, and simple tasks should spend fewer thinking tokens.
Lao Zhang Says: Nemotron 3 Super's positioning is very precise—it doesn't pursue the strongest single-point capability but finds the optimal solution on the Pareto frontier of "efficiency × accuracy." With 120B total parameters activating only 12B, coupled with a million-token context, it is custom-built for multi-agent workflows. If you are working on Agent orchestration or Tool-Calling Pipelines, this model is worth a serious evaluation.
III. P-EAGLE: Speculative Decoding Speeds Up Again, Handling All Draft Tokens in One Forward Pass
Speculative Decoding is currently one of the most effective technical directions for accelerating large model inference. The EAGLE series is the SOTA method in this field, and vLLM has been deeply integrating it. However, EAGLE has an unavoidable bottleneck—draft generation is autoregressive. If you want to predict K tokens, you have to run K forward passes. When you want to predict more, the draft model's own latency becomes the new bottleneck.
Let's look at the results first—the figure below shows the P-EAGLE performance comparison on SPEED-BENCH on NVIDIA B200, where the gap is obvious at a glance:
P-EAGLE's solution is very direct: change autoregressive draft generation to parallel generation—output all K draft tokens in one forward pass.
How is it done?
The figure below is the architectural principle diagram of P-EAGLE; the left side shows the traditional autoregressive method of EAGLE, and the right side shows the parallel method of P-EAGLE:
In the prefill phase, P-EAGLE is the same as ordinary EAGLE, capturing the hidden states of the target model. The key is in the second step—the draft generation phase:
For the next token (NTP), the input is exactly the same as standard EAGLE.
For positions 2 to K (MTP), the token embeddings and hidden states do not exist yet. What to do? P-EAGLE introduces two learnable parameters as placeholders: a shared mask token embedding and a shared hidden state `h_shared`.
All positions pass through N layers of Transformers together, outputting all draft tokens at once.
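The draft-input construction above can be sketched in a few lines. This is a toy illustration under our own simplifications (a random linear map stands in for the N transformer layers, and toy sizes replace real ones), but it shows the key move: MTP positions reuse one shared mask embedding and one shared `h_shared`, positional information tells them apart, and all K draft tokens come out of a single pass:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 8, 4, 100  # toy hidden size, draft length, vocab size

# Captured from the target model for the NTP position (random stand-ins here).
last_token_emb = rng.normal(size=D)
last_hidden = rng.normal(size=D)

# The two learnable placeholders P-EAGLE introduces (randomly initialized).
mask_token_emb = rng.normal(size=D)
h_shared = rng.normal(size=D)

# Position 1 (NTP) uses the real embedding/hidden state, exactly as in EAGLE;
# positions 2..K (MTP) all reuse the shared placeholders.
token_embs = np.vstack([last_token_emb] + [mask_token_emb] * (K - 1))
hiddens = np.vstack([last_hidden] + [h_shared] * (K - 1))
x = np.concatenate([token_embs, hiddens], axis=1)    # (K, 2D)

# Positional embeddings make the otherwise-identical MTP slots distinguishable.
pos_emb = rng.normal(size=(K, 2 * D))

# Stand-in for N transformer layers: one linear map to vocab logits.
W = rng.normal(size=(2 * D, V))
logits = (x + pos_emb) @ W                            # one pass, all K positions
draft_tokens = logits.argmax(axis=1)                  # all K drafts at once
print(draft_tokens.shape)  # (4,)
```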
Challenges in Long-Sequence Training
The biggest challenge in training P-EAGLE is memory. The figure below shows the sequence length distribution of GPT-OSS 120B on the UltraChat dataset—median 3891 tokens, P90 reaching 10800 tokens:
Training K parallel groups on a sequence of length N will generate N×K positions. When N=8192 and K=8, a single training sample has 65,536 positions, and the attention matrix requires 8GB. P-EAGLE solves this problem through a sequence partitioning algorithm.
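The memory claim is easy to verify with a back-of-the-envelope calculation (assuming a dense fp16 attention matrix, 2 bytes per entry):

```python
# Sanity check of the training-memory figures quoted above.
N, K = 8192, 8
positions = N * K                   # 65,536 positions per training sample
attn_bytes = positions ** 2 * 2     # dense attention matrix at 2 bytes (fp16)
print(positions)                    # 65536
print(attn_bytes / 2**30)           # 8.0 (GiB)
```

Sequence partitioning avoids ever materializing this full matrix.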
Actual Test Results
Detailed results of three sets of benchmark tests are as follows:
Throughput comparison under different concurrency levels on MT-Bench, where P-EAGLE leads at all concurrency levels:
In HumanEval code synthesis tasks, P-EAGLE's advantage remains obvious at high concurrency:
In SPEED-Bench long-text code generation tasks, P-EAGLE achieves a speedup of up to 1.69x at c=1:
A very interesting finding: P-EAGLE reaches peak performance at K=7, while EAGLE-3 peaks at K=3. The reason: parallel generation always costs exactly one forward pass no matter how large K is, so the deeper the speculation, the greater P-EAGLE's advantage.
The comparison of Accepted Length (AL) makes the point even more clearly. At K=7:
HumanEval: P-EAGLE 3.94 vs EAGLE-3 3.03 (30% higher)
SPEED-Bench: 3.38 vs 2.59 (31% higher)
MT-Bench: 3.70 vs 3.27 (13% higher)
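Why a fixed single draft pass lets K keep paying off can be seen with a simple cost model. The sketch below is our own rough illustration (the 0.1 relative draft-pass cost is an illustrative guess, not a measured number); it plugs in the HumanEval AL values at K=7 quoted above:

```python
# Toy cost model (our assumption, not from the post): per decoding step,
# EAGLE pays K autoregressive draft passes while P-EAGLE always pays 1,
# and each step yields roughly AL accepted tokens. c_draft is the draft
# head's cost relative to one target-model forward pass.
def time_per_token(al, draft_passes, c_draft=0.1):
    return (1 + draft_passes * c_draft) / al

K = 7
eagle3 = time_per_token(al=3.03, draft_passes=K)    # HumanEval AL at K=7
p_eagle = time_per_token(al=3.94, draft_passes=1)   # HumanEval AL at K=7
print(round(eagle3 / p_eagle, 2))  # roughly 2x under these assumptions
```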
Usage Method
Only two steps are needed:
Download (or train) the parallel draft head; pre-trained versions for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B are already available on HuggingFace.
Add a configuration parameter:
```shell
vllm serve openai/gpt-oss-20b \
  --speculative-config '{"method": "eagle3", "model": "amazon/gpt-oss-20b-p-eagle", "num_speculative_tokens": 5, "parallel_drafting": true}'
```
It's that simple; `"parallel_drafting": true` gets it done in one line.
Lao Zhang Says: P-EAGLE's approach is very elegant—since the draft model's sequence generation is the bottleneck, let's not generate sequences. Use learnable placeholders + parallel Transformers to get it done in one go. The cost is the need to retrain the draft head, but Amazon has released several pre-trained versions. For production environments pursuing extreme latency, this upgrade is very worth trying.
IV. Model Runner V2: A Complete Overhaul of the vLLM Core Engine
If the previous three updates were "adding on top of vLLM," then Model Runner V2 (MRV2) is a complete rewrite of the vLLM core engine.
This is the largest architectural upgrade since the release of vLLM V1 last year. The official statement is quite blunt: V1's model runner accumulated a lot of technical debt, with persistent state and model input coupling, asynchronous scheduling being a patch added later, the CPU doing too much work that should be done by the GPU, and the code becoming increasingly difficult to maintain.
MRV2 is rebuilt around three core principles: modularity, GPU-native, and async-first.
1. Better Persistent Batching + GPU-Native Input Preparation
V1 directly used persistent state as model input, leading to layout constraints and complex state management. The figure below shows the problem in V1 where request order is tightly coupled with Block Table layout:
MRV2 decouples persistent request state from per-step input tensors—each active request has a stable row in a fixed-size state table, and inputs are extracted from it according to the current order at each step. The figure below clearly shows how the new design generates correctly ordered inputs via gather operations:
More critically, input preparation has been moved to the GPU, completed using Triton Kernels. Tensors like input_ids, positions, query_start_loc, and seq_lens are now built directly on the GPU, bypassing the CPU.
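The gather-based design can be illustrated with a toy state table. The names and sizes here are our own simplification of the idea (in MRV2 the equivalent work happens in Triton kernels on the GPU; numpy stands in here):

```python
import numpy as np

# Each active request owns a stable row in a fixed-size persistent state
# table; rows never move while the request lives. Per step, inputs are
# gathered from those rows in whatever order this step's batch uses.
MAX_REQS = 8
seq_lens_state = np.zeros(MAX_REQS, dtype=np.int32)   # persistent state

# Requests A, B, C were assigned rows 5, 1, 3 when they arrived.
rows = {"A": 5, "B": 1, "C": 3}
seq_lens_state[[5, 1, 3]] = [17, 4, 9]

# This step's batch order happens to be [C, A, B]; one gather builds the
# correctly ordered per-step input tensor, independent of row layout.
batch = np.array([rows["C"], rows["A"], rows["B"]])
seq_lens = seq_lens_state[batch]
print(seq_lens)  # [ 9 17  4]
```

Because the gather decouples batch order from storage layout, reordering or evicting requests never forces the state table to be rewritten.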
2. Async-First Design
V1's asynchronous scheduling was "added later," while MRV2 takes it as a core design constraint—the goal is zero synchronization between CPU and GPU.
The figure below shows the standard asynchronous scheduling timeline, where the CPU prepares for step N+1 while the GPU executes step N:
The most direct benefit: asynchronous scheduling and speculative decoding can finally coexist cleanly. The figure below shows how MRV2 consumes rejection sampling results directly via GPU-side input preparation, eliminating all synchronization points:
3. Triton-Native Sampler
MRV2 rewrites the sampling logic:
Gumbel-Max sampling kernel, avoiding explicit softmax calculations.
More efficient top-k logprobs, finding top-k logits first before calculating logprobs.
More memory-efficient prompt logprobs, supporting chunked processing within a single prompt.
Better compatibility with speculative decoding.
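The Gumbel-Max trick itself is standard and easy to demonstrate: adding independent Gumbel noise to the logits and taking the argmax draws a token from exactly the softmax distribution, with no explicit softmax. The sketch below is our own numpy demonstration of the principle, not MRV2's Triton kernel:

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    # Gumbel noise via inverse transform: -log(-log(U)), U ~ Uniform(0, 1).
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + gumbel))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, 1.0])
counts = np.bincount(
    [gumbel_max_sample(logits, rng) for _ in range(20000)], minlength=4
)
probs = np.exp(logits) / np.exp(logits).sum()  # reference softmax, for comparison
print(np.round(counts / 20000, 2), np.round(probs, 2))
```

The empirical frequencies match the softmax probabilities, confirming the kernel can skip the normalization entirely.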
4. Stronger Modularity
V1's gpu_model_runner.py had already swollen to 6,700 lines. MRV2 introduces the ModelState abstract interface:
```python
class ModelState(ABC):
    def add_request(self, ...): ...
    def remove_request(self, ...): ...
    def get_mm_embeddings(self, ...): ...
    def prepare_inputs(self, ...): ...
    def prepare_attn(self, ...): ...
    def prepare_dummy_inputs(self, ...): ...
    ...
```
This separates model-specific logic (multimodal embeddings, extra inputs, attention metadata) from the general execution path. The largest file is now kept under 1,300 lines.
This is extremely important for developers of different model series like DeepSeek, Qwen, and Kimi—you only need to care about your model's ModelState, without reading thousands of lines of irrelevant code.
Performance Testing
Running the small model Qwen3-0.6B on GB200 (intentionally using a small model to amplify the impact of CPU overhead), throughput jumped directly from 16K to 25K:
In speculative decoding scenarios: 4-card GB200 + GLM-4.7-FP8 + MTP=1, TPOT reduced by 6.3%:
The improvement comes from the zero-synchronization design—CPU-GPU synchronization points are completely eliminated after enabling speculative decoding.
Try It Now
```shell
export VLLM_USE_V2_MODEL_RUNNER=1
# Then use vLLM normally, no code changes needed
```
Note, however, that MRV2 is currently experimental. In v0.18.0, several features are not yet supported: linear attention models (Qwen3.5, Nemotron 3 Super), speculative decoding methods other than Eagle/Eagle3/MTP, LoRA, etc.
Lao Zhang Says: MRV2 is a refactoring that cuts to the bone, but the direction is absolutely right. Moving input preparation to the GPU, achieving zero-sync asynchronous scheduling, and introducing the ModelState decoupling—these improvements are not just icing on the cake; they lay the foundation for future scenarios where heterogeneous models, speculative decoding, and multimodality coexist. The 56% throughput increase is just the beginning; as more features migrate to MRV2, the benefits will keep compounding.
Summary: vLLM March 2026 Panorama
| Update | Release Date | One-Sentence Summary |
|---|---|---|
| Semantic Router v0.2 Athena | March 10 | Evolved from a router to a system brain for multi-model orchestration |
| Nemotron 3 Super | March 11 | 120B total params/12B active, an MoE model tailor-made for multi-agents |
| P-EAGLE | March 13 | Handles all draft tokens in one forward pass; speculative decoding no longer has sequence bottlenecks |
| Model Runner V2 | March 24 | Complete overhaul of vLLM core engine: GPU-native + zero-sync + strong modularity |
Looking at these four releases together, vLLM's strategic intent is very clear:
Underlying Layer—MRV2 rebuilds the engine foundation, preparing for more complex inference demands.
Acceleration—P-EAGLE breaks through the ceiling again in the key optimization direction of speculative decoding.
Models—Nemotron 3 Super fills the ecological niche for efficient MoE models.
Upper Layer—Semantic Router Athena begins handling multi-model orchestration and agent scheduling.
From "inference engine" to "inference platform," vLLM is completing a leap from a tool to an ecosystem.
Relevant Links:
Semantic Router v0.2 Athena: https://vllm.ai/blog/v0.2-vllm-sr-athena-release
Nemotron 3 Super: https://vllm.ai/blog/nemotron-3-super
P-EAGLE: https://vllm.ai/blog/p-eagle
Model Runner V2: https://vllm.ai/blog/mrv2
vLLM Official Site: https://vllm.ai
Semantic Router GitHub: https://github.com/vllm-project/semantic-router
#vLLM #LargeModelInference #SpeculativeDecoding #Nemotron #SemanticRouter
Creating content isn't easy; if you found this article useful, please follow, and give it the triple hit: Like, Share, and View. A star 🌟 would be appreciated too. Thanks for reading; see you in the next one!