Hello everyone, I'm Lao Zhang from AI Learning.
Regarding vLLM, I've written quite a bit before:
Gossip Time: Large Model Inference Engines, vLLM and SGLang Are Clashing
Omni-Modal Large Model Deployment: vLLM-Omni Arrives, 100% Open Source
Today, let's talk about four major updates vLLM released in quick succession in March 2026—Semantic Router v0.2 Athena, the NVIDIA Nemotron 3 Super launch, P-EAGLE parallel speculative decoding, and the Model Runner V2 architecture overhaul. This wave spans everything from the underlying engine to upper-layer orchestration, solidifying vLLM's position as the foundation for large model inference in 2026.
I. Semantic Router v0.2 Athena: From Router to System Brain
The first to appear is vLLM Semantic Router v0.2 Athena.
If you're not familiar with Semantic Router, simply put—it's not a model; it's the intelligent routing layer that helps you decide "which model should handle this request."
From v0.1 Iris to v0.2 Athena, this upgrade is quite significant.
The diagram below provides an overview of Athena's overall architecture, showing the complete process from signal extraction to decision routing and model selection:
1. Complete Overhaul of the Model Stack
Athena swaps in a new multilingual long-context foundation, mmbert-embed-32k-2d-matryoshka, supporting over 1,800 languages with a 32K context window. On top of it sits a whole family of classifiers, mom-multilingual-class, covering intent classification, jailbreak detection, PII detection, fact-checking, and feedback detection.
The figure below shows the new cross-modal embedding model multi-modal-embed-small, which can map text, images, and audio uniformly into the same 384-dimensional semantic space:
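To make the routing idea concrete, here is a minimal sketch of semantic routing by embedding similarity. Everything in it (route names, 4-dimensional toy vectors) is our own illustration, not Semantic Router's actual API; a real deployment would embed requests with a model like multi-modal-embed-small and compare in its 384-dimensional space:

```python
import numpy as np

# Hypothetical illustration of Semantic Router-style route selection:
# pick the model whose prototype embedding is closest to the request.
# Toy 4-dim vectors stand in for real 384-dim embeddings.
ROUTES = {
    "code-model":   np.array([0.9, 0.1, 0.0, 0.1]),
    "chat-model":   np.array([0.1, 0.9, 0.1, 0.0]),
    "vision-model": np.array([0.0, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_route(request_embedding):
    # Highest cosine similarity wins the request.
    return max(ROUTES, key=lambda name: cosine(request_embedding, ROUTES[name]))

print(pick_route(np.array([0.8, 0.2, 0.1, 0.0])))  # → code-model
```

The real router layers classifiers (intent, jailbreak, PII) on top of this similarity signal before committing to a backend.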
The performance improvement is immediately visible. A set of end-to-end tests was run on an AMD MI300X:
| Request Size | ONNX+GPU Avg Latency | ONNX+CPU Avg Latency | Candle+CPU Avg Latency |
|---|---|---|---|
| ~500 tokens | 22 ms | 853 ms | 1053 ms |
| ~2000 tokens | 31 ms | 1814 ms | 1805 ms |
| ~8000 tokens | 128 ms | 4796 ms | 1830 ms |
ONNX+GPU is roughly 40 times faster than the CPU solutions. And this isn't a synthetic benchmark; it measures the real routing path through Envoy→ext_proc→SR.
The figure below shows the full view of the Athena v0.2 model stack, where you can intuitively see the replacement of old and new foundations:
2. ClawOS: Turning the Router into an AI Operating System
This is Athena's boldest attempt. ClawOS transforms Semantic Router into an orchestration layer capable of managing multiple OpenClaw agent teams. You can create teams, assign Workers, and coordinate in real-time through natural language conversation—somewhat like building an "operating system" for AI agents.
The figure below shows the ClawOS Dashboard's multi-agent orchestration interface—where you can see the complete interface for team management, Worker assignment, and real-time chat collaboration:
Although still experimental, the direction is clear: future AI inference isn't just about "selecting models," but "managing teams."
3. Zero-Configuration Setup + Dashboard Driven
Previously, setting up Semantic Router required writing a bunch of YAML configurations. Now, it's done with a single command:
```shell
curl -fsSL https://vllm-semantic-router.com/install.sh | bash
```
After installation, the Dashboard starts automatically; just open it, configure a model, and you're ready. The figure below shows the new Dashboard's first-run setup guide:
The Dashboard can now not only configure routing but also visualize topology, replay routing decisions, and perform evaluation tests. It has truly become the "system brain":
4. AMD ROCm Yes!
AMD users are finally not second-class citizens anymore.
Athena has made ROCm a formally supported deployment path:
```shell
vllm-sr serve --platform amd
```
The figure below shows the AMD ROCm end-to-end deployment path, including GPU passthrough, ONNX acceleration, and CK Flash Attention support:
Lao Zhang Says: Semantic Router's ambitions are growing larger. From v0.1's "request routing" to v0.2's "system brain," vLLM is no longer just doing an inference engine but is moving into upper-layer orchestration. For production environments that need to run multiple models, this is something worth paying attention to.
II. NVIDIA Nemotron 3 Super: An MoE Model Born for Multi-Agent Systems
NVIDIA and vLLM have jointly launched official support for Nemotron 3 Super. Let's first look at a set of astonishing numbers:
Total Parameters: 120 Billion
Active Parameters: Only 12 Billion (MoE architecture, Latent MoE makes the inference cost of 4 experts equal to 1)
Context Window: 1 Million tokens
Supported GPUs: B200, H100, DGX Spark, RTX 6000
The figure below shows the Artificial Analysis evaluation comparison, where Nemotron 3 Super leads in both intelligence level and openness among open-source models of the same tier:
Why is it said to be "born for multi-agent systems"?
Multi-agent systems have two persistent problems:
Context Explosion: Multiple agents constantly pass around history, tool outputs, and reasoning steps, causing token counts to snowball. Nemotron 3 Super solves this by brute force with a 1-million-token context window—it can hold the entire history, significantly reducing goal drift.
Inference Tax: Using large models for every sub-task is slow and expensive. The MoE architecture activates only 12 billion parameters, increasing throughput by up to 5 times compared to the previous generation. NVFP4 precision on Blackwell is 4 times faster than H100's FP8, with almost no loss in accuracy.
The figure below shows Nemotron 3 Super's leading position in both efficiency and accuracy dimensions:
Quick Start
After installing vLLM, you can deploy it with a single command:
```shell
pip install vllm==0.17.1

# BF16 precision, 4-card H100 configuration
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --served-model-name nemotron \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3
```
Then you can call it using the standard OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="null")
resp = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me 3 bullet points about vLLM"}
    ],
    temperature=0.7,
    max_tokens=256,
)
print("Reasoning:", resp.choices[0].message.reasoning_content,
      "\nContent:", resp.choices[0].message.content)
```
It is worth noting that Nemotron 3 Super also supports a Thinking Budget, allowing fine-grained control over token overhead during inference—not every task requires deep reasoning, and simple tasks should spend fewer thinking tokens.
Lao Zhang Says: Nemotron 3 Super's positioning is very precise—it doesn't pursue the strongest single-point capability but finds the optimal solution on the Pareto frontier of "efficiency × accuracy." With 120B total parameters activating only 12B, coupled with a million-token context, it is custom-built for multi-agent workflows. If you are working on Agent orchestration or Tool-Calling Pipelines, this model is worth a serious evaluation.
III. P-EAGLE: Speculative Decoding Speeds Up Again, Handling All Draft Tokens in One Forward Pass
Speculative Decoding is currently one of the most effective technical directions for accelerating large model inference. The EAGLE series is the SOTA method in this field, and vLLM has been deeply integrating it. However, EAGLE has an unavoidable bottleneck—draft generation is autoregressive. If you want to predict K tokens, you have to run K forward passes. When you want to predict more, the draft model's own latency becomes the new bottleneck.
Let's look at the results first—the figure below shows the P-EAGLE performance comparison on SPEED-BENCH on NVIDIA B200, where the gap is obvious at a glance:
P-EAGLE's solution is very direct: change autoregressive draft generation to parallel generation—output all K draft tokens in one forward pass.
How is it done?
The figure below is the architectural principle diagram of P-EAGLE; the left side shows the traditional autoregressive method of EAGLE, and the right side shows the parallel method of P-EAGLE:
In the prefill phase, P-EAGLE is the same as ordinary EAGLE, capturing the hidden states of the target model. The key is in the second step—the draft generation phase:
For the next token (NTP), the input is exactly the same as standard EAGLE.
For positions 2 to K (MTP), the token embeddings and hidden states do not exist yet. What to do? P-EAGLE introduces two learnable parameters as placeholders: a shared mask token embedding and a shared hidden state `h_shared`.
All positions pass through N layers of Transformers together, outputting all draft tokens at once.
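The draft-input construction above can be sketched in a few lines. This is a toy illustration under our own simplifications (a random linear map stands in for the N transformer layers, and toy sizes replace real ones), but it shows the key move: MTP positions reuse one shared mask embedding and one shared `h_shared`, positional information tells them apart, and all K draft tokens come out of a single pass:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 8, 4, 100  # toy hidden size, draft length, vocab size

# Captured from the target model for the NTP position (random stand-ins here).
last_token_emb = rng.normal(size=D)
last_hidden = rng.normal(size=D)

# The two learnable placeholders P-EAGLE introduces (randomly initialized).
mask_token_emb = rng.normal(size=D)
h_shared = rng.normal(size=D)

# Position 1 (NTP) uses the real embedding/hidden state, exactly as in EAGLE;
# positions 2..K (MTP) all reuse the shared placeholders.
token_embs = np.vstack([last_token_emb] + [mask_token_emb] * (K - 1))
hiddens = np.vstack([last_hidden] + [h_shared] * (K - 1))
x = np.concatenate([token_embs, hiddens], axis=1)    # (K, 2D)

# Positional embeddings make the otherwise-identical MTP slots distinguishable.
pos_emb = rng.normal(size=(K, 2 * D))

# Stand-in for N transformer layers: one linear map to vocab logits.
W = rng.normal(size=(2 * D, V))
logits = (x + pos_emb) @ W                            # one pass, all K positions
draft_tokens = logits.argmax(axis=1)                  # all K drafts at once
print(draft_tokens.shape)  # (4,)
```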
Challenges in Long-Sequence Training
The biggest challenge in training P-EAGLE is memory. The figure below shows the sequence length distribution of GPT-OSS 120B on the UltraChat dataset—median 3891 tokens, P90 reaching 10800 tokens:
Training K parallel groups on a sequence of length N will generate N×K positions. When N=8192 and K=8, a single training sample has 65,536 positions, and the attention matrix requires 8GB. P-EAGLE solves this problem through a sequence partitioning algorithm.
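The memory claim is easy to verify with a back-of-the-envelope calculation (assuming a dense fp16 attention matrix, 2 bytes per entry):

```python
# Sanity check of the training-memory figures quoted above.
N, K = 8192, 8
positions = N * K                   # 65,536 positions per training sample
attn_bytes = positions ** 2 * 2     # dense attention matrix at 2 bytes (fp16)
print(positions)                    # 65536
print(attn_bytes / 2**30)           # 8.0 (GiB)
```

Sequence partitioning avoids ever materializing this full matrix.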
Actual Test Results
Detailed results of three sets of benchmark tests are as follows:
Throughput comparison under different concurrency levels on MT-Bench, where P-EAGLE leads at all concurrency levels:
In HumanEval code synthesis tasks, P-EAGLE's advantage remains obvious at high concurrency:
In SPEED-Bench long-text code generation tasks, P-EAGLE achieves a speedup of up to 1.69x at c=1:
A very interesting finding: P-EAGLE reaches peak performance at K=7, while EAGLE-3 peaks at K=3. The reason: parallel generation always costs exactly one forward pass no matter how large K is, so the deeper the speculation, the greater P-EAGLE's advantage.
The comparison of Accepted Length (AL) makes the point even more clearly. At K=7:
HumanEval: P-EAGLE 3.94 vs EAGLE-3 3.03 (30% higher)
SPEED-Bench: 3.38 vs 2.59 (31% higher)
MT-Bench: 3.70 vs 3.27 (13% higher)
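Why a fixed single draft pass lets K keep paying off can be seen with a simple cost model. The sketch below is our own rough illustration (the 0.1 relative draft-pass cost is an illustrative guess, not a measured number); it plugs in the HumanEval AL values at K=7 quoted above:

```python
# Toy cost model (our assumption, not from the post): per decoding step,
# EAGLE pays K autoregressive draft passes while P-EAGLE always pays 1,
# and each step yields roughly AL accepted tokens. c_draft is the draft
# head's cost relative to one target-model forward pass.
def time_per_token(al, draft_passes, c_draft=0.1):
    return (1 + draft_passes * c_draft) / al

K = 7
eagle3 = time_per_token(al=3.03, draft_passes=K)    # HumanEval AL at K=7
p_eagle = time_per_token(al=3.94, draft_passes=1)   # HumanEval AL at K=7
print(round(eagle3 / p_eagle, 2))  # roughly 2x under these assumptions
```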
Usage Method
Only two steps are needed:
Download (or train) the parallel draft head; pre-trained versions for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B are already available on HuggingFace.
Add a configuration parameter:
```shell
vllm serve openai/gpt-oss-20b \
  --speculative-config '{"method": "eagle3", "model": "amazon/gpt-oss-20b-p-eagle", "num_speculative_tokens": 5, "parallel_drafting": true}'
```
It's that simple; `"parallel_drafting": true` gets it done in one line.
Lao Zhang Says: P-EAGLE's approach is very elegant—since the draft model's sequence generation is the bottleneck, let's not generate sequences. Use learnable placeholders + parallel Transformers to get it done in one go. The cost is the need to retrain the draft head, but Amazon has released several pre-trained versions. For production environments pursuing extreme latency, this upgrade is very worth trying.
IV. Model Runner V2: A Complete Overhaul of the vLLM Core Engine
If the previous three updates were "adding on top of vLLM," then Model Runner V2 (MRV2) is a complete rewrite of the vLLM core engine.
This is the largest architectural upgrade since the release of vLLM V1 last year. The official statement is quite blunt: V1's model runner accumulated a lot of technical debt, with persistent state and model input coupling, asynchronous scheduling being a patch added later, the CPU doing too much work that should be done by the GPU, and the code becoming increasingly difficult to maintain.
MRV2 is rebuilt around three core principles: modularity, GPU-native, and async-first.
1. Better Persistent Batching + GPU-Native Input Preparation
V1 directly used persistent state as model input, leading to layout constraints and complex state management. The figure below shows the problem in V1 where request order is tightly coupled with Block Table layout:
MRV2 decouples persistent request state from per-step input tensors—each active request has a stable row in a fixed-size state table, and inputs are extracted from it according to the current order at each step. The figure below clearly shows how the new design generates correctly ordered inputs via gather operations:
More critically, input preparation has been moved to the GPU, completed using Triton Kernels. Tensors like input_ids, positions, query_start_loc, and seq_lens are now built directly on the GPU, bypassing the CPU.
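The gather-based design can be illustrated with a toy state table. The names and sizes here are our own simplification of the idea (in MRV2 the equivalent work happens in Triton kernels on the GPU; numpy stands in here):

```python
import numpy as np

# Each active request owns a stable row in a fixed-size persistent state
# table; rows never move while the request lives. Per step, inputs are
# gathered from those rows in whatever order this step's batch uses.
MAX_REQS = 8
seq_lens_state = np.zeros(MAX_REQS, dtype=np.int32)   # persistent state

# Requests A, B, C were assigned rows 5, 1, 3 when they arrived.
rows = {"A": 5, "B": 1, "C": 3}
seq_lens_state[[5, 1, 3]] = [17, 4, 9]

# This step's batch order happens to be [C, A, B]; one gather builds the
# correctly ordered per-step input tensor, independent of row layout.
batch = np.array([rows["C"], rows["A"], rows["B"]])
seq_lens = seq_lens_state[batch]
print(seq_lens)  # [ 9 17  4]
```

Because the gather decouples batch order from storage layout, reordering or evicting requests never forces the state table to be rewritten.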
2. Async-First Design
V1's asynchronous scheduling was "added later," while MRV2 takes it as a core design constraint—the goal is zero synchronization between CPU and GPU.
The figure below shows the standard asynchronous scheduling timeline, where the CPU prepares for step N+1 while the GPU executes step N:
The most direct benefit: asynchronous scheduling and speculative decoding can finally coexist cleanly. The figure below shows how MRV2 consumes rejection sampling results directly via GPU-side input preparation, eliminating all synchronization points:
3. Triton-Native Sampler
MRV2 rewrites the sampling logic:
Gumbel-Max sampling kernel, avoiding explicit softmax calculations.
More efficient top-k logprobs, finding top-k logits first before calculating logprobs.
More memory-efficient prompt logprobs, supporting chunked processing within a single prompt.
Better compatibility with speculative decoding.
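The Gumbel-Max trick itself is standard and easy to demonstrate: adding independent Gumbel noise to the logits and taking the argmax draws a token from exactly the softmax distribution, with no explicit softmax. The sketch below is our own numpy demonstration of the principle, not MRV2's Triton kernel:

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    # Gumbel noise via inverse transform: -log(-log(U)), U ~ Uniform(0, 1).
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + gumbel))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, 1.0])
counts = np.bincount(
    [gumbel_max_sample(logits, rng) for _ in range(20000)], minlength=4
)
probs = np.exp(logits) / np.exp(logits).sum()  # reference softmax, for comparison
print(np.round(counts / 20000, 2), np.round(probs, 2))
```

The empirical frequencies match the softmax probabilities, confirming the kernel can skip the normalization entirely.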
4. Stronger Modularity
V1's gpu_model_runner.py had already swollen to 6,700 lines. MRV2 introduces the ModelState abstract interface:
```python
class ModelState(ABC):
    def add_request(self, ...): ...
    def remove_request(self, ...): ...
    def get_mm_embeddings(self, ...): ...
    def prepare_inputs(self, ...): ...
    def prepare_attn(self, ...): ...
    def prepare_dummy_inputs(self, ...): ...
    ...
```
This separates model-specific logic (multimodal embeddings, extra inputs, attention metadata) from the general execution path. The largest file is now kept under 1,300 lines.
This is extremely important for developers of different model series like DeepSeek, Qwen, and Kimi—you only need to care about your model's ModelState, without reading thousands of lines of irrelevant code.
Performance Testing
Running the small model Qwen3-0.6B on GB200 (intentionally using a small model to amplify the impact of CPU overhead), throughput jumped directly from 16K to 25K:
In speculative decoding scenarios: 4-card GB200 + GLM-4.7-FP8 + MTP=1, TPOT reduced by 6.3%:
The improvement comes from the zero-synchronization design—CPU-GPU synchronization points are completely eliminated after enabling speculative decoding.
Try It Now
```shell
export VLLM_USE_V2_MODEL_RUNNER=1
# Then use vLLM normally, no code changes needed
```
Note, however, that MRV2 is currently experimental. In v0.18.0, several features are not yet supported: linear attention models (Qwen3.5, Nemotron 3 Super), speculative decoding methods other than Eagle/Eagle3/MTP, LoRA, etc.
Lao Zhang Says: MRV2 is a refactoring that cuts to the bone, but the direction is absolutely right. Moving input preparation to the GPU, achieving zero-sync asynchronous scheduling, and introducing the ModelState decoupling—these improvements are not just icing on the cake; they lay the foundation for future scenarios where heterogeneous models, speculative decoding, and multimodality coexist. The 56% throughput increase is just the beginning; as more features migrate to MRV2, the benefits will keep compounding.
Summary: vLLM March 2026 Panorama
| Update | Release Date | One-Sentence Summary |
|---|---|---|
| Semantic Router v0.2 Athena | March 10 | Evolved from a router to a system brain for multi-model orchestration |
| Nemotron 3 Super | March 11 | 120B total params/12B active, an MoE model tailor-made for multi-agents |
| P-EAGLE | March 13 | Handles all draft tokens in one forward pass; speculative decoding no longer has sequence bottlenecks |
| Model Runner V2 | March 24 | Complete overhaul of vLLM core engine: GPU-native + zero-sync + strong modularity |
Looking at these four releases together, vLLM's strategic intent is very clear:
Underlying Layer—MRV2 rebuilds the engine foundation, preparing for more complex inference demands.
Acceleration—P-EAGLE breaks through the ceiling again in the key optimization direction of speculative decoding.
Models—Nemotron 3 Super fills the ecological niche for efficient MoE models.
Upper Layer—Semantic Router Athena begins handling multi-model orchestration and agent scheduling.
From "inference engine" to "inference platform," vLLM is completing a leap from a tool to an ecosystem.
Relevant Links:
Semantic Router v0.2 Athena: https://vllm.ai/blog/v0.2-vllm-sr-athena-release
Nemotron 3 Super: https://vllm.ai/blog/nemotron-3-super
P-EAGLE: https://vllm.ai/blog/p-eagle
Model Runner V2: https://vllm.ai/blog/mrv2
vLLM Official Site: https://vllm.ai
Semantic Router GitHub: https://github.com/vllm-project/semantic-router
#vLLM #LargeModelInference #SpeculativeDecoding #Nemotron #SemanticRouter
Creating content isn't easy; if you found this article useful, please follow, and give it the triple hit: Like, Share, and View. A star 🌟 would be appreciated too. Thanks for reading; see you in the next one!