Nvidia's Open-Source Masterpiece: An 8B Small Model Beats GPT-5, Costs Only 30%, and Is 2.5x Faster! Nvidia's Research Director: Optimizing a Single LLM for Agents is Plain Wrong! Letting Small Models Manage Large Models is More Effective


Editor | Yun Zhao

Yesterday, Nvidia's founder, Jensen Huang, unveiled an astonishing framework for building Agentic systems called "BluePrint" at CES.


The very next day, a team from Nvidia publicly shared their own orchestration framework for Tool Use!

And this framework directly surpassed GPT-5 on the GAIA leaderboard!


Just moments ago, Nvidia Research Director Pavlo Molchanov announced that Nemotron-ToolOrchestra ranked first on the GAIA agent benchmark with an average score of 90.37%, surpassing tool-using competitors such as GPT-5 and Claude Opus and highlighting the potential of coordinated architectures in the AI agent field.

P.S.: GAIA is a benchmark designed specifically to evaluate the real agentic reasoning capabilities of AI assistants.


Moreover, ToolOrchestra was actually released as early as last November. It achieved quite impressive results back then.

The most striking result: it used a small 8B model to beat GPT-5 on the notoriously difficult "Humanity's Last Exam" benchmark with 37.1% accuracy, at less than 30% of GPT-5's overall cost and 2.5 times its speed.


Pavlo was clearly excited, stating that the results prove a point: by strengthening the coordination capabilities of small models, rather than relying on the reasoning power of ever-larger models, it is possible to build systems that surpass giant monolithic models while remaining efficient and cost-effective.

An orchestration framework that manages a series of models and tools

Pavlo then revealed the detailed research behind this framework: ToolOrchestra.


In fact, this framework was released at the end of last November, and a glance at the author list shows that more than 80% of the authors are of Chinese descent.

ToolOrchestra is a framework and model for training specialized orchestration LLMs, capable of efficiently coordinating tools and other models.

The core innovation of the framework is the training of a small 8B parameter coordinator, which decomposes complex tasks into subtasks, selects appropriate tools or models, and executes them efficiently in sequence, avoiding reliance on a single large model.

To put it simply, ToolOrchestra is a method for training small-scale orchestration models used to uniformly schedule various tools and specialized models.


In terms of the specific method, the Nvidia team used an end-to-end reinforcement learning approach to train the Orchestrator. The final experiments proved that this method enables an 8B model to learn adaptive tool-use strategies under the joint guidance of result quality, efficiency, and human preference rewards.

In short, by using reinforcement learning to train the "orchestrator," the model acquires an adaptive Tool Use strategy.

Why can it beat GPT-5 on HLE?

How can an 8B model, no matter how hard it is trained, surpass the most powerful GPT-5 on an extremely complex and difficult benchmark? It sounds counter-intuitive, until you look closely at the results.

  • Accuracy — Orchestrator-8B: 37.1%, GPT-5: 35.1%

  • Cost — Orchestrator is only 1/3 of GPT-5

Therefore, GPT-5's problem is not that it is "not strong enough," but that it is "too eager to do everything itself, or to have its brother models do it."

Many sub-problems could be solved more stably and cheaply using mathematical models, search, or code execution, but GPT-5 often:

"I think I can do it, let me think again."

While the Orchestrator focuses on being a good "dispatcher":

"This question shouldn't be for me to solve, I'll pass it to the more suitable one."

Intelligence is not about thinking the most, but about judging the most accurately.

Core Idea: Agent workloads should be layered: small models manage, large models work

The idea behind Nvidia's ToolOrchestra research is unique: it makes small models, instead of doing the hard work themselves, act as "commanders" for a collection of large models, small models, and external tools.


Pavlo stated that the core idea of its framework is the "layered thinking of Agent workloads":

1. Intelligence ≠ one model doing everything;

2. Intelligence = tool coordination + specialized models;

3. For difficult sub-tasks, use large models; for everything else, use small models;

4. A small commander decides what to call, when to call it, and why.

In plain language:

Use a small model specifically responsible for judgment and dispatching; the actual work is done by a group of models and tools called on demand by it.

There are three key roles in the entire system:

  • Orchestrator (8B model): Not responsible for solving problems, only for judgment, dispatching, and decision-making: who should be used next?

  • Tool pool, including: multiple models and external tools. Mainly: powerful but expensive large models, cheap but fast small models, search, functions, external tools, etc.

  • Reward system. The goal is not just to reward "getting it right," but also to reward: being economical, being reasonable, and being human-like. That is, being smart is not enough, you also need to know when to let whom do the work.
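The division of labor among the three roles above can be sketched in a few lines of Python. Everything here is illustrative: the tool names, per-call prices, and the hand-written toy policy are assumptions, and in the real system the orchestrator is a trained 8B model, not rules like these.

```python
# Sketch of the three-role setup: an orchestrator that only decides,
# and a tool pool that does the actual work. All names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    cost_per_call: float          # relative price per invocation
    run: Callable[[str], str]     # does the actual work

def orchestrate(task: str, tools: dict[str, Tool], decide) -> tuple[str, float]:
    """Run the decide() policy until it emits a final answer.

    `decide` stands in for the 8B orchestrator: given the task and the
    trace so far, it returns either ("call", tool_name, subtask) or
    ("answer", text). It never solves subtasks itself.
    """
    trace, spent = [], 0.0
    while True:
        action = decide(task, trace)
        if action[0] == "answer":
            return action[1], spent
        _, tool_name, subtask = action
        tool = tools[tool_name]
        trace.append((tool_name, subtask, tool.run(subtask)))
        spent += tool.cost_per_call

# Toy tool pool: an "expensive large model" and a "cheap calculator".
tools = {
    "big_llm": Tool("big_llm", 1.00, lambda q: f"reasoned: {q}"),
    "calc":    Tool("calc",    0.01, lambda q: str(eval(q))),
}

# A trivial hand-written policy just to exercise the loop; the real
# orchestrator learns this routing via RL instead.
def toy_policy(task, trace):
    if not trace:
        return ("call", "calc", "2 + 3")
    return ("answer", trace[-1][2])

answer, cost = orchestrate("what is 2 + 3?", tools, toy_policy)
print(answer, cost)  # prints: 5 0.01
```

The point of the structure is that the orchestrator's output is purely a routing decision, so the expensive model is only ever touched when the policy explicitly chooses it.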

Design philosophy: Orchestration first, not relying on manually written rules

Pavlo explained in the post that the reward-system design of Orchestrator-8B differs from previous agent design methods in that it emphasizes an orchestration-first philosophy. (Previously, the more popular methods were prompt heuristics, hand-written strategies, and the like.)


Its only task is to make decisions:

• Select tools and models

• Order multi-step workflows

• Weigh the trade-offs between accuracy, cost, and latency

Execution is fully delegated.

No prompt heuristic methods. No hand-written policies. Just a model trained for orchestration.

A design point easily overlooked: Using RL, not Prompt

It is worth noting here that it uses RL to train orchestration, not prompts (heuristic rules or hand-written strategies).

This implies a clear signal from the team:

"Teaching a model to be a commander simply with prompts" does not work.

Reasons include:

  • Self-enhancement bias (preferring to use itself or its brother models)

  • Defaulting to the strongest model

  • Not sensitive to costs and preferences

This actually provides a great idea for the entire agent community:

To achieve truly controllable, reproducible, and cost-controllable agent behavior, RL + explicit reward structure is a viable path.


The design of the reward system is also worth studying

In addition, the most critical and core design in this paper is none other than the design of the reward system.

In previous Agent systems, the core problem was usually:

Can it use tools?

While ToolOrchestra solves a problem at another level:

Is it worth using GPT-5 for this step? Would using another model or tool be more suitable?

For this reason, the paper introduces three types of reward signals during training:

  1. Outcome reward: Is the answer correct?

  2. Cost reward: Is calling the strong model a "necessary expense"?

  3. Preference reward: Does the scheduling method conform to human intuition for "reasonable decisions"?

This is where it differs from previous approaches. In fact, the industry has always had some misconceptions, often defaulting to: the smarter the model, the stronger its Tool Use capability.

If the quality of the Agent's output is not high, just switch to a more powerful model. A stronger model means higher quality results.

But in reality, Nvidia's research shows:

Simply switching to a stronger large model may not be as reliable as imagined, and it is also more wasteful of money.

Past agents only cared about "getting it right." The ToolOrchestra paper breaks the agent's objective into three things and optimizes all of them simultaneously:

| Dimension | Past agents | ToolOrchestra |
| --- | --- | --- |
| Correctness | optimized | optimized |
| Cost | ignored | optimized |
| User preference | ignored | optimized |

Notice this detail: if the final answer is wrong, the cost and preference rewards are both zeroed out.

Note that this is a very engineering-oriented, very realistic trade-off logic: it is not "saving money for the sake of saving money," but "saving as much as possible on the premise of being correct."
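That gating logic can be sketched as a reward function. This is a minimal illustration, not the paper's exact formulation: the function name, weights, and budget normalization are all assumptions; the only property taken from the article is that a wrong answer voids the cost and preference terms.

```python
# Gated reward sketch: cost and preference rewards only count when the
# final answer is correct. Weights and normalization are assumptions.
def orchestration_reward(correct: bool, cost: float, budget: float,
                         pref_score: float, w_cost: float = 0.5,
                         w_pref: float = 0.5) -> float:
    if not correct:
        return 0.0                                 # wrong answer voids everything
    cost_reward = max(0.0, 1.0 - cost / budget)    # cheaper run => higher reward
    return 1.0 + w_cost * cost_reward + w_pref * pref_score

# Correct and cheap beats correct but expensive; wrong scores zero.
print(orchestration_reward(True,  cost=0.2, budget=1.0, pref_score=0.8))
print(orchestration_reward(True,  cost=1.0, budget=1.0, pref_score=0.8))
print(orchestration_reward(False, cost=0.0, budget=1.0, pref_score=1.0))
```

Under a reward shaped like this, an RL-trained policy has no incentive to cut costs at the expense of correctness, because a wrong answer collapses the whole reward to zero.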

The paper has an implicit stance here:

Tool scheduling is a strategy optimization problem, not an instruction execution problem.

Experimental results show: only 40% of steps called GPT-5, yet the effect was better.

An experiment on the τ²-Bench function-calling benchmark produced an interesting result:

  • Across the complete task flow, only about 40% of the steps called the large model (GPT-5)

  • The remaining steps used cheaper models or tools

  • Yet overall performance still exceeded that of agents that call a large model at every step
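A back-of-the-envelope calculation shows why this routing pattern saves money. The per-step prices below are made up for illustration; only the 40% routing fraction comes from the article.

```python
# Blended cost per step when only a fraction of steps hit the big model.
BIG, CHEAP = 1.00, 0.05           # assumed relative cost per step
frac_big = 0.40                   # share of steps routed to GPT-5

blended = frac_big * BIG + (1 - frac_big) * CHEAP
all_big = 1.0 * BIG

print(f"blended cost per step: {blended:.2f}")   # 0.43
print(f"vs all-large-model:    {all_big:.2f}")   # 1.00
print(f"savings: {1 - blended / all_big:.0%}")   # 57%
```

Under these assumed prices, routing saves over half the cost even though the expensive model still handles the hardest 40% of steps; the cheaper the fallback tools, the closer the savings approach the routing fraction itself.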

On multiple high-difficulty tasks, the 8B small-model commander outperforms GPT-5 across the board, and shows high-level general reasoning capabilities

What is even more valuable is that the team's experiments found that the Orchestrator trained through ToolOrchestra not only defeated GPT-5 on "HLE," but also achieved the best scores on multiple high-difficulty reasoning benchmarks, such as τ²-Bench, which is specifically used to test "function-calling type Agents," and FRAMES, which is used to test "factual reasoning."

Note: the Orchestrator achieved this while using only a fraction of the compute and wall-clock time of frontier models, and it maintains robust generalization to unseen tasks and tools (which is very impressive).


The results presented in the paper are already striking enough:

  • On high-difficulty reasoning benchmarks, 8B Orchestrator > GPT-5

  • On multi-step tool calling, function execution, and other tasks, performance is stable, with strong generalization ability

  • When changing tasks or tool combinations, the strategy still holds

But what is truly important is not "winning once." Let's summarize:

1. On HLE, a benchmark covering multi-disciplinary and highly difficult problems, Orchestrator significantly outperformed previous methods at a much lower computational cost.

2. On the τ²-Bench function calling benchmark, Orchestrator demonstrated the ability to efficiently schedule multiple tools: it only called large models (GPT-5) for about 40% of the steps in the entire process, while the remaining steps used cheaper models or tools, but the overall performance still exceeded that of intelligent agents that called large models at every step.

3. In addition, the evaluation on the FRAMES factual reasoning benchmark task also provided additional evidence for the universality and robustness of the Orchestrator. The team observed that although there were significant differences in the nature of the training and testing tasks, the Orchestrator trained through reinforcement learning was still able to adaptively adjust its tool usage strategy to cope with new challenges, indicating that it possesses a higher level of general reasoning capability.


Nvidia Research Director's Sharp Commentary

Optimizing a single large model for Agents is a mistake

If we only treat such results as a "benchmark PK" narrative, it would be too superficial.

More notably, it is quietly changing the focus of the narrative.

Nvidia's Research Director personally pointed out the significance of this research result for Agent development.

Why is this important?

Agent workloads are inherently:

• Multi-round
• Multi-tool
• Multi-model

Therefore, optimizing a single, massive LLM for them is the wrong abstraction.

ToolOrchestra shows a different path:

• Small models
• Modular systems
• Controllable behavior
• Better scaling through coordination rather than parameters

New direction for Agent development: Intelligence comes from management.

Small models can manage large models.

That is, this research releases a signal:

For Agents, the upper limit of intelligence no longer depends solely on the scale of the model, but shifts to the decision-making structure.

In other words, the model is no longer the only core asset.

"How to use it, when to use it, and which model to choose" itself may become the battleground after 2026.

Like the wonderful experiment Nvidia released in this article:

A small 8B model, though not a general-purpose expert itself, can become GPT-5's "superior dispatcher." The managed system is not only more accurate on complex tasks, but also wins decisively on speed and cost.

Intelligence, perhaps, is shifting from being "calculated" to being "managed."

Paper address:

https://arxiv.org/abs/2511.21689

https://research.nvidia.com/labs/lpr/ToolOrchestra/

Project open-source address:

https://github.com/NVlabs/ToolOrchestra

