NVIDIA Just Disrupted Itself: Autonomous Agents Evolve for 7 Days, Rendering Operator Engineers and GPU Experts Obsolete

Machine Heart Editorial Team

This may be the most explosive piece of breaking news today.

It has already sparked an uproar in numerous WeChat groups dedicated to operator development.

"This may be the first true manifestation of superhuman intelligence in the software domain," Bing Xu from NVIDIA just declared on X. He was commenting on a new NVIDIA research project called AVO, co-authored by himself, Terry Chen, and Zhifan Ye.

In this study, submitted to arXiv just this Thursday, NVIDIA constructed the Agentic Variation Operator (AVO). This is a novel class of evolutionary mutation operators that replaces the fixed mutation, crossover, and manually designed heuristics of classical evolutionary search with autonomous coding agents, achieving truly staggering practical performance.

Bing Xu stated, "In some highly optimized attention mechanism workloads, the agent can continuously search within the optimization loop for 7 days without any human intervention, thereby surpassing almost all human GPU experts." Such performance from AVO is likely to make many kernel and DSL developers tremble.

[Image: Huang Zhipeng's post on X]

Interestingly, in his X post, Bing Xu shared that a year and a half ago, when he and Terry Chen first began researching agent programming at NVIDIA, they did not understand GPU programming themselves. "Therefore, from the very beginning, we were committed to developing a fully automated system requiring no human intervention." They call this approach "blind coding."

"Over the past year and a half, the two of us have developed four generations of agents across two agent systems. Starting from the second generation, these agent stacks began to self-evolve. Now, each agent's codebase consists of approximately 100,000 lines of code (excluding empty lines)."

He also emphasized the profound significance behind AVO: "I would bet that blind coding is the future of software engineering. Human cognitive capacity is the bottleneck."

Let's take a detailed look at what contributions this paper, which may usher in the era of "blind coding," has actually made.

Large Language Models (LLMs) have become powerful components in Evolutionary Search, replacing hand-crafted mutation operators with learned code generation. In these systems, the LLM generates candidate solutions based on selected parents, while a typically heuristic-based framework handles parent sampling, evaluation, and population management. This combination has achieved remarkable results in mathematical optimization and algorithm discovery, including flagship systems like FunSearch and AlphaEvolve.
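This division of labor can be sketched in a few lines of Python. This is a toy illustration of the FunSearch-style pattern, not code from any of these systems; `llm_mutate` and `evaluate` are hypothetical stand-ins:

```python
import random

def evolutionary_search(seed, llm_mutate, evaluate, population_size=8, generations=5):
    """FunSearch-style skeleton: the LLM is confined to a single role,
    mapping one sampled parent to one candidate per call, while the
    surrounding framework handles sampling, scoring, and the population."""
    population = [(seed, evaluate(seed))]
    for _ in range(generations):
        sample = random.sample(population, min(3, len(population)))
        parent, _ = max(sample, key=lambda p: p[1])   # tournament parent selection
        child = llm_mutate(parent)                    # one shot: no tools, no retries
        population.append((child, evaluate(child)))   # framework scores and stores it
        population = sorted(population, key=lambda p: p[1])[-population_size:]
    return max(population, key=lambda p: p[1])

# Toy stand-ins: "programs" are integers, the "LLM" perturbs them,
# and fitness is closeness to 100.
best, score = evolutionary_search(
    seed=0,
    llm_mutate=lambda x: x + random.randint(-5, 20),
    evaluate=lambda x: -abs(100 - x),
)
```

Note that the model appears only on the `llm_mutate` line: everything else — who gets sampled, how candidates are scored, what survives — is fixed by the framework.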

However, restricting LLMs to the role of candidate-solution generator within a preset pipeline fundamentally limits their discovery capabilities: each call produces only a single output, with no ability to proactively consult reference materials, test changes, interpret feedback, or correct course before submitting a candidate. This limitation is particularly pronounced for implementations that have already been exhaustively hand-tuned by human experts and require deep iterative engineering to improve further.

Researchers investigated this issue within the context of attention mechanisms. The attention mechanism is the core operator of the Transformer architecture and one of the most densely optimized GPU operators. The FlashAttention series and NVIDIA's cuDNN library have pushed the attention throughput of successive GPU generations to hardware limits; on the latest Blackwell architecture, both FlashAttention-4 (FA4) and cuDNN require months of manual optimization. To surpass these implementations requires continuous, iterative interaction with the development environment: researching hardware documentation, analyzing Profiler outputs to identify bottlenecks, implementing and testing candidate optimizations, diagnosing correctness failures, and refining strategies based on accumulated experience.

Recent advances in Deep Agents suggest that LLMs equipped with planning, persistent memory, and tool-use capabilities can autonomously handle such multi-step engineering workflows, ranging from solving complex GitHub issues to generating critical deep learning software. This prompts a radically different role for LLMs in evolutionary search: rather than confining them within a fixed pipeline, elevate the Deep Agent to become the mutation operator itself.

To this end, NVIDIA proposed the Agentic Variation Operator (AVO). In this paradigm, a self-directed code agent replaces the mutation and crossover processes of previous single-turn LLM or fixed workflow systems. The AVO agent has access to all previous solutions, domain-specific knowledge bases, and evaluation tools. It can autonomously decide what to consult, what to modify, and when to evaluate, thereby enabling continuous improvement over long cycles.

To validate its effectiveness, NVIDIA applied AVO to Multi-Head Attention (MHA) kernels on the NVIDIA Blackwell B200 GPU, comparing it directly against expert-optimized cuDNN and FlashAttention-4 kernels. In a 7-day continuous autonomous evolution without human intervention, the agent explored over 500 optimization directions and evolved 40 kernel versions. The final generated MHA kernel achieved a throughput of up to 1,668 TFLOPS in BF16 precision, surpassing cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% in the test configurations.

Analysis of the optimization schemes discovered by the agent revealed that these optimizations covered multiple layers of kernel design, including register allocation, instruction pipeline scheduling, and load distribution, reflecting true hardware-level reasoning. Experiments showed that optimization techniques discovered on MHA could be effectively transferred to Grouped-Query Attention (GQA): the agent required only 30 minutes of additional autonomous adaptation to support the evolved MHA kernel on GQA, achieving performance improvements of up to 7.0% over cuDNN and 9.3% over FlashAttention-4.

The main contributions of this study are as follows:

  • Proposal of Agentic Variation Operators (AVO): A novel class of evolutionary mutation operators that elevates agents from mere candidate generators to mutation operators. Agents autonomously explore domain knowledge, implement modifications, and verify results through iterative interaction with the environment.

  • Achievement of SOTA Performance: On the NVIDIA B200 GPU, researchers achieved state-of-the-art MHA throughput in benchmark configurations, reaching 1,668 TFLOPS. This performance surpasses cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5%. Furthermore, they demonstrated that these optimizations can be easily migrated to GQA, yielding significant performance gains with only 30 minutes of autonomous evolution.

  • Micro-architecture Optimization Analysis: Researchers conducted a detailed analysis of the micro-architectural optimizations discovered by the agent in benchmark settings, indicating that the agent performs genuine hardware-level reasoning rather than superficial code transformations.

Saying Goodbye to Pipelines:
AI Agents Become the True "Evolutionary Helmsmen"

In traditional LLM-based evolutionary search frameworks, models are often trapped in fixed pipelines, serving merely as generators of candidate code. They can only output one result per call, unable to proactively consult references, test code, understand feedback, or correct strategies before final submission. For top-tier hardware optimization tasks requiring deep, repeated iterations, this limitation is fatal.

AVO breaks this limitation by instantiating the "mutation operator" as a self-driven agent loop. This AI agent can freely access previous code version histories, call upon domain-specific knowledge bases (such as CUDA programming guides and PTX architecture documents), and proactively propose, fix, critique, and validate code modifications based on execution feedback.
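Under illustrative interfaces (none of these tool names come from the paper), the contrast with a single-shot generator might look like the loop below: the operator itself decides when to profile, consult documentation, repair a broken patch, and re-measure before keeping a change.

```python
from types import SimpleNamespace

def agentic_variation_operator(parent_kernel, tools, budget=20):
    """Hedged sketch of an AVO-style mutation operator (interfaces are
    illustrative, not NVIDIA's implementation). Instead of emitting one
    candidate per call, the agent interleaves profiling, doc lookup,
    editing, and testing, and keeps only verified improvements."""
    candidate = parent_kernel
    for _ in range(budget):
        bottleneck = tools.profile(candidate)       # e.g. register spills, sync stalls
        notes = tools.consult_docs(bottleneck)      # CUDA / PTX knowledge base
        patched = tools.edit(candidate, bottleneck, notes)
        ok, feedback = tools.test(patched)          # correctness before performance
        if not ok:
            candidate = tools.repair(patched, feedback)
            continue
        if tools.measure(patched) > tools.measure(candidate):
            candidate = patched                     # keep only verified improvements
    return candidate                                # handed back to the outer search

# Toy environment: a "kernel" is an int whose value doubles as its measured speed.
toy = SimpleNamespace(
    profile=lambda k: "stall",
    consult_docs=lambda b: "overlap pipelines",
    edit=lambda k, b, n: k + 1,
    test=lambda k: (k % 7 != 0, "sync bug"),        # occasional correctness failure
    repair=lambda k, fb: k + 1,
    measure=lambda k: k,
)
child = agentic_variation_operator(1000, toy)
```

The key structural difference from the previous pattern: the loop over attempts lives *inside* the operator, so failed candidates are diagnosed and repaired rather than discarded.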

In short, AVO elevates AI from a passive "code generator" to an all-encompassing "evolutionary helmsman."

7 Days of Autonomous Operation:
Defeating Top Benchmarks on the Blackwell Architecture

The research team deployed AVO on an extremely challenging task: optimizing the core code for Multi-head Attention (MHA) on the NVIDIA Blackwell (B200) GPU. The attention mechanism is the core of the current Transformer architecture and one of the most intensively optimized computational targets on AI chips.

With absolutely no human intervention, the AVO agent ran continuously and autonomously for 7 days.

During these 7 days, the agent explored over 500 optimization directions in the background and ultimately submitted 40 valid iterative versions. Eventually, the MHA core it generated achieved a throughput of up to 1,668 TFLOPS in BF16 precision.
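For context on what such a TFLOPS figure means: attention throughput is conventionally computed from the matmul FLOP count of Q·Kᵀ and P·V divided by measured runtime. The shapes and runtime below are illustrative assumptions chosen to land near the reported figure, not the paper's actual benchmark configuration:

```python
def attention_tflops(batch, heads, seq_len, head_dim, seconds):
    """Achieved throughput of a forward attention kernel. The standard
    convention charges 2*S*S*D FLOPs per head for Q@K^T and the same for
    P@V (softmax FLOPs are ignored), hence the factor of 4."""
    flops = 4 * batch * heads * seq_len**2 * head_dim
    return flops / seconds / 1e12

# Hypothetical workload: B=4, H=32, S=8192, D=128 finishing in ~2.64 ms
# lands near the 1,668 TFLOPS reported for the evolved kernel.
tp = attention_tflops(4, 32, 8192, 128, 2.637e-3)
```

The same formula is what makes cross-implementation comparisons meaningful: every kernel is charged the same nominal FLOP count, so the ratio of TFLOPS numbers is exactly the ratio of runtimes.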

In benchmark tests, the results submitted by AVO were astonishing:

  • Compared to NVIDIA's official closed-source cuDNN library customized for Blackwell, throughput increased by up to 3.5%.

  • Compared to the current state-of-the-art open-source benchmark FlashAttention-4, throughput increased by up to 10.5%.

Powerful Generalization Capability:
30 Minutes to Migrate to Grouped-Query Attention

Even more impressive is that these low-level micro-architectural optimizations discovered by the agent are not overfitted to specific scenarios. When researchers asked AVO to adapt the optimized MHA core to Grouped-Query Attention (GQA), which is commonly used in today's large models, the agent completed the task with only about 30 minutes of autonomous adjustment.

In GQA tests, AVO maintained an absolute leading advantage, with performance up to 7.0% higher than cuDNN and 9.3% higher than FlashAttention-4. This indicates that the computation and memory access optimization patterns discovered by the agent during MHA evolution can effectively generalize to GQA tasks with different computational characteristics.

Deep Micro-architectural Reasoning

Analyzing the code changes submitted by AVO reveals that the AI agent was not just doing surface-level work but engaged in true, deep logical reasoning at the hardware level:

  • Branch-free Accumulator Rescaling: By eliminating conditional branches, the agent removed the overhead of warp synchronization and replaced it with lighter-weight memory barriers, boosting non-causal attention throughput by 8.1% in one go.

  • Overlapping Error Correction and Tensor Core (MMA) Pipelines: The agent reorganized the execution pipeline, transforming sequentially executed dependencies into overlapping pipeline execution, significantly reducing hardware idle wait times.

  • Cross-Warp Group Register Rebalancing: By analyzing profiler data, the agent discovered that certain warp groups were starved of registers, causing spills into slower local memory. It decisively rebalanced Blackwell's budget of 2,048 registers across warp groups, squeezing out a further 2.1% performance gain.
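The first of these tricks exploits an algebraic identity in online softmax: rescaling the running accumulators by exp(m_old − m_new) is a no-op whenever the running maximum is unchanged, so the rescale can be applied unconditionally and the "did the max move?" branch (with its divergent synchronization path on GPU) disappears. Below is a NumPy sketch of that identity; the agent's actual change lives in a Blackwell CUDA kernel, not code like this:

```python
import numpy as np

def online_softmax_weighted_sum(scores_blocks, value_blocks):
    """Block-wise softmax-weighted sum with branch-free accumulator
    rescaling: every block multiplies the running sum l and accumulator
    acc by exp(m - m_new), which is exactly 1.0 when the maximum did not
    change, so no conditional is needed."""
    m = -np.inf                                  # running max of attention scores
    l = 0.0                                      # running sum of exp(scores - m)
    acc = np.zeros_like(value_blocks[0][0], dtype=np.float64)
    for s, v in zip(scores_blocks, value_blocks):
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)                # == 1.0 whenever max is unchanged
        l = l * scale + np.exp(s - m_new).sum()
        acc = acc * scale + np.exp(s - m_new) @ v
        m = m_new
    return acc / l

# Process attention scores/values in three blocks of four rows each.
rng = np.random.default_rng(0)
s_blocks = [rng.normal(size=4) for _ in range(3)]
v_blocks = [rng.normal(size=(4, 8)) for _ in range(3)]
out = online_softmax_weighted_sum(s_blocks, v_blocks)
```

The result matches a single softmax over all scores at once, which is why the rescale-unconditionally trick is safe: correctness is preserved by the identity, and only the control flow changes.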

NVIDIA's study proves that AI agents now possess the capability to perform joint reasoning across multiple hardware subsystems, such as synchronization, memory ordering, pipeline scheduling, and register allocation. As an evolutionary mutation operator not limited to specific domains, AVO points the way for future automated software system optimization. It can not only be used for AI chip and deep learning underlying ecosystem development but is also expected to shine in all scientific and engineering fields with extreme demands on computing power in the future.

With AI agents' self-evolution reaching this level, are you scared?

Reference Links

https://x.com/bingxu_/status/2036983004200149460?s=46

https://x.com/nopainkiller/status/2036986666410532972

© THE END

Please contact our official account for reprint authorization.

Submissions or media inquiries: liyazhou@jiqizhixin.com

