TL;DR
It's GTC season again, coinciding this year with the 20th anniversary of CUDA. Jensen's Keynote reviewed the CUDA ecosystem, then declared the arrival of the inference era and predicted sustained growth in the inference market. This was followed by the most exciting hardware release segment, featuring the Groq 3 LPU and the entire Rubin series family, along with several notable changes. Next came the release of OpenClaw and Nvidia's own NemoClaw, and finally a segment on Physical AI and Robotics. We will now review the Keynote section by section; the video replay is available via the GTC 2026 Keynote link [1].
1. 20 Years of CUDA
This year marks the 20th anniversary of CUDA's release.
Jensen Huang started by recalling the journey from programmable pixel shaders in 2001. Readers unfamiliar with this history can refer to my earlier special topic "History of GPU Architecture Evolution".
He then introduced parts of the CUDA ecosystem. The first example was a demo of DLSS 5 on RTX. DLSS 5 introduces real-time neural rendering models that inject realistic lighting and material effects directly into pixels, bridging the gap between rendering and reality and enabling game developers to achieve Hollywood-level visual fidelity.
He continued by introducing cuDF for processing structured data, and cuVS (vector search) for processing unstructured data.
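For readers who haven't used them, here is a minimal cuDF sketch; the DataFrame contents are made up for illustration, and cuVS follows a similar GPU-accelerated pattern for vector indexes:

```python
import cudf  # RAPIDS GPU DataFrame library (pandas-compatible API)

# A tiny GPU-resident DataFrame; the contents are made up.
df = cudf.DataFrame({
    "ticker": ["NVDA", "NVDA", "AMD", "AMD"],
    "price":  [1100.0, 1120.0, 210.0, 215.0],
})

# GroupBy/aggregation runs on the GPU, mirroring pandas semantics.
print(df.groupby("ticker")["price"].mean())
```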
He demonstrated the entire software stack through examples from GCP, AWS, Azure, Oracle, CoreWeave, and their customers, making sure every key partner got a mention. Below, each cloud provider is listed with its typical customers.
He then introduced some on-prem deployments in collaboration with Dell:
Next, Jensen introduced applications across multiple industries. Interestingly, he gave a detailed explanation of the shift in quantitative finance from traditional feature engineering to AI models that discover factors automatically. However, he seemed to stumble slightly when introducing Telco; could this be related to difficulties with AI RAN itself?
Jensen then continued by introducing his friends, a series of AI-native companies. Interestingly, three Chinese model companies were featured: DeepSeek, Kimi, and Qwen. Why weren't Zhipu and MiniMax, both already listed companies, included?
2. The Era of Inference
Jensen recalled several representative moments from the past two to three years: the LLM era brought by ChatGPT, the LRM era brought by o1, and the Agentic era brought by Claude Code. The concept of "Inference Inflection" is also quite interesting. Could there be another 100x growth in the future?
This essentially declares that the era of inference has fully arrived. Jensen then discussed the order book for 2026, sending the stock price vertically upward for a moment. It fell back quickly, however, showing that the impact of high-frequency programmatic trading remains intense.
He also predicted sustained growth for the entire market.
He then began to emphasize the changes brought by Blackwell, such as NVL72 and NVFP4, and their optimizations for inference: reduced power consumption, increased performance, and rapidly declining inference costs.
Regarding inference speed, isn't the chart below a bit odd? The model used is Kimi K2.5, yet it ranks near the bottom... What does this imply? Does it prove that advanced cards really aren't being exported to China?
He once again emphasized the concept of the "AI Factory".
3. Hardware
To open this section, Jensen used a video tracing the evolution from the earliest DGX through a decade of Volta, Ampere, Hopper, and Blackwell. Then the full picture of the Rubin generation was revealed, with the Groq 3 LPU as the highlight of this release.
Jensen then displayed the Groq 3 LPU Compute Tray, NVL6 Switch Tray, and Rubin Compute Tray.
The Groq 3 LPU will be discussed in detail later. Next came the BF4 storage server built from CX9 + Vera, the Vera CPU Tray, and the CPO switch.
Let's talk about the storage server first. It was originally advertised that CX9 and Grace would be packaged together into a single DPU, but the actual product displayed showed a CX9 alongside independent Grace chips. Recently, due to storage density demands and insufficient Grace PCIe lanes, the design switched to pairing CX9 with the Vera CPU. Honestly, though, how different is this from a standard x86 CPU + CX9? Furthermore, Nvidia's accumulated expertise in storage is still quite limited; there is much they haven't figured out about how a DPU can properly support storage applications.
Then there is the Vera Compute Tray for Agentic workloads. A single Vera Compute Tray integrates 8 Vera processors, each with 88 cores and 8-channel LPDDR5X memory; a single socket supports 1.2TB/s of memory bandwidth. The tray also integrates 2 BF4 DPUs.
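For reference, the aggregate figures those quoted specs imply for one tray (simple multiplication, nothing more):

```python
# Aggregate figures implied by the quoted Vera Compute Tray specs.
sockets_per_tray = 8
cores_per_socket = 88
bw_per_socket_tbs = 1.2     # TB/s from 8-channel LPDDR5X

print(sockets_per_tray * cores_per_socket, "cores per tray")             # 704
print(sockets_per_tray * bw_per_socket_tbs, "TB/s aggregate bandwidth")  # 9.6
```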
We will analyze the topic of CPO Switches later.
Another interesting point is that Jensen believes the Oberon rack structure, based on CableTray and liquid cooling, greatly helps rapid deployment, so an Ethernet 256 version was also provided. My guess is that the Switch Tray is simply replaced with an Ethernet switch. The entire rack supports 32 Vera ComputeTrays, connecting 256 CPUs into the whole rack. Technically, it is likely still the BF4 DPU, connected via CableTray, using the multi-plane technique introduced with CX8/CX9: one CX9 800Gbps port is split into eight 112G lanes, each connecting to one of 8 Switch Trays. Reusing the relatively mature Oberon rack structure (the diagrams are inconsistent here; at this point Jensen showed a rack with only 2 Switch Trays, while the later Roadmap page shows 8) avoids complex fiber cabling and also cuts the power consumption of optical modules.
Note that this ETH256 is just a standard Ethernet front-end network connected via CableTray.
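To make the guessed multi-plane layout concrete, here is a minimal sketch of the port mapping it would imply; the 32-tray/8-plane numbers come from the paragraph above, but the wiring scheme itself is my assumption, not something Nvidia confirmed:

```python
# Hypothetical ETH256 multi-plane wiring, following the guess above:
# 32 Vera ComputeTrays x 8 CPUs = 256 endpoints, and each endpoint's
# CX9 splits one 800Gbps port into eight ~112G lanes, one per Switch Tray.
NUM_TRAYS, CPUS_PER_TRAY, NUM_PLANES = 32, 8, 8

links = [
    (f"tray{t:02d}/cpu{c}/lane{p}", f"switch_tray{p}")
    for t in range(NUM_TRAYS)
    for c in range(CPUS_PER_TRAY)
    for p in range(NUM_PLANES)
]

# 256 endpoints x 8 planes = 2048 lanes in total, so each Switch Tray
# terminates exactly 256 lanes (one per CPU).
print(len(links), sum(1 for _, sw in links if sw == "switch_tray0"))  # 2048 256
```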
The image below shows, on the left, the structure of the Oberon ETH256 rack supporting Vera CPUs, with 16 Vera ComputeTrays each in the top and bottom halves. On the right is the Oberon rack for BF4 storage; there appears to be no backplane CableTray, since no connectors were visible on the storage servers' backplane, only power interfaces.
Next, the Rubin Ultra and the Kyber Rack Midplane were released.
However, the die size of Rubin Ultra seems inconsistent with earlier promotional material, and both boards appear to be display versions. The ComputeTray holds 4 Rubin Ultras and 2 Vera CPUs, plus 4 CX9s and one BF4 DPU, and is also configured with 4 NVMe drive slots.
The ComputeTray is mounted vertically and connects to the Kyber midplane. Looking closely, the rack holds 18 ComputeTrays.
Finally, there is the switch backplane, also mounted vertically. The midplane-less orthogonal architecture did not appear. The main reason is that, with the original CableTray wiring runs being too long, this generation adopted a midplane to build the shuffle wiring, grouping the SerDes lanes of the 18 front-panel ComputeTrays and routing each group to a different slot on the rear panel.
Then came the performance growth brought by Rubin NVL72, and the revenue growth it enables under power-constrained data centers, continuing the Rubin pitch.
The push for lower power consumption and faster inference is also what motivates the Groq acquisition. Judging from the revenue figures shown for Groq 3 LPX, at the same power consumption Vera-Rubin + Groq 3 LPX roughly doubles energy efficiency relative to Rubin alone.
Then there is the solution of using Rubin for Prefill and Groq 3 for Decode.
For a detailed architectural analysis of Groq, refer to "Discussing Groq, Valued at 20B by NV". A close look at Prefill and Decode workloads shows that Prefill is compute-bound and memory-capacity-bound, while Decode is memory-bandwidth-bound; to a large extent, Decode simply needs more memory bandwidth. Accordingly, the single Groq 3 LPU's SRAM capacity has grown to 500MB (from 220MB in the first generation), and bandwidth has grown from 80TB/s to 150TB/s. However, this generation only supports FP8; NV will follow with the Groq 3.5 (LP35) to support NVFP4.
Jensen then compared a Rubin GPU against a ComputeTray of 8 Groq 3 LPUs, and the trend is clear: the Prefill node favors higher compute and larger memory capacity, while the Decode node focuses on memory bandwidth.
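To see why this split makes sense, here is a rough roofline-style sketch. The model size, prompt length, and FP8 weights are hypothetical numbers I chose for illustration; only the 150TB/s figure comes from the keynote:

```python
# Rough roofline sketch: why Prefill is compute-bound while Decode is
# bandwidth-bound. All model numbers here are illustrative.
params = 70e9          # dense model parameters (hypothetical)
bytes_per_param = 1    # FP8 weights
prefill_tokens = 8192  # prompt length processed in parallel

flops_per_token = 2 * params           # ~2 FLOPs per parameter per token
weight_bytes = params * bytes_per_param

# Prefill: one weight read is amortized across the whole prompt.
prefill_intensity = flops_per_token * prefill_tokens / weight_bytes  # FLOPs/byte
# Decode (batch=1): every generated token re-reads all weights.
decode_intensity = flops_per_token / weight_bytes                    # ~2 FLOPs/byte

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte, decode: {decode_intensity:.0f} FLOPs/byte")
# Decode at ~2 FLOPs/byte sits far below any modern accelerator's
# compute/bandwidth ratio, so decode speed is set by memory bandwidth:
# e.g. 70 GB of weights / 150 TB/s SRAM-class bandwidth ~= 0.5 ms/token.
print(f"decode floor: {weight_bytes / 150e12 * 1e3:.2f} ms/token at 150 TB/s")
```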
Additionally, we noticed that the original Rubin CPX solution seems to have been cancelled. My rough guess is that DDR prices have risen terribly, and the 1:1 ratio of the Rubin CPX solution already had plenty of issues; for a detailed analysis, refer to "Detailed Analysis of Nvidia Rubin CPX". Looking closely at the Agentic LLM workload: since the context usually exceeds 200K tokens and will reach 1M in the future, moving KVCache requires far more bandwidth, and the PCIe-attached Rubin CPX may simply fall short.
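To gauge how heavy that KVCache traffic gets, a quick estimate; the model shape (layers, KV heads, head dim) is entirely my assumption, and only the 200K/1M context lengths come from the text:

```python
# KVCache size per request for long Agentic contexts. The model shape
# is assumed; only the 200K and 1M context lengths come from the text.
layers, kv_heads, head_dim = 80, 8, 128   # hypothetical GQA-style model
bytes_per_elem = 1                        # FP8 KVCache

def kvcache_gb(context_len: int) -> float:
    # 2x accounts for both K and V tensors.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

for ctx in (200_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kvcache_gb(ctx):6.1f} GB")
# ~33 GB at 200K and ~164 GB at 1M per request; shuttling that over a
# PCIe Gen5 x16 link (~64 GB/s per direction) takes seconds, which is
# why a PCIe-attached Prefill chip starts to look inadequate.
```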
Interestingly, regarding how to use Groq while separating Prefill and Decode (PD), Jensen drew an AFD (Attention-FFN Disaggregation) diagram here: Rubin continues to handle Attention, while the Groq 3 LPU handles only the FFN. Several questions here don't hold up to scrutiny. First, EP traffic is transmitted across racks; over what network? If it's ScaleOut, there is only one BF4 on the LPX ComputeTray. Another question is how Groq's deterministic execution supports MoE. If the Rubin Attention node computes the MoE gate indices and writes them into the packet so that only one copy is sent to the whole LPX rack, with dispatch and combine performed inside the rack, the cross-rack interconnect bandwidth becomes a bottleneck; does Groq then apply some internal masking to skip the unselected experts? Alternatively, Rubin could dispatch tokens one by one directly to the corresponding LPU for FFN, buffering uncomputed tokens in the LPU's external I/O buffer and combining after computation, which places even higher demands on the inter-rack bandwidth. There also seems to be no ScaleUP interconnect between the two; after all, they use different protocols (LPU C2C and NVLink).
Another issue is that for models exceeding 1T parameters, the cumulative SRAM capacity of 256 LPUs in a single LPX rack is only 128GB, which seems insufficient to hold the expert parameters (currently Groq 3 only supports FP8). So the AFD solution as presented doesn't really hold together, and it remains unclear how NV plans to solve these problems.
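Both objections, interconnect bandwidth and SRAM capacity, can be put into rough numbers. In the sketch below only the 256 x 500MB SRAM figure comes from the keynote; the MoE shape and token rate are my own assumptions:

```python
# Putting rough numbers on the two objections. Only the 256 x 500MB
# SRAM figure is from the keynote; every other number is assumed.
lpus, sram_mb = 256, 500
total_sram_gb = lpus * sram_mb / 1000          # 128 GB aggregate SRAM

# Objection 1: capacity. A 1T-parameter MoE at FP8 needs ~1000 GB of
# weights, roughly 8x the rack's aggregate SRAM.
moe_weight_gb = 1000
print(f"capacity shortfall: {moe_weight_gb / total_sram_gb:.1f}x")

# Objection 2: cross-rack dispatch/combine traffic (assumed shapes).
hidden, topk, tokens_per_s = 7168, 8, 1e6      # hypothetical MoE + load
bytes_per_token = hidden * 1 * topk * 2        # FP8, dispatch + combine
gbps = bytes_per_token * tokens_per_s * 8 / 1e9
print(f"EP traffic: {gbps:.0f} Gbps sustained")
# ~917 Gbps already exceeds the single 800Gbps BF4 port on the LPX
# ComputeTray, before any protocol overhead or burstiness.
```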
The structure of the Groq 3 LPX ComputeTray is shown below. It evidently still uses the original LPU C2C interface rather than NVLink, and there are no corresponding switch chips; a transition to NVLink may come later.
Finally, the full family portrait was displayed again, revealing that Rubin has been powered on and delivered to Microsoft for testing.
Jensen once again emphasized the importance of storage: a transition from humans consuming data through cuDF/cuVS toward AI consuming storage directly, including new KVCache requirements. AI's demands on processing speed for these tasks will be higher, so the demand on storage will be correspondingly stronger.
Next, some Roadmaps were discussed.
In the Rubin generation, NV will soon launch the Groq 3.5 (LP35) supporting NVFP4 alongside Rubin Ultra. The CX9 marketing is still misleading: it is clearly an 800Gbps ASIC, yet it is listed as 1.6Tbps. A relatively big change is that Jensen remains obsessed with his NVL576: on the Oberon chassis, 8 racks will be connected in parallel. This requires NVLink to support optical interconnects, though. How is reliability handled, and how is the drop in overall MTBF managed once the failure domain grows? There are plenty of engineering challenges here. As for ETH256 interconnect support, it is again emphasized that this is just CableTray connecting Vera CPUs over standard 800Gbps Ethernet, not the ETH-ScaleUP commonly seen in China.
Similarly, this generation of Kyber ScaleUP will also support 8-rack parallel interconnects. It will be interesting to see how they solve the optical reliability issues. Could the pressure from Huawei UB's multi-thousand-card ScaleUP be getting through to Jensen?
As for the Feynman generation, it clearly adopts 3D stacking, but not stacking Groq LPUs; it is more about stacking customized HBM. In that generation, the LPU will switch from LPU C2C to NVLink. Worth mentioning is that this generation will fully support CPO optical interconnects for both ScaleUP and ScaleOut, and CX10 and BF5 are scheduled for 2028.
The judgment on CPO is basically consistent with my previous detailed analysis; for specifics, refer to "Discussing Some Issues with Optical Interconnects".
Finally, addressing the problem of insufficient power on Earth, it was mentioned that they are researching radiation-hardened Vera Rubin architectures for space.
4. Agentic Computing
Next, Jensen started talking about lobsters, opening with lobster farming.
Then he began introducing Agentic Computing, saying it brings changes as massive as Linux, HTTP, and HTML. This point was also made by Xiantao (Head of JVS Claw, President of Alibaba Cloud Terminal Intelligent Computing Division). As a technical veteran with over twenty years of Linux kernel experience, his judgment on OpenClaw has been very accurate: ever since OpenClaw's release he has focused on the safe execution and native interactive experience of lobsters. JVS Claw was released recently: "Want to raise lobsters safely and simply? Choose JVS Claw". We noticed that Jensen's judgment, and the whole line of thinking behind NemoClaw, is basically consistent with JVS Claw: both emphasize safety and ease of deployment, building the whole ecosystem around Agents.
Jensen then declared the transformation of the entire enterprise IT from SaaS to Agent-as-a-Service. Is this a death sentence for SaaS?
He then introduced some of Nvidia's open-source models and corresponding partners.
5. Robotics & Physical AI
On the autonomous driving side, manufacturers such as BYD, Geely, Hyundai, and Nissan have joined the RoboTaxi effort in cooperation with Uber. On the robotics side, there are vendors like KUKA, FANUC, and ABB, plus many other robot and drone platforms, along with the full supporting software and hardware stack, including simulation and emulation. It was once again emphasized that in the hardware stack, GB300 handles training, RTX6000 handles simulation, and Thor handles on-device execution.
The final Easter egg is the closing music video. The song is well written and the lyrics are quite clever; it is worth a listen.
References
[1] GTC 2026 Keynote: https://www.youtube.com/watch?v=jw_o0xr8MWU&t=4438s