NVIDIA has extended its Spectrum-X Ethernet platform with Multipath Reliable Connection (MRC), an RDMA transport protocol that distributes a single connection's traffic across multiple network paths simultaneously. MRC directly addresses the synchronization demands of training frontier AI models across hundreds of thousands of GPUs.
MRC replaces the single-path RDMA model with dynamic, hardware-accelerated multipath routing. Traffic is load-balanced across all available paths in real time. When congestion appears, the protocol reroutes around it without human intervention. When data loss occurs, intelligent retransmission targets only the affected flow, limiting the blast radius of short-lived interruptions on long-running training jobs. The failure bypass mechanism operates entirely in hardware: path failures are detected and traffic is rerouted in microseconds, keeping the full GPU collective synchronized without falling back to software recovery paths.
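The behavior is easiest to see as a toy model. The sketch below is illustrative only, not NVIDIA's MRC implementation: the path count, round-robin spraying policy, and failure handling are assumptions chosen to show how one connection's packets can be sprayed across several paths, and why only the packets in flight on a failed path need to be resent.

```python
# Illustrative sketch only: a toy model of multipath spraying with selective
# retransmission. It is not NVIDIA's MRC implementation; the path count, the
# round-robin policy, and the failure handling are assumptions for demonstration.

from collections import defaultdict


class ToyMultipathConnection:
    """One logical RDMA-style connection whose packets are sprayed across paths."""

    def __init__(self, num_paths: int):
        self.healthy = set(range(num_paths))   # paths currently usable
        self.inflight = defaultdict(list)      # path id -> in-flight packet sequence numbers

    def send(self, seq: int) -> int:
        """Pick a healthy path for this packet (round-robin over healthy paths)."""
        path = sorted(self.healthy)[seq % len(self.healthy)]
        self.inflight[path].append(seq)
        return path

    def mark_path_failed(self, path: int) -> list[int]:
        """Bypass a failed path; resend only its in-flight packets on the remaining paths."""
        self.healthy.discard(path)
        lost = self.inflight.pop(path, [])
        # Only the affected packets are retransmitted; traffic already carried
        # on the other paths is untouched.
        return [self.send(seq) for seq in lost]


if __name__ == "__main__":
    conn = ToyMultipathConnection(num_paths=4)
    for seq in range(16):
        conn.send(seq)
    print("in flight per path:", dict(conn.inflight))
    # Simulate a path failure: only its packets are rerouted and resent.
    rerouted = conn.mark_path_failed(2)
    print("rerouted packets went to paths:", rerouted)
```

The point of the toy model is the bookkeeping: per-path tracking of in-flight data is what lets recovery touch only the affected flow, and MRC's claim is that this bookkeeping lives in the NIC and switch hardware rather than in host software, which is what keeps reroutes in the microsecond range.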
MRC was developed in production on NVIDIA's Blackwell-generation Spectrum-X hardware before being released as an open specification through the Open Compute Project. The development was a joint effort: AMD, Broadcom, Intel, Microsoft, and OpenAI all contributed. OpenAI and Microsoft are already running MRC at gigascale. Microsoft's Fairwater data center and Oracle Cloud Infrastructure's Abilene facility — two of the largest AI factories built for frontier LLM training and inference — both rely on MRC to meet their performance, scale, and efficiency requirements.
"Deploying MRC in the Blackwell generation was very successful and was made possible by a strong collaboration with NVIDIA," said Sachin Katti, head of industrial compute at OpenAI. "MRC's end-to-end approach enabled us to avoid much of the typical network-related slowdowns and interruptions and maintain the efficiency of frontier training runs at scale."
For enterprise AI infrastructure teams, the architectural implication is clear: generic data center Ethernet fabrics are no longer adequate for large-scale GPU training. The differentiator in Spectrum-X is not just raw bandwidth but the co-design of transport protocol, switch silicon, and fabric telemetry. Spectrum-X's multiplanar network support — multiple independent switch fabrics providing alternate GPU-to-GPU paths — pairs with MRC's hardware load balancing to maintain predictably low latency while scaling out. That combination is where commodity Ethernet diverges from AI-native fabric: the latter treats congestion control and fault recovery as hardware concerns, not software ones.
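As a rough illustration of what plane-aware load balancing means at the fabric level, the sketch below picks the least-congested plane from a telemetry snapshot. It is a hypothetical toy, not Spectrum-X's control logic: the plane names, telemetry fields, and selection rule are assumptions made for the example.

```python
# Illustrative sketch only: choosing among independent fabric planes using
# congestion telemetry. Plane names, telemetry fields, and the selection rule
# are assumptions for demonstration, not the Spectrum-X control logic.

from dataclasses import dataclass


@dataclass
class PlaneTelemetry:
    plane: str
    queue_depth: float   # normalized switch queue occupancy (0.0 = idle)
    link_up: bool


def pick_plane(telemetry: list[PlaneTelemetry]) -> str:
    """Send the next flow over the least-congested plane that is still up."""
    candidates = [t for t in telemetry if t.link_up]
    if not candidates:
        raise RuntimeError("no usable fabric plane")
    return min(candidates, key=lambda t: t.queue_depth).plane


if __name__ == "__main__":
    snapshot = [
        PlaneTelemetry("plane-0", queue_depth=0.72, link_up=True),
        PlaneTelemetry("plane-1", queue_depth=0.15, link_up=True),
        PlaneTelemetry("plane-2", queue_depth=0.05, link_up=False),  # failed plane
    ]
    print(pick_plane(snapshot))  # -> "plane-1"
```

The design point the article makes is that this kind of decision happens continuously, per flow, inside the fabric itself rather than in a software controller reacting after the fact.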
Enterprises evaluating AI cluster builds now face a sharper choice: invest in Spectrum-X-class infrastructure optimized for RDMA at scale, or accept the throughput degradation and operational complexity that come with tuning generic Ethernet for collective communication workloads. For organizations already on InfiniBand, the OCP publication of MRC as an open specification signals that Ethernet is converging on the resilience properties that previously made InfiniBand the default for tightly coupled training jobs.
MRC is an open specification, but production validation has been exclusively on NVIDIA ConnectX SuperNICs and Spectrum-X switches. Whether AMD or Broadcom NICs implement MRC with comparable performance characteristics in heterogeneous clusters is unresolved. Spectrum-X Ethernet also supports its own Adaptive RDMA protocol alongside MRC, and NVIDIA has not published a direct performance comparison between the two under production workloads.
As AI factories scale toward million-GPU configurations, the network's role shifts from passive plumbing to active performance arbiter. NVIDIA's bet is that customers will pay for fabric intelligence. Deployments at OpenAI, Microsoft, and Oracle suggest that, for frontier training, they already are.