NVIDIA has extended its Spectrum-X Ethernet platform with Multipath Reliable Connection (MRC), an RDMA transport protocol that distributes a single connection's traffic across multiple network paths simultaneously. MRC directly addresses the synchronization demands of training frontier AI models across hundreds of thousands of GPUs.
MRC replaces the single-path RDMA model with dynamic, hardware-accelerated multipath routing. Traffic is load-balanced across all available paths in real time. When congestion appears, the protocol reroutes around it without human intervention. When data loss occurs, intelligent retransmission targets only the affected flow, limiting the blast radius of short-lived interruptions on long-running training jobs. The failure bypass mechanism operates entirely in hardware: path failures are detected and traffic is rerouted in microseconds, keeping the full GPU collective synchronized without falling back to software recovery paths.
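The behavior is easiest to see as a toy model. The sketch below is illustrative only, not NVIDIA's MRC implementation: the path count, round-robin spraying policy, and failure handling are assumptions chosen to show how one connection's packets can be sprayed across several paths, and why only the packets in flight on a failed path need to be resent.

```python
# Illustrative sketch only: a toy model of multipath spraying with selective
# retransmission. It is not NVIDIA's MRC implementation; the path count, the
# round-robin policy, and the failure handling are assumptions for demonstration.

from collections import defaultdict


class ToyMultipathConnection:
    """One logical RDMA-style connection whose packets are sprayed across paths."""

    def __init__(self, num_paths: int):
        self.healthy = set(range(num_paths))   # paths currently usable
        self.inflight = defaultdict(list)      # path id -> in-flight packet sequence numbers

    def send(self, seq: int) -> int:
        """Pick a healthy path for this packet (round-robin over healthy paths)."""
        path = sorted(self.healthy)[seq % len(self.healthy)]
        self.inflight[path].append(seq)
        return path

    def mark_path_failed(self, path: int) -> list[int]:
        """Bypass a failed path; resend only its in-flight packets on the remaining paths."""
        self.healthy.discard(path)
        lost = self.inflight.pop(path, [])
        # Only the affected packets are retransmitted; traffic already carried
        # on the other paths is untouched.
        return [self.send(seq) for seq in lost]


if __name__ == "__main__":
    conn = ToyMultipathConnection(num_paths=4)
    for seq in range(16):
        conn.send(seq)
    print("in flight per path:", dict(conn.inflight))
    # Simulate a path failure: only its packets are rerouted and resent.
    rerouted = conn.mark_path_failed(2)
    print("rerouted packets went to paths:", rerouted)
```

The point of the toy model is the bookkeeping: per-path tracking of in-flight data is what lets recovery touch only the affected flow, and MRC's claim is that this bookkeeping lives in the NIC and switch hardware rather than in host software, which is what keeps reroutes in the microsecond range.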
MRC was developed in production on NVIDIA's Blackwell-generation Spectrum-X hardware before being released as an open specification through the Open Compute Project. The development was a joint effort: AMD, Broadcom, Intel, Microsoft, and OpenAI all contributed. OpenAI and Microsoft are already running MRC at gigascale. Microsoft's Fairwater data center and Oracle Cloud Infrastructure's Abilene facility — two of the largest AI factories built for frontier LLM training and inference — both rely on MRC to meet their performance, scale, and efficiency requirements.
"Deploying MRC in the Blackwell generation was very successful and was made possible by a strong collaboration with NVIDIA," said Sachin Katti, head of industrial compute at OpenAI. "MRC's end-to-end approach enabled us to avoid much of the typical network-related slowdowns and interruptions and maintain the efficiency of frontier training runs at scale."
For enterprise AI infrastructure teams, the architectural implication is clear: generic data center Ethernet fabrics are no longer adequate for large-scale GPU training. The differentiator in Spectrum-X is not just raw bandwidth but the co-design of transport protocol, switch silicon, and fabric telemetry. Spectrum-X's multiplanar network support — multiple independent switch fabrics providing alternate GPU-to-GPU paths — pairs with MRC's hardware load balancing to maintain predictably low latency while scaling out. That combination is where commodity Ethernet diverges from AI-native fabric: the latter treats congestion control and fault recovery as hardware concerns, not software ones.
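As a rough illustration of what plane-aware load balancing means at the fabric level, the sketch below picks the least-congested plane from a telemetry snapshot. It is a hypothetical toy, not Spectrum-X's control logic: the plane names, telemetry fields, and selection rule are assumptions made for the example.

```python
# Illustrative sketch only: choosing among independent fabric planes using
# congestion telemetry. Plane names, telemetry fields, and the selection rule
# are assumptions for demonstration, not the Spectrum-X control logic.

from dataclasses import dataclass


@dataclass
class PlaneTelemetry:
    plane: str
    queue_depth: float   # normalized switch queue occupancy (0.0 = idle)
    link_up: bool


def pick_plane(telemetry: list[PlaneTelemetry]) -> str:
    """Send the next flow over the least-congested plane that is still up."""
    candidates = [t for t in telemetry if t.link_up]
    if not candidates:
        raise RuntimeError("no usable fabric plane")
    return min(candidates, key=lambda t: t.queue_depth).plane


if __name__ == "__main__":
    snapshot = [
        PlaneTelemetry("plane-0", queue_depth=0.72, link_up=True),
        PlaneTelemetry("plane-1", queue_depth=0.15, link_up=True),
        PlaneTelemetry("plane-2", queue_depth=0.05, link_up=False),  # failed plane
    ]
    print(pick_plane(snapshot))  # -> "plane-1"
```

The design point the article makes is that this kind of decision happens continuously, per flow, inside the fabric itself rather than in a software controller reacting after the fact.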
Enterprises evaluating AI cluster builds now face a sharper choice: invest in Spectrum-X-class infrastructure optimized for RDMA at scale, or accept the throughput degradation and operational complexity that come with tuning generic Ethernet for collective communication workloads. For organizations already on InfiniBand, the OCP publication of MRC as an open specification signals that Ethernet is converging on the resilience properties that previously made InfiniBand the default for tightly coupled training jobs.
MRC is an open specification, but production validation has been exclusively on NVIDIA ConnectX SuperNICs and Spectrum-X switches. Whether AMD or Broadcom NICs implement MRC with comparable performance characteristics in heterogeneous clusters is unresolved. Spectrum-X Ethernet also supports its own Adaptive RDMA protocol alongside MRC, and NVIDIA has not published a direct performance comparison between the two under production workloads.
As AI factories scale toward million-GPU configurations, the network's role shifts from passive plumbing to active performance arbiter. NVIDIA's bet is that customers will pay for fabric intelligence. Deployments at OpenAI, Microsoft, and Oracle suggest that, for frontier training, they already are.