Training a frontier AI model in 2026 requires tens of thousands of GPUs working in tight synchronization for months. Yet the factor that most often limits training speed is not GPU compute but the network that connects those GPUs. When a single busy link or failed switch can halt an entire training job, leaving thousands of expensive accelerators idle, the network becomes the defining bottleneck of AI infrastructure.
OpenAI's response is MRC (Multipath Reliable Connection), a new transport protocol developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA, and released as an open specification through the Open Compute Project. MRC is already deployed in production at OpenAI and Microsoft data centers, where it has been used to train the latest frontier models. This article examines why traditional networking protocols fail at AI scale, how MRC's architecture addresses those failures, and what its release as an open specification signals about the future of AI infrastructure.
The Bottleneck Shift: Why Networks Break at AI Scale
The standard narrative around AI training bottlenecks focuses on GPU scarcity. But inside the data centers where frontier models are trained, the constraint has shifted. When clusters scale beyond 100,000 GPUs, the network fabric becomes the limiting factor on overall throughput.
How Traditional Protocols Fail
Most large-scale AI training today runs over RoCEv2 (RDMA over Converged Ethernet), which extends InfiniBand's reliable connection semantics to Ethernet. RoCEv2 works well for traditional data center workloads, but it has three fundamental limitations when applied to synchronous AI training at scale:
Single-path constraint. RoCEv2 establishes one path per connection. Even when the physical network offers multiple paths between two GPUs, RoCEv2 traffic follows a single route determined by ECMP hashing. This leaves available bandwidth unused and creates congestion hotspots when multiple flows collide on the same path.
Poor failure handling. When a link fails or a switch reboots, RoCEv2 connections must be re-established, a process that can take milliseconds to seconds. In synchronous training, where all GPUs must complete each iteration together, even brief disruptions force the entire job to pause. A single failed switch can cascade into a cluster-wide stall.
Inefficient recovery. RoCEv2 uses go-back-N retransmission: when one packet is lost, the sender retransmits that packet and all subsequent packets in the current window. This creates unnecessary traffic that amplifies congestion rather than relieving it.
These limitations compound as clusters grow. Dell'Oro Group notes that Ethernet has already overtaken InfiniBand as the dominant fabric for AI back-end networks, but standard Ethernet transports were not designed for the synchronization demands of trillion-parameter model training.
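The single-path constraint can be illustrated with a toy model of ECMP flow hashing. This is a simplified sketch, not RoCEv2 or switch code: the function name `ecmp_path`, the use of SHA-256 as the hash, the flow identifier strings, and the four-path fabric are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

def ecmp_path(flow_id: str, num_paths: int) -> int:
    """Map a flow to one path by hashing its identifier (toy ECMP model)."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

NUM_PATHS = 4
# Hypothetical flow identifiers standing in for the usual packet 5-tuple.
flows = [f"gpu{src}->gpu{src + 8}" for src in range(8)]

# Every packet of a flow follows the same path, so load is counted per flow.
load = Counter(ecmp_path(f, NUM_PATHS) for f in flows)
for path in range(NUM_PATHS):
    print(f"path {path}: {load[path]} flows")
```

Because the mapping depends only on the flow identifier, nothing prevents several large flows from landing on one path while another path carries none, and a flow cannot move once it has been placed.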
The Economics of Idle GPUs
The financial stakes make this more than a hardware optimization problem. A 100,000-GPU cluster represents billions of dollars in capital expenditure. When network failures or congestion leave those GPUs idle, the cost is measured in millions of dollars per day of lost training progress. The bottleneck is not the GPUs themselves but the infrastructure connecting them, a pattern that recurs wherever individual components outpace the networks that integrate them.
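A back-of-the-envelope calculation shows how a cluster of this size reaches those orders of magnitude. Every input below is an illustrative assumption rather than a figure from the cited sources: the per-GPU cost, the amortization period, and the helper name `idle_cost_per_day` are hypothetical.

```python
# All inputs are illustrative assumptions, not figures from the article's sources.
NUM_GPUS = 100_000
COST_PER_GPU_USD = 30_000      # assumed all-in capital cost per GPU
AMORTIZATION_YEARS = 4         # assumed useful life of the hardware

def idle_cost_per_day(num_gpus: int, cost_per_gpu: float, years: float) -> float:
    """Capital cost attributed to one day during which the whole cluster is idle."""
    total_capex = num_gpus * cost_per_gpu
    return total_capex / (years * 365)

capex = NUM_GPUS * COST_PER_GPU_USD
daily = idle_cost_per_day(NUM_GPUS, COST_PER_GPU_USD, AMORTIZATION_YEARS)
print(f"capex: ${capex / 1e9:.1f}B, idle cost: ${daily / 1e6:.2f}M per day")
# prints: capex: $3.0B, idle cost: $2.05M per day
```

Under these assumptions the amortized hardware cost alone is roughly two million dollars per idle day, before power, facilities, or the value of delayed training results are counted.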
MRC Architecture: Three Design Principles
MRC addresses these limitations through three architectural innovations that work together: adaptive packet spraying across multiple paths, failure recovery that completes in microseconds at the NIC, and selective retransmission paired with per-path congestion control.
Packet Spraying: Using All Available Paths
The core insight behind MRC is that a single RDMA connection should not be bound to a single network path. Instead, MRC distributes packets across all available paths simultaneously, adapting the allocation in real time based on path conditions.
This approach, called adaptive packet spraying, differs fundamentally from ECMP. Traditional ECMP assigns each flow to one path based on a hash of the flow identifier. If two large training flows hash to the same path, that path becomes congested while alternative paths remain underutilized. MRC avoids this by making path selection a per-packet decision, informed by continuous feedback about path health and load.
OpenAI reports that this load balancing is effective enough that it observes essentially no congestion in the network core. By spreading traffic evenly, MRC prevents the hot spots that create tail latency in synchronous training.
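The contrast with per-flow hashing can be sketched with a per-packet selector. This is a minimal illustration under stated assumptions, not MRC's actual path-selection algorithm: the `Path` class, the `pick_path` and `spray` helpers, and the use of a simple queued-packet counter as the load signal are all hypothetical stand-ins for the richer feedback a real NIC would use.

```python
from dataclasses import dataclass

@dataclass
class Path:
    path_id: int
    queued: int = 0        # packets assigned so far (stand-in for load feedback)
    healthy: bool = True   # stand-in for path-health feedback

def pick_path(paths: list[Path]) -> Path:
    """Per-packet decision: choose the least-loaded healthy path."""
    candidates = [p for p in paths if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy path available")
    return min(candidates, key=lambda p: p.queued)

def spray(num_packets: int, paths: list[Path]) -> None:
    """Assign each packet of one connection to a path independently."""
    for _ in range(num_packets):
        pick_path(paths).queued += 1

paths = [Path(i) for i in range(4)]
paths[2].healthy = False           # pretend feedback flagged path 2 as unhealthy
spray(90, paths)
print([p.queued for p in paths])   # [30, 30, 0, 30]
```

Because the decision is made for every packet, a single connection ends up using all healthy paths evenly, and an unhealthy path simply stops receiving traffic.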
Failure Recovery in Microseconds
MRC handles path failures without requiring connection teardown and re-establishment. The protocol detects failures within a few round-trip times (microseconds, not milliseconds) and reroutes traffic automatically. This happens at the NIC level, without involving the application or the network control plane.
The combination of fast detection and automatic rerouting means that many network failures that would previously have interrupted training now pass without impact. The training job continues while the network heals around the failure.
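One way to picture detection within a few round-trip times is a per-path acknowledgment timer. The sketch below is an assumption-laden illustration, not the mechanism defined in the MRC specification: the `PathMonitor` class, the threshold of `k = 3` round trips, and the 10-microsecond RTT are hypothetical values chosen to make the example concrete.

```python
class PathMonitor:
    """Marks a path as failed once no ACK has arrived for k round-trip times."""

    def __init__(self, path_ids, rtt_us: float, k: int = 3):
        self.timeout_us = k * rtt_us
        self.last_ack_us = {pid: 0.0 for pid in path_ids}
        self.failed = set()

    def on_ack(self, path_id: int, now_us: float) -> None:
        self.last_ack_us[path_id] = now_us
        self.failed.discard(path_id)   # a fresh ACK restores the path

    def check(self, now_us: float) -> None:
        for pid, last in self.last_ack_us.items():
            if now_us - last > self.timeout_us:
                self.failed.add(pid)

    def usable_paths(self) -> list:
        return [pid for pid in self.last_ack_us if pid not in self.failed]

monitor = PathMonitor(path_ids=[0, 1, 2], rtt_us=10.0, k=3)
# Paths 0 and 2 keep acknowledging; path 1 goes silent after t = 0.
for t in (10.0, 20.0, 30.0, 40.0):
    monitor.on_ack(0, t)
    monitor.on_ack(2, t)
    monitor.check(t)
print(monitor.usable_paths())  # [0, 2]
```

The connection itself is never torn down in this model. The silent path is removed from the set that the sender sprays over, and it rejoins as soon as an acknowledgment arrives on it again.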
Selective Recovery and Congestion Control
MRC replaces go-back-N retransmission with selective acknowledgment (SACK) and negative acknowledgment (NACK). When a packet is lost, only that packet is retransmitted, not the entire window. This reduces recovery traffic and prevents the congestion amplification that plagues RoCEv2.
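The difference in recovery traffic is easy to quantify with a small model. The helper names and the 16-packet window below are assumptions for illustration; the two functions model the general go-back-N and selective-retransmission techniques, not the wire behavior of RoCEv2 or MRC.

```python
def go_back_n_retransmits(window: list[int], lost: set[int]) -> list[int]:
    """Resend the first lost packet and everything sent after it in the window."""
    for i, seq in enumerate(window):
        if seq in lost:
            return window[i:]
    return []

def selective_retransmits(window: list[int], lost: set[int]) -> list[int]:
    """Resend only the packets reported missing."""
    return [seq for seq in window if seq in lost]

window = list(range(100, 116))   # 16 in-flight packets, sequence numbers 100..115
lost = {103}

print(len(go_back_n_retransmits(window, lost)))   # 13
print(len(selective_retransmits(window, lost)))   # 1
```

With a single loss early in the window, go-back-N resends 13 packets where selective recovery resends one, and the gap widens as windows grow to keep high-bandwidth links full.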
For congestion control, MRC implements Network Signal-based Congestion Control (NSCC), based on the Ultra Ethernet Consortium specification. NSCC operates per path: senders adjust their rates according to the conditions on each path instead of reacting to a single network-wide congestion signal. It also incorporates RTT-aware window control, which adapts the sending window to measured round-trip times.
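The idea of per-path, RTT-aware window control can be sketched with a generic additive-increase, multiplicative-decrease rule. This is not the NSCC algorithm from the Ultra Ethernet specification: the `PathWindow` class, the target RTT, the ECN flag, and the increase and decrease constants are all assumptions chosen to show the shape of the technique.

```python
class PathWindow:
    """Per-path congestion window adjusted from measured RTT (illustrative AIMD rule)."""

    def __init__(self, target_rtt_us: float, cwnd: float = 16.0,
                 min_cwnd: float = 1.0, max_cwnd: float = 256.0):
        self.target_rtt_us = target_rtt_us
        self.cwnd = cwnd
        self.min_cwnd = min_cwnd
        self.max_cwnd = max_cwnd

    def on_ack(self, rtt_us: float, ecn_marked: bool = False) -> float:
        if ecn_marked or rtt_us > self.target_rtt_us:
            # Queueing is building on this path: back off multiplicatively.
            self.cwnd = max(self.min_cwnd, self.cwnd * 0.5)
        else:
            # The path has headroom: grow additively.
            self.cwnd = min(self.max_cwnd, self.cwnd + 1.0)
        return self.cwnd

# One window per path, so a congested path does not throttle the others.
windows = {path_id: PathWindow(target_rtt_us=12.0) for path_id in range(2)}
for rtt in (8.0, 9.0, 10.0):
    windows[0].on_ack(rtt)      # path 0 stays under the target RTT
windows[1].on_ack(30.0)         # path 1 reports an inflated RTT
print(windows[0].cwnd, windows[1].cwnd)  # 19.0 8.0
```

Keeping independent state per path means that congestion on one route reduces the traffic sent over that route only, which complements per-packet spraying across the remaining paths.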
From Proprietary to Open: The OCP Release
MRC was developed collaboratively by OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA, and released as an open specification through the Open Compute Project in May 2026. This choice of release mechanism carries strategic significance.
Why Open Standards Matter at AI Scale
AI infrastructure has become too large and complex for closed, vertically integrated systems to scale efficiently. A 100,000 GPU cluster involves multiple vendors: GPUs from NVIDIA or AMD, NICs from various suppliers, switches from multiple manufacturers, and software stacks from cloud providers and AI labs. When the networking protocol is proprietary, each vendor must negotiate interoperability, creating friction that slows deployment and limits optimization.
By releasing MRC through OCP, OpenAI and its partners are establishing a common foundation that any vendor can implement. This aligns incentives: NIC manufacturers can optimize for MRC, switch vendors can verify compatibility, and AI labs can deploy multi-vendor clusters without protocol-level integration work.
Production Validation
MRC is not a research prototype. It has been deployed in production at OpenAI's largest training clusters and at Microsoft's Fairwater supercomputer. NVIDIA reports that MRC is also used at Oracle Cloud Infrastructure's Abilene data center. These deployments span multiple hardware generations, including NVIDIA's Blackwell platform, and have been used to train production frontier models.
This production history matters for adoption. Data center operators are conservative about new networking protocols because failures are expensive. MRC's deployment at scale provides the validation that reduces adoption risk for other organizations.
The Industry Coalition
The breadth of the MRC coalition is notable. It includes GPU vendors (NVIDIA, AMD), NIC vendors (Broadcom, Intel), cloud providers (Microsoft), and AI labs (OpenAI). This is not a single vendor pushing a proprietary protocol, but an industry-wide agreement on a common transport layer.
Dell'Oro Group's analysis frames this as a signal that networking is becoming as strategically important as compute in AI infrastructure. The shift from InfiniBand to Ethernet for AI back-end networks, already underway, is reinforced by a protocol that makes Ethernet viable for the largest training clusters.
Implications for AI Infrastructure
MRC's release has implications that extend beyond the technical details of packet transport.
Infrastructure Becomes the Differentiator
When GPU compute is available from multiple vendors and cloud providers, the efficiency of the infrastructure connecting those GPUs becomes a competitive advantage. A cluster with MRC can sustain higher GPU utilization, recover faster from failures, and scale to larger node counts than a cluster with traditional RoCEv2. This shifts the basis of competition from raw compute capacity to infrastructure efficiency.
The Verification Challenge
Publishing a networking protocol as an open specification creates a verification challenge. Unlike application software, where much of correctness can be checked with unit tests, network protocols must be validated under real-world conditions: mixed traffic patterns, hardware failures, and varying load. MRC's production deployment at OpenAI and Microsoft provides one form of validation, but broader adoption will require additional testing across diverse hardware and topology configurations.
The OCP specification and reference implementations provide the foundation for this verification. As more vendors implement MRC, the protocol will be tested in configurations beyond those used by its original developers, creating the feedback loop that improves reliability.
Ethernet's Expanding Role
MRC reinforces a trend already visible in the market: Ethernet's growing role in AI back-end networks. InfiniBand, historically dominant in high-performance computing, requires specialized hardware and expertise. Ethernet offers a broader supplier base, lower costs, and operational familiarity. MRC addresses the performance and reliability gaps that previously made Ethernet less attractive for synchronous training.
According to Dell'Oro Group, Ethernet already accounted for the majority of AI back-end network revenue in 2025. MRC's release strengthens this trajectory by providing a standard protocol that makes Ethernet viable for the largest training clusters.
FAQ
What is MRC and why was it created?
MRC (Multipath Reliable Connection) is a transport protocol that extends RDMA semantics to distribute traffic across multiple network paths. It was created because traditional protocols like RoCEv2 cannot efficiently utilize available network capacity or recover quickly from failures in clusters with 100,000+ GPUs.
How does MRC differ from TCP or standard RDMA?
TCP keeps each connection on a single path, and its loss recovery relies on duplicate acknowledgments and retransmission timeouts, which is too slow for synchronous training. Standard RDMA (RoCEv2) also uses a single path per connection and go-back-N recovery. MRC sprays packets across multiple paths, retransmits only the packets that were lost, and handles failures at the NIC level in microseconds.
Is MRC specific to OpenAI's infrastructure?
No. MRC was released as an open specification through the Open Compute Project and has been implemented by multiple vendors including AMD, NVIDIA, and Broadcom. It is already deployed at Microsoft and Oracle Cloud Infrastructure in addition to OpenAI.
Does MRC replace InfiniBand?
MRC extends RoCEv2, which runs over Ethernet. It does not replace InfiniBand directly, but it makes Ethernet more competitive for AI training workloads. The market trend shows Ethernet gaining share in AI back-end networks, and MRC accelerates this by addressing Ethernet's historical limitations for synchronous training.
What hardware is required for MRC?
MRC requires NICs that support the protocol. AMD has implemented MRC on its Pensando Pollara 400 and Vulcano 800 AI NICs. NVIDIA supports MRC on its Spectrum-X Ethernet platform with ConnectX SuperNICs. Broadcom has also announced support on its Thor Ultra NIC.
References
- OpenAI. "Supercomputer networking to accelerate large scale AI training." May 2026. https://openai.com/index/mrc-supercomputer-networking/
- Araujo, J., et al. "Resilient AI Supercomputer Networking using MRC and SRv6." arXiv:2605.04333, May 2026. https://arxiv.org/abs/2605.04333
- Open Compute Project. "Multipath Reliable Connection (MRC) Specification." https://www.opencompute.org/documents/ocp-mrc-1-0-pdf
- NVIDIA. "NVIDIA Spectrum-X Sets the Standard for Gigascale AI, Now With MRC." May 2026. https://blogs.nvidia.com/blog/spectrum-x-ethernet-mrc/
- AMD. "Next Gen Networking Transport for Large Scale AI Training." May 2026. https://www.amd.com/en/blogs/2026/next-gen-networking-transport-for-large-scale-ai-training.html
- AMD. "AMD and OpenAI Advance AI Networking at Scale with MRC." May 2026. https://www.amd.com/en/blogs/2026/amd-advances-ai-networking-at-scale-with-mrc.html
- Broadcom. "Enabling AI Networking @ Scale with Multi-path Reliable Connections (MRC)." May 2026. https://www.broadcom.com/blog/enabling-ai-networking-scale-with-multi-path-reliable-connections-mrc
- Dell'Oro Group. "OpenAI's MRC Initiative Reinforces Ethernet's Expanding Role in AI Back-end Networks." May 2026. https://www.delloro.com/openais-mrc-initiative-reinforces-ethernets-expanding-role-in-ai-back-end-networks/
- Futurum Group. "Can OpenAI's MRC Networking Protocol Redefine the Economics of AI Training?" May 2026. https://futurumgroup.com/insights/can-openais-mrc-networking-protocol-redefine-the-economics-of-ai-training/
- 4sysops. "Multipath Reliable Connection (MRC): a new, open networking protocol for AI supercomputers." May 2026. https://4sysops.com/archives/multipath-reliable-connection-mrc-a-new-open-networking-protocol-for-ai-supercomputers/