"How OpenAI Delivers Low-Latency Voice AI at Scale: Inside the WebRTC Rearchitecture Serving 900 Million Users"

On May 4, 2026, OpenAI published an engineering post describing how they rearchitected the WebRTC infrastructure powering every ChatGPT voice session and Realtime API endpoint. The post was written by Yi Zhang and William McDonald, Members of Technical Staff, and it details a specific infrastructure problem that emerges when WebRTC meets Kubernetes at global scale.

The problem is straightforward: standard WebRTC allocates one UDP port per session. At OpenAI's scale, serving over 900 million weekly active users, that means exposing tens of thousands of public UDP ports. Cloud load balancers are not designed for this. Kubernetes autoscaling, where pods are constantly added, removed, and rescheduled, makes stable port ranges impractical. And large UDP port ranges are a security auditor's nightmare.

The solution is a split architecture called "relay plus transceiver" that separates stateless packet routing from stateful protocol termination. This article covers the architecture in detail, explains why the design choices were made, and extracts the practical lessons for any team building real-time voice AI systems.

The Core Problem: WebRTC Meets Kubernetes

WebRTC is the right protocol for real-time voice AI. It provides NAT traversal, jitter buffers, congestion control, and all the mechanisms needed to handle the inherent messiness of internet audio delivery. OpenAI built their initial implementation using Pion, a mature open-source WebRTC library written in Go, and it worked. That single Go service currently powers ChatGPT voice and the Realtime API.

The problem is scaling it.

OpenAI's workload has three characteristics that define the infrastructure requirements:

Global reach. Users are everywhere. The system needs to serve voice sessions from any location with low latency.
Fast connection setup. Users expect to start speaking immediately. Every millisecond of connection setup time degrades the experience.
Stable, low-jitter media round-trip. Turn-taking in conversation requires sub-300ms end-to-end latency. Anything above this threshold makes interaction feel like a phone menu rather than a conversation.

These requirements are standard for real-time communication. What makes OpenAI's situation unique is the combination of WebRTC's protocol constraints with Kubernetes' operational model.

Why One-Port-Per-Session Fails at Scale

Standard WebRTC allocates one UDP port per session. This works fine for peer-to-peer calls or small-scale deployments. At OpenAI's traffic levels, it creates three compounding problems:

Load balancer complexity. Cloud load balancers and Kubernetes services are not designed around tens of thousands of public UDP ports per service. Each additional port range adds complexity in load balancer configuration, health checking, firewall policy, and rollout safety.

Security surface expansion. Large UDP port ranges expand the externally reachable surface area and make network policy harder to audit. For a company handling real-time audio from 900 million users, this is a significant security concern.

Autoscaling brittleness. Kubernetes constantly adds, removes, and reschedules pods. Requiring each pod to reserve and advertise a large stable port range makes elasticity impractical. The infrastructure cannot scale dynamically if every scaling event requires coordinated port range management.

The Single-Port Alternative and State Stickiness

Many WebRTC systems move toward a single UDP port per server with application-level demultiplexing behind it. This solves the port count problem but introduces a different issue: state stickiness.

Protocols like ICE (Interactive Connectivity Establishment) and DTLS (Datagram Transport Layer Security) are highly stateful. If packets for a session reach a server process that does not hold that session's state, the handshake fails or, for an established session, media breaks. The system needs a way to ensure that every packet for a given session reaches the specific process that owns that session, even when using a single shared port.
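
As an illustration of the demultiplexing step, the following minimal sketch classifies datagrams arriving on one shared UDP port by their first byte, using the ranges defined in RFC 7983. The type and function names are hypothetical and are not taken from OpenAI's code. Classification only identifies the protocol; it does not say which process owns the session, which is the stickiness problem described above.

```go
package main

import "fmt"

// packetKind labels the protocol families that share one UDP port.
type packetKind string

const (
	kindSTUN    packetKind = "stun"
	kindDTLS    packetKind = "dtls"
	kindRTP     packetKind = "rtp/rtcp"
	kindUnknown packetKind = "unknown"
)

// classify inspects the first byte of a datagram, using the ranges
// from RFC 7983: 0-3 STUN, 20-63 DTLS, 128-191 RTP/RTCP.
func classify(pkt []byte) packetKind {
	if len(pkt) == 0 {
		return kindUnknown
	}
	switch b := pkt[0]; {
	case b <= 3:
		return kindSTUN
	case b >= 20 && b <= 63:
		return kindDTLS
	case b >= 128 && b <= 191:
		return kindRTP
	default:
		return kindUnknown
	}
}

func main() {
	fmt.Println(classify([]byte{0x00, 0x01})) // STUN binding request type
	fmt.Println(classify([]byte{0x16, 0xfe})) // DTLS handshake record (content type 22)
	fmt.Println(classify([]byte{0x80, 0x6f})) // RTP, version 2
}
```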

The Architecture: Relay Plus Transceiver

OpenAI's solution splits the WebRTC stack into two distinct services with different scaling properties.

The Relay: Stateless Packet Routing

The relay is a lightweight UDP forwarding layer with a small public footprint. Its job is simple: receive incoming packets, determine which transceiver should handle them, and forward them along.

The relay does not decrypt media, negotiate codecs, or run complex state machines. It does just enough work on the first STUN packet to extract the ICE username fragment (ufrag), decode the routing metadata encoded there during session setup, and forward the packet to the transceiver that owns the session. Every subsequent packet (DTLS, RTP, or RTCP) flows through the established session entry without being re-parsed.

This design makes the relay horizontally scalable. Multiple relay instances run in parallel. Their state is ephemeral: small in-memory maps with short timeouts that track client-to-transceiver routing. If a relay restarts, disruption is minimal because there is no hard WebRTC state to lose. A Redis cache backs the routing table for rapid flow recovery.

The Transceiver: Stateful WebRTC Termination

The transceiver is the stateful WebRTC endpoint. It owns all protocol state: ICE, DTLS, SRTP, and the full session lifecycle. From the client's perspective, the transceiver behaves exactly like a standard WebRTC peer. The complexity of the relay layer is entirely hidden.

The transceiver terminates the client connection and converts media and events into simpler internal protocols for model inference, transcription, speech generation, tool use, and orchestration. This means inference services don't need to behave like WebRTC peers. They receive standard RPCs, not WebRTC media streams.

Built on Pion's Go implementation, the transceiver is a single service that handles both signaling negotiations and media termination while scaling like an ordinary Kubernetes workload. It connects to backend services for inference, transcription, and speech generation through internal protocols rather than through WebRTC.

The ICE Ufrag Routing Trick

The most elegant part of the architecture is how the relay determines where to send the very first packet for a new session. The traditional approach would be to query an external database or service registry, but that adds latency at the most critical moment in session setup.

OpenAI's solution uses a protocol-native field. During signaling, the transceiver encodes routing metadata directly into the server-side ICE username fragment (ufrag). When the client sends its first media-path packet, a STUN binding request, the relay parses the ufrag, decodes the routing hint, and forwards the packet to the correct transceiver cluster without any external lookup.

This is deterministic first-packet routing using information that is already present in the protocol. No database queries, no service discovery, no added latency. The routing metadata travels with the session setup and is available at the exact moment it is needed.
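
The source does not describe the actual encoding, so the following is only a sketch under stated assumptions: a hypothetical "r1" marker followed by a 16-bit cluster ID in four hex digits, then a random suffix. It relies on two facts from the ICE specification: a ufrag may contain only letters, digits, "+" and "/", and the STUN USERNAME of a client's connectivity check has the form "server ufrag:client ufrag".

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// routePrefix marks ufrags that carry a routing hint (hypothetical format).
const routePrefix = "r1"

// encodeUfrag builds a server-side ufrag of the form
// "r1" + 4 hex digits of cluster ID + random suffix.
// The caller must supply a suffix made of ICE-legal characters.
func encodeUfrag(clusterID uint16, random string) string {
	return fmt.Sprintf("%s%04x%s", routePrefix, clusterID, random)
}

// decodeClusterID recovers the cluster ID without any external lookup.
func decodeClusterID(ufrag string) (uint16, error) {
	if !strings.HasPrefix(ufrag, routePrefix) || len(ufrag) < len(routePrefix)+4 {
		return 0, errors.New("ufrag carries no routing hint")
	}
	hexID := ufrag[len(routePrefix) : len(routePrefix)+4]
	id, err := strconv.ParseUint(hexID, 16, 16)
	if err != nil {
		return 0, fmt.Errorf("bad routing hint: %w", err)
	}
	return uint16(id), nil
}

// routeFromUsername takes the STUN USERNAME of a client's binding request,
// which ICE formats as "<server ufrag>:<client ufrag>", and returns the hint.
func routeFromUsername(username string) (uint16, error) {
	serverUfrag, _, ok := strings.Cut(username, ":")
	if !ok {
		return 0, errors.New("USERNAME is not in ufrag:ufrag form")
	}
	return decodeClusterID(serverUfrag)
}

func main() {
	ufrag := encodeUfrag(42, "Yk3pQ9")
	cluster, err := routeFromUsername(ufrag + ":clientFrag")
	fmt.Println(ufrag, cluster, err)
}
```

A fixed-width hint avoids separator characters, which matters because the ufrag alphabet is small. A production format would presumably also need versioning and some protection against forged hints, neither of which is shown here.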

Global Relay: Geographic Ingress

Once the public UDP surface was reduced to a small number of stable addresses and ports, OpenAI deployed the relay pattern globally. Global Relay is a fleet of geographically distributed relay ingress points that all implement the same packet-forwarding behavior.

ICE candidate addresses observed since September 2025 show endpoints in Chicago, Virginia, and Austin, with the architecture designed to expand to additional regions as needed.

Signaling and Media Path

The signaling path uses Cloudflare geo and proximity steering. The initial HTTP or WebSocket request reaches a nearby transceiver cluster based on geographic proximity. The SDP (Session Description Protocol) answer then provides a Global Relay address close to that cluster.

The ufrag contains routing information for the designated cluster. So both the signaling path and the media path enter OpenAI's network at a geographically close point, while the session itself remains anchored to a single transceiver.
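
As a rough illustration of what the client receives, the sketch below renders the ICE-related lines of an SDP answer that advertises a single relay address. The priority formula is the one from RFC 8445. The ufrag, password, documentation-range IP address, port, and the choice to advertise the relay as a host candidate are all assumptions, not details from OpenAI's post.

```go
package main

import "fmt"

// candidatePriority applies the RFC 8445 formula:
// (2^24)*typePref + (2^8)*localPref + (256 - componentID).
func candidatePriority(typePref, localPref, componentID uint32) uint32 {
	return typePref<<24 + localPref<<8 + (256 - componentID)
}

// relayAnswerLines renders the ICE attributes of an SDP answer so that the
// only advertised candidate is a relay address near the chosen cluster.
func relayAnswerLines(ufrag, pwd, relayIP string, relayPort int) []string {
	// 126 is the recommended type preference for host candidates;
	// component 1 is RTP.
	prio := candidatePriority(126, 65535, 1)
	return []string{
		"a=ice-ufrag:" + ufrag,
		"a=ice-pwd:" + pwd,
		fmt.Sprintf("a=candidate:1 1 udp %d %s %d typ host", prio, relayIP, relayPort),
	}
}

func main() {
	// All values below are placeholders for illustration.
	lines := relayAnswerLines("r1002aYk3pQ9", "hypotheticalIcePassword0", "203.0.113.10", 3478)
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

The client treats the relay address as an ordinary candidate and sends its connectivity check there, so no custom client behavior is needed.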

This reduces round-trip time for both signaling and the first ICE connectivity check, directly shortening how long a user waits before speech can start. Packets enter OpenAI's network at a relay close to the user, in both geography and network topology, instead of crossing the public internet to a distant region first.

What This Means in Practice

Broad geographic ingress shortens the first client-to-OpenAI hop. In practical terms:

Lower latency. Packets travel less distance on the public internet before reaching OpenAI's backbone.
Less jitter. Fewer hops through unpredictable public internet paths mean more consistent timing.
Fewer loss bursts. Avoiding congested public internet paths reduces packet loss before traffic reaches the controlled backbone.

For voice AI, where sub-300ms end-to-end latency separates conversational from phone-menu experiences, every millisecond shaved before speech can start matters directly.

Why Not an SFU?

A natural question is why OpenAI chose a transceiver model rather than a Selective Forwarding Unit (SFU), the standard architecture for multi-party WebRTC applications like video conferencing.

The answer is workload shape. Most OpenAI sessions are 1:1, one user talking to one model, or one application talking to one real-time agent. SFUs are optimized for multi-party scenarios where the unit of work is a "room" with multiple participants. For point-to-point, latency-sensitive sessions, the SFU model adds overhead without corresponding benefit.

OpenAI's design confirms that an SFU-less architecture was the right default for their workload. Inference services don't need to behave like WebRTC peers, and the system is easier to scale when the media termination layer is independent of the inference layer.

The Relay Implementation

OpenAI wrote the relay service in Go, keeping it narrow and lightweight. The key properties:

No protocol termination. The relay parses only the STUN header and the ufrag. Subsequent packets are handled via cached state.
Ephemeral state. Small, short-timeout in-memory maps track client-to-transceiver routing. Redis provides backup for rapid flow recovery.
Horizontal scalability. Multiple relay instances run in parallel without coordination.
Minimal disruption on restart. Since there is no hard WebRTC state, relay restarts cause minimal session disruption.
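
To show how little parsing the first property implies, here is a minimal sketch that pulls the USERNAME attribute out of a STUN binding request using only the standard library. The message layout (20-byte header, magic cookie 0x2112A442, attributes padded to 4 bytes, USERNAME type 0x0006) comes from the STUN specification; the function names are hypothetical, and the sketch does not verify MESSAGE-INTEGRITY, which in this architecture would be the job of the transceiver that holds the ICE credentials.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	stunHeaderLen   = 20
	stunMagicCookie = 0x2112A442
	bindingRequest  = 0x0001
	attrUsername    = 0x0006
)

// stunUsername returns the USERNAME attribute of a STUN binding request.
func stunUsername(pkt []byte) (string, error) {
	if len(pkt) < stunHeaderLen {
		return "", errors.New("short packet")
	}
	if binary.BigEndian.Uint16(pkt[0:2]) != bindingRequest ||
		binary.BigEndian.Uint32(pkt[4:8]) != stunMagicCookie {
		return "", errors.New("not a STUN binding request")
	}
	bodyLen := int(binary.BigEndian.Uint16(pkt[2:4]))
	if stunHeaderLen+bodyLen > len(pkt) {
		return "", errors.New("truncated STUN message")
	}
	attrs := pkt[stunHeaderLen : stunHeaderLen+bodyLen]
	for len(attrs) >= 4 {
		typ := binary.BigEndian.Uint16(attrs[0:2])
		n := int(binary.BigEndian.Uint16(attrs[2:4]))
		if 4+n > len(attrs) {
			return "", errors.New("truncated attribute")
		}
		if typ == attrUsername {
			return string(attrs[4 : 4+n]), nil
		}
		padded := (n + 3) &^ 3 // attribute values are padded to 4 bytes
		if 4+padded > len(attrs) {
			break
		}
		attrs = attrs[4+padded:]
	}
	return "", errors.New("no USERNAME attribute")
}

// buildBindingRequest assembles a minimal request for the demo below.
func buildBindingRequest(username string) []byte {
	n := len(username)
	padded := (n + 3) &^ 3
	pkt := make([]byte, stunHeaderLen+4+padded)
	binary.BigEndian.PutUint16(pkt[0:2], bindingRequest)
	binary.BigEndian.PutUint16(pkt[2:4], uint16(4+padded))
	binary.BigEndian.PutUint32(pkt[4:8], stunMagicCookie)
	// Bytes 8..19 hold the transaction ID; left as zeros for brevity.
	binary.BigEndian.PutUint16(pkt[20:22], attrUsername)
	binary.BigEndian.PutUint16(pkt[22:24], uint16(n))
	copy(pkt[24:], username)
	return pkt
}

func main() {
	pkt := buildBindingRequest("r1002aYk3pQ9:clientFrag")
	fmt.Println(stunUsername(pkt))
}
```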

The broader lesson, as OpenAI frames it, is that the best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior. Encoding routing metadata into a protocol-native field gave them deterministic first-packet routing, a small public UDP footprint, and enough flexibility to place ingress close to users worldwide.

Practical Lessons for Voice AI Builders

Not every team operates at OpenAI's scale, but the architectural lessons apply broadly.

1. Define Your Latency Budgets Explicitly

OpenAI's infrastructure exists to protect specific latency targets: fast connection setup, sub-300ms end-to-end media round-trip, and low jitter. Before designing infrastructure, define these targets explicitly for your workload. The targets determine every downstream architectural decision.

The critical numbers for voice AI:

Time to first audio response after the user finishes speaking: ideally within 300 to 500 ms
Barge-in reaction time: under 200ms for the system to stop output when the user interrupts
Packet loss tolerance: define the threshold where audio quality degrades unacceptably

2. Separate Routing from Protocol State

The relay-plus-transceiver pattern is worth considering for any system that needs to terminate stateful protocols on dynamically scheduled infrastructure. The principle is general: keep the thin routing layer stateless and push all hard state into a single service that can be managed carefully.

This applies beyond WebRTC. Any protocol with sticky state that conflicts with Kubernetes-style scheduling can benefit from the same separation.

3. Use Protocol-Native Fields for Routing

The ICE ufrag trick is particularly clever because it avoids external dependencies at the critical moment. Rather than introducing a database lookup or service mesh call at session setup time, OpenAI encoded routing information in a field that was already being exchanged as part of normal protocol operation.

For any protocol design, look for opportunities to piggyback routing metadata on existing handshake fields rather than introducing separate coordination mechanisms.

4. VAD and Endpointing Are Product Decisions

While not covered in OpenAI's architecture post, the surrounding discussion from their engineering team highlights a point that many voice AI teams learn the hard way: voice activity detection (VAD) and endpointing, determining when the user has finished speaking, are not purely technical parameters. They are product decisions.

Aggressive endpointing reduces latency but can cut off users who pause mid-sentence. Conservative endpointing feels more natural but increases perceived latency. The right balance depends on your specific use case and user expectations.
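
The trade-off can be seen in a toy endpointing rule: declare the end of the utterance after a fixed number of consecutive silent frames. This is a deliberately simple energy-threshold sketch with hypothetical names and values; production systems generally use more sophisticated VAD models, and nothing here reflects OpenAI's implementation.

```go
package main

import "fmt"

// endpointFrame returns the index of the frame at which the end of the
// utterance is declared: the first frame that completes a run of
// `hangover` consecutive silent frames after speech has started.
// It returns -1 if no endpoint is found.
func endpointFrame(energy []float64, threshold float64, hangover int) int {
	started := false
	silent := 0
	for i, e := range energy {
		if e >= threshold {
			started = true
			silent = 0
			continue
		}
		if !started {
			continue // leading silence before any speech
		}
		silent++
		if silent >= hangover {
			return i
		}
	}
	return -1
}

func main() {
	// Per-frame energy: speech, a 3-frame mid-sentence pause,
	// more speech, then trailing silence.
	energy := []float64{0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1}
	fmt.Println(endpointFrame(energy, 0.5, 3)) // aggressive: prints 4
	fmt.Println(endpointFrame(energy, 0.5, 5)) // conservative: prints 11
}
```

With a hangover of 3 frames the rule fires inside the mid-sentence pause and cuts the user off. With 5 frames it waits for the real end of speech but adds two frames of delay to every turn.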

5. WebRTC Over WebSockets for Production Voice

OpenAI's infrastructure investment is strong evidence that WebRTC, not WebSockets, is the right protocol for production voice AI at scale. WebSockets work for prototyping and low-stakes applications, but they run over TCP, which introduces head-of-line blocking and retransmission delays that are unacceptable for real-time audio.

WebRTC's UDP-based transport with built-in NAT traversal, jitter buffers, and congestion control provides the reliability properties that voice AI requires. If you are building a voice product that needs to work consistently across network conditions, start with WebRTC.

Platform Comparison: Building Voice AI in 2026

For teams deciding how to build voice AI products today, there are three practical paths at different cost and control points.

Dimension	OpenAI Realtime API	LiveKit Agents	Build Your Own (Pion)
Wire protocol	WebRTC (or WebSocket fallback)	WebRTC (LiveKit SFU)	WebRTC (custom)
Hosting	OpenAI cloud only	Self-host or LiveKit Cloud	Your infrastructure
Model flexibility	OpenAI models only	Any model	Any model
Time to prototype	Hours	Days	Weeks to months
Multi-agent rooms	No	Yes	Custom build
Scale ceiling	OpenAI's infrastructure	Depends on your deployment	Unlimited (your problem)

The practical recommendation for most teams: use a managed service to ship the first version, learn what actually breaks at scale, then revisit the build-vs-buy decision. OpenAI's own writeup makes clear that the routing layer complexity only became necessary at their specific traffic levels.

FAQ

Why did OpenAI rebuild their WebRTC stack?

The traditional one-port-per-session WebRTC model conflicts with Kubernetes at scale. It requires exposing tens of thousands of public UDP ports, which creates load balancer complexity, security surface expansion, and autoscaling brittleness. The relay-plus-transceiver architecture solves this by reducing the public UDP footprint to a small number of stable addresses.

What is the ICE ufrag routing trick?

During WebRTC session setup, OpenAI's transceiver encodes routing metadata into the server-side ICE username fragment (ufrag). When the client sends its first STUN binding request, the relay parses the ufrag to determine which transceiver should handle the session. This enables deterministic first-packet routing without external database lookups.

Does this architecture apply to teams not operating at OpenAI's scale?

The specific relay-plus-transceiver split is most valuable when you need to run WebRTC on dynamically scheduled infrastructure like Kubernetes and have enough concurrent sessions to make port management painful. For smaller deployments, a single Go service using Pion, which is what OpenAI started with, works fine.

Why not use WebSockets instead of WebRTC for voice AI?

WebSockets run over TCP, which introduces head-of-line blocking and retransmission delays. For real-time audio, these delays cause audible artifacts and increased latency. WebRTC uses UDP with built-in NAT traversal, jitter buffers, and congestion control, making it the appropriate protocol for production voice AI.

What latency targets should voice AI systems aim for?

Sub-300ms end-to-end latency is the threshold for conversational interaction. Time to first audio response after the user finishes speaking should fall within 300 to 500 ms. Barge-in reaction time should be under 200ms. These targets are not arbitrary; they reflect the thresholds where users perceive the interaction as natural versus delayed.

References

OpenAI Engineering: How OpenAI delivers low-latency voice AI at scale (May 4, 2026)
Pion WebRTC: github.com/pion/webrtc (open-source Go WebRTC library)
OpenAI Developers: Updates for developers building with voice
