The Intelligence Age Infrastructure: Inside OpenAI's Stargate and Compute Scaling Strategy

The Infrastructure Buildout That Redefined Scale

In January 2025, OpenAI announced the Stargate project from the White House with a $500 billion headline figure. By September 2025, the plan had crystallized into something more concrete: six data center sites, nearly 7 gigawatts of planned power capacity, and over $400 billion in investment over three years. The numbers moved past press release territory into procurement contracts, land acquisitions, and construction schedules.

This is not one company building one data center. Stargate is an infrastructure program on the scale of the Interstate Highway System, measured in gigawatts and hundreds of thousands of GPUs. And OpenAI is not alone. xAI is building its own competing cluster in Memphis. Microsoft committed $80 billion in capital expenditure for Azure in fiscal year 2025 alone. Across the industry, the combined value of AI infrastructure deals involving OpenAI crossed $1 trillion by late 2025.

The stakes are straightforward. If compute is the primary bottleneck on the path to more capable AI systems, then the organizations that control the most compute will shape what those systems can do. Stargate is OpenAI's answer to that constraint. Here is what it looks like in practice.

What Stargate Is: Architecture and Partners

Stargate was announced on January 21, 2025 at the White House, framed as a joint venture between OpenAI, SoftBank, Oracle, and MGX (an Abu Dhabi-based technology investment firm). The initial commitment was $100 billion, with a stated goal of scaling to $500 billion over four years.

The partnership structure matters for understanding who builds what:

Partner	Role
OpenAI	Primary tenant and operator of compute
SoftBank	Financial backing and data center development
Oracle	Data center construction and cloud infrastructure
MGX	Strategic investment and capital allocation
Crusoe	Developer of the Abilene flagship site
CoreWeave	Additional GPU cloud infrastructure

Oracle serves as the primary builder for four of the six planned sites. SoftBank brings both capital and its own data center development capabilities. Crusoe, a cloud computing company focused on sustainable infrastructure, is building the flagship Abilene facility. CoreWeave provides supplementary GPU capacity outside the main Stargate footprint.

By September 2025, OpenAI announced an expansion that added five new sites to the original Abilene location. The company's statement was direct: "The combined capacity from these five new sites, along with our flagship site in Abilene, Texas, and ongoing projects with CoreWeave, brings Stargate to nearly 7 gigawatts of planned capacity and over $400 billion in investment over the next three years."

The Numbers: Six Sites, 7 Gigawatts, 450,000 GPUs

The Abilene flagship site anchors the entire program. Located on 875 acres in Abilene, Texas, it contains 8 buildings totaling approximately 4 million square feet. Larry Ellison confirmed the GPU count: over 450,000 Nvidia GB200 GPUs deployed across the facility. Oracle brands the cluster architecture behind the site as "Zettascale10."

Abilene Flagship Specifications

Metric	Value
Location	Abilene, Texas
Land area	875 acres
GPU count	450,000+ Nvidia GB200
Power capacity	1.2 GW
Equivalent homes powered	~1,000,000
Buildings	8
Total floor space	~4 million sq ft
Developer	Crusoe
Oracle designation	Zettascale10
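
The table's figures imply a per-GPU power budget that is easy to check. The sketch below uses only the numbers quoted above; the function names are illustrative, and the per-GPU result is an all-in facility figure (CPUs, networking, and cooling included), not the draw of the GPU chip itself.

```python
# Figures taken from the Abilene specification table.
SITE_POWER_W = 1.2e9        # 1.2 GW site capacity
GPU_COUNT = 450_000         # "450,000+" GB200 GPUs
HOMES_POWERED = 1_000_000   # "~1,000,000" equivalent homes


def watts_per_gpu(site_power_w: float, gpu_count: int) -> float:
    """All-in facility power available per GPU."""
    return site_power_w / gpu_count


def watts_per_home(site_power_w: float, homes: int) -> float:
    """Average continuous draw per home implied by the equivalence."""
    return site_power_w / homes


if __name__ == "__main__":
    print(f"{watts_per_gpu(SITE_POWER_W, GPU_COUNT):,.0f} W per GPU")       # 2,667 W per GPU
    print(f"{watts_per_home(SITE_POWER_W, HOMES_POWERED):,.0f} W per home")  # 1,200 W per home
```

Roughly 2.7 kW per GPU is more than double the top of the 700-1,200 W chip-level range, which is what one would expect once host CPUs, networking, and cooling overhead are counted against the same budget.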

The Five New Sites (September 2025 Announcement)

In September 2025, OpenAI named five additional sites that expanded the Stargate footprint well beyond Abilene:

Site	Location	Notes
Project Ludicrous	Abilene, TX	Expansion of existing Abilene campus
Frontier Campus	Shackelford County, TX	New construction
SoftBank Milam Data Center	Milam County, TX	SoftBank-led development
Project Jupiter	Doña Ana County, NM	New geographic region
Fifth site	TBD	Not yet disclosed

Four of these sites are being built by Oracle, with an estimated 25,000 on-site construction jobs. Combined with the Abilene flagship, the total planned capacity reaches 5.5+ GW across the named sites, approaching 7 GW when including CoreWeave and other supplementary infrastructure.

The geographic concentration in Texas is deliberate. Texas offers abundant land, relatively permissive permitting for large-scale power projects, and proximity to existing energy infrastructure. The New Mexico site (Project Jupiter) extends the footprint westward, likely to access different power grid interconnections.

The Competition: xAI, Microsoft, Google, Meta

OpenAI is building fast, but it is not building alone. The AI infrastructure race in 2025-2026 involves multiple players deploying at comparable scale.

xAI Colossus

Elon Musk's xAI operates the Colossus cluster in Memphis, Tennessee. The scale trajectory is aggressive:

Timeframe	GPU Count
End of 2025	200,000+ GPUs
January 2026	555,000 GPUs
Target	1,000,000 GPUs

Reported investment in the Colossus buildout runs into the tens of billions of dollars. Musk has stated that "Colossus 2 will be the world's first gigawatt-scale AI training supercomputer." The speed of xAI's buildout has surprised industry observers: going from zero to 200,000 GPUs in under a year required bypassing conventional data center construction timelines.

Microsoft Azure

Microsoft committed $80 billion in capital expenditure for its 2025 fiscal year, with over half allocated to US-based infrastructure. OpenAI's 7-year commitment to purchase Azure services, valued at approximately $250 billion, runs from 2025 through 2031. In the UK, Microsoft partnered with Nscale to deploy what it calls the country's largest supercomputer, equipped with over 23,000 Nvidia GPUs.

Google and Meta

Google continues to expand its TPU (Tensor Processing Unit) clusters, using custom silicon rather than Nvidia GPUs. The advantage is cost efficiency at scale; the disadvantage is a narrower software ecosystem. Google's custom TPU pods, connected through proprietary interconnect fabrics, achieve training throughput that competes with Nvidia-based clusters at lower per-FLOP cost. The tradeoff is software compatibility: the TPU ecosystem lacks the breadth of libraries, frameworks, and community support that CUDA provides.

Meta has similarly invested billions in its own AI training infrastructure, combining Nvidia GPUs with custom-designed AI accelerator chips. Meta's approach differs from Google's in that it deploys Nvidia hardware for its largest frontier model training runs while using custom silicon for inference workloads where cost optimization matters more than raw capability.

The Competitive Dynamic

Taken together, the planned AI compute capacity across these players exceeds anything the data center industry has previously attempted. Supply chain constraints for power, cooling equipment, and fiber optic cabling, rather than GPU manufacturing, have become the primary bottlenecks.

One underappreciated aspect of this competition is the timeline pressure it creates. When xAI deployed 200,000 GPUs in under a year, it demonstrated that construction speed, not just scale, is a competitive variable. OpenAI's multi-site Stargate strategy distributes construction risk across locations and partners, reducing the chance that a single permitting delay or supply chain disruption blocks the entire program. This is infrastructure portfolio management applied to AI compute.
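
The risk-distribution argument can be made concrete with a toy probability model. This is a minimal sketch, not a description of how OpenAI assesses risk: the 30% per-site delay probability is a made-up illustrative value, and the model assumes delays are independent across sites, which is optimistic given shared suppliers and grid operators.

```python
def prob_all_sites_delayed(p_delay: float, n_sites: int) -> float:
    """Probability that every site slips, assuming independent, equal delay odds."""
    if not 0.0 <= p_delay <= 1.0:
        raise ValueError("p_delay must be a probability in [0, 1]")
    if n_sites < 1:
        raise ValueError("n_sites must be at least 1")
    return p_delay ** n_sites


if __name__ == "__main__":
    P_DELAY = 0.3  # hypothetical per-site delay probability
    for n in (1, 6):
        print(n, "site(s):", prob_all_sites_delayed(P_DELAY, n))
```

With one site, a 30% delay risk stalls the whole program 30% of the time; with six independent sites the chance that all of them slip falls below 0.1%. Correlated risks, such as a transformer shortage hitting every site at once, would narrow that gap considerably.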

Why Raw Compute Matters: Scaling Laws and Abundant Intelligence

The infrastructure investment follows from a specific technical thesis: that scaling up compute, combined with more data and algorithmic improvements, produces more capable AI systems. This is not a new idea. The trajectory from GPT-2 to GPT-4 illustrates it clearly: roughly 4.5 to 6 orders of magnitude of effective compute growth produced qualitative leaps in capability.

The scaling hypothesis suggests that current models are still far from the performance ceilings that additional compute could unlock. Sam Altman articulated the vision in his "Abundant Intelligence" essay: "If AI stays on the trajectory that we think it will, then amazing things will be possible. Maybe with 10 gigawatts of compute, AI can figure out how to cure cancer."

This is the framing that justifies spending hundreds of billions on infrastructure. If each order of magnitude of compute produces meaningful capability improvements, and if the path to AGI-level systems requires several more orders of magnitude, then the infrastructure buildout is not speculative. It is the prerequisite.

Altman has also described the operational ambition in concrete terms: "Our vision is simple: we want to create a factory that can produce a gigawatt of new AI infrastructure every week." That statement, if taken literally, implies annual deployment rates far beyond anything the data center industry has achieved to date.
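
A quick calculation shows what a gigawatt per week would mean against Stargate's own plan. The sketch below uses only the quoted rate and the nearly 7 GW planned-capacity figure from the September 2025 announcement; the function name is illustrative.

```python
WEEKS_PER_YEAR = 52
STARGATE_PLANNED_GW = 7.0  # "nearly 7 gigawatts of planned capacity"


def annual_buildout_gw(gw_per_week: float = 1.0) -> float:
    """Capacity added per year at a constant weekly build rate."""
    return gw_per_week * WEEKS_PER_YEAR


if __name__ == "__main__":
    yearly = annual_buildout_gw()
    print(f"{yearly:.0f} GW per year")                                  # 52 GW per year
    print(f"{yearly / STARGATE_PLANNED_GW:.1f}x Stargate's planned capacity")  # 7.4x
```

At that rate, the factory would add the equivalent of the entire planned Stargate footprint roughly every seven weeks.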

A projection that circulated in AI research circles in 2025 suggested that by 2027, top AI labs could train a GPT-4-level model in approximately one minute. Whether or not that exact timeline materializes, the direction is clear: compute is growing faster than most people appreciate, and the capabilities that compute enables follow.

The Energy Problem: Gigawatts, Water, and Nuclear

Building data centers at this scale runs directly into energy constraints. The International Energy Agency reported in 2025 that data center electricity consumption surged significantly, with AI workloads as the primary driver.

Power Consumption Realities

Modern AI training clusters do not draw steady power. They fluctuate between maximum and minimum load almost instantaneously, depending on the training phase. This creates grid stability challenges that traditional power plants were not designed to handle. A 1.2 GW facility like Abilene cycling between 30% and 100% load creates demand spikes that require specialized grid infrastructure.

The IEA projects that US data center power consumption will double by 2030, driven primarily by AI training and inference workloads. Stargate's 7 GW of planned capacity, if fully operational, would represent a significant fraction of that growth.
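
Two numbers follow directly from the figures in this section: the size of the load swing at Abilene and the annual energy that 7 GW would draw. The sketch below assumes continuous operation at full capacity for the energy figure, which is an upper bound rather than a forecast; function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def load_swing_mw(capacity_gw: float, low_frac: float, high_frac: float) -> float:
    """Size of the demand swing, in MW, between two load fractions."""
    return capacity_gw * 1000.0 * (high_frac - low_frac)


def annual_energy_twh(capacity_gw: float, utilization: float = 1.0) -> float:
    """Energy drawn over a year, in TWh, at a given average utilization."""
    return capacity_gw * HOURS_PER_YEAR * utilization / 1000.0


if __name__ == "__main__":
    print(f"{load_swing_mw(1.2, 0.30, 1.00):.0f} MW swing at Abilene")   # 840 MW
    print(f"{annual_energy_twh(7.0):.1f} TWh per year at 7 GW")          # 61.3 TWh
```

An 840 MW swing is on the order of a large power plant switching on or off, which is why the grid-stability concern is more than a footnote.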

Water Usage

Texas data centers consumed approximately 5 billion gallons of water in recent years for cooling. Evaporative cooling, the most common method for large-scale data centers, requires continuous water supply. Adding multiple gigawatt-scale facilities in Texas raises questions about water availability that have not been fully addressed in public planning documents.

The water constraint interacts with the power constraint in ways that compound the difficulty. Liquid cooling systems (necessary for GB200-class hardware) can operate in closed-loop configurations that reduce water consumption compared to evaporative towers, but they require more energy for chillers and heat exchangers. The engineering tradeoff between water usage and power consumption becomes a facility-level optimization problem with no clean solution, only different points on a curve.

Small Modular Reactors (SMRs)

The nuclear industry has responded to AI energy demand with a surge in SMR pre-orders. Capacity booked for SMRs grew from 25 GW at the end of 2024 to 45 GW in 2025, driven almost entirely by data center operators seeking carbon-neutral baseload power. Whether SMRs can be deployed at the pace the AI industry requires remains an open question. Current regulatory timelines for nuclear construction in the United States are measured in years, not months.

The $1 Trillion Deal Landscape

OpenAI's infrastructure procurement extends well beyond the Stargate joint venture. By late 2025, CNBC reported that the combined value of AI infrastructure deals involving OpenAI exceeded $1 trillion. The breakdown reveals the scale of commitments:

Partner	Deal Value	Details
AMD	$90B	6 GW capacity agreement
Broadcom	$350B	10 GW AI accelerator design partnership
Amazon AWS	$38B	Cloud infrastructure
Oracle	$300B	5-year cloud contract
Microsoft	$250B	7-year Azure commitment (2025-2031)
CoreWeave	$22.4B	GPU cloud infrastructure
Nvidia	$100B	GPU leasing and procurement

The AMD and Broadcom figures are particularly significant. AMD's 6 GW capacity agreement suggests OpenAI is hedging against Nvidia supply constraints by securing alternative silicon sources. Broadcom's involvement in AI accelerator design indicates interest in custom chip development, following a path that Google (TPU) and Amazon (Trainium/Inferentia) have already taken.

The deal structure also reveals OpenAI's multi-vendor strategy. No single provider accounts for a majority of the committed value. This diversification reduces supply chain risk but increases integration complexity.

It also raises a question about execution. Committing to infrastructure deals worth over $1 trillion across seven or more partners requires coordinating construction timelines, hardware delivery schedules, power procurement, and network interconnection across dozens of sites. The operational complexity of managing this portfolio may prove as challenging as the technical problems the compute is meant to solve. OpenAI is simultaneously an AI research lab and one of the largest infrastructure procurement organizations on the planet. Whether a single entity can excel at both remains unproven.

Technical Architecture: The Nvidia GB200 NVL72

The hardware deployed across Stargate sites is the Nvidia GB200 NVL72, a rack-scale system built on Blackwell-architecture GPUs. Understanding its specifications clarifies why power density has become such a challenge.

GB200 NVL72 Specifications

Component	Specification
GPUs per rack	72 Blackwell GPUs + 36 Grace CPUs
Memory	Up to 17 TB LPDDR5X + 13.5 TB HBM3e
NVLink bandwidth	130 TB/s low-latency GPU interconnect
Single GPU power draw	700-1,200 watts
Modern AI rack power (8 GPU)	~76.4 kW standard, 120-140 kW high-density

The NVLink domain of 130 TB/s is the critical number. AI training performance depends not just on individual GPU speed but on the bandwidth available for GPU-to-GPU communication. The NVL72 configuration creates a single NVLink domain spanning 72 GPUs, allowing large model shards to communicate without traversing slower network links.
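
Dividing the aggregate figure by the GPU count shows what each GPU gets inside the domain. The sketch below also compares that share with a slower link leaving the rack; the 400 Gb/s network interface is an assumed value chosen for illustration, not a figure from this article.

```python
NVLINK_DOMAIN_TBPS = 130.0   # aggregate NVLink bandwidth, TB/s (from the spec table)
GPUS_PER_DOMAIN = 72

# Assumption for illustration only: one 400 Gb/s network interface per GPU.
ASSUMED_NIC_GBITS = 400.0


def per_gpu_nvlink_tbps(domain_tbps: float, gpus: int) -> float:
    """Each GPU's share of the aggregate NVLink bandwidth, in TB/s."""
    return domain_tbps / gpus


def nvlink_vs_nic_ratio(per_gpu_tbps: float, nic_gbits: float) -> float:
    """How many times faster the NVLink share is than the assumed NIC."""
    nic_tbps = nic_gbits / 8.0 / 1000.0  # Gb/s -> GB/s -> TB/s
    return per_gpu_tbps / nic_tbps


if __name__ == "__main__":
    share = per_gpu_nvlink_tbps(NVLINK_DOMAIN_TBPS, GPUS_PER_DOMAIN)
    print(f"{share:.2f} TB/s per GPU")                                   # 1.81 TB/s per GPU
    print(f"{nvlink_vs_nic_ratio(share, ASSUMED_NIC_GBITS):.0f}x the assumed NIC")  # 36x
```

The per-GPU share of about 1.8 TB/s lines up with Nvidia's quoted per-GPU bandwidth for fifth-generation NVLink. Under the assumed NIC speed, traffic that leaves the rack moves dozens of times slower than traffic that stays inside it.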

The power density numbers explain why new data center construction is necessary rather than retrofitting existing facilities. A high-density rack drawing 140 kW requires specialized cooling (liquid cooling is essentially mandatory at these densities) and power distribution that conventional data centers simply do not support.

To put the density in perspective: a typical commercial office building consumes roughly 5-10 watts per square foot. A GB200 NVL72 rack in a high-density configuration consumes 140 kW in roughly 10 square feet of floor space. That is approximately 14,000 watts per square foot, about three orders of magnitude above the office figure and far beyond what conventional building electrical infrastructure is designed to deliver. Every element of the facility, from the transformer yard to the rack-level power distribution units, must be designed from scratch for AI workloads.
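
The density comparison is simple enough to verify directly. The sketch below uses the rack and office figures from the paragraph above; note that it compares a rack's own footprint with a whole-building average, so it overstates the gap relative to a data hall that includes aisles and support space.

```python
import math


def watts_per_sqft(power_w: float, area_sqft: float) -> float:
    """Power density over a given floor area."""
    return power_w / area_sqft


def orders_of_magnitude(ratio: float) -> float:
    """Base-10 logarithm of a ratio."""
    return math.log10(ratio)


if __name__ == "__main__":
    rack = watts_per_sqft(140_000, 10)          # 140 kW over ~10 sq ft
    print(f"rack density: {rack:,.0f} W/sq ft")  # 14,000 W/sq ft
    for office in (5, 10):                       # office range, W/sq ft
        ratio = rack / office
        print(f"vs {office} W/sq ft: {ratio:,.0f}x, "
              f"{orders_of_magnitude(ratio):.1f} orders of magnitude")
```

The ratio works out to between 1,400x and 2,800x, or a little over three orders of magnitude.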

The memory specifications also matter for practical performance. The 13.5 TB of HBM3e memory across 72 GPUs enables keeping large model parameters resident in GPU memory during training, reducing the need to move data between GPU memory and external storage. Combined with the 130 TB/s NVLink bandwidth, this allows the 72-GPU domain to operate almost as a single processor for models that fit within the aggregate memory pool. For training runs whose state exceeds 13.5 TB (and frontier-scale training is approaching this limit), the communication overhead between NVLink domains becomes the primary performance bottleneck.
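
How large a model fits in 13.5 TB depends heavily on what is stored per parameter. The sketch below is a rough upper bound under two stated assumptions: 2 bytes per parameter for bf16 weights alone, and 16 bytes per parameter for mixed-precision training with Adam (bf16 weights and gradients plus fp32 master weights and two optimizer moments). It ignores activations and other buffers, so real limits are lower.

```python
HBM_TOTAL_BYTES = 13.5e12    # 13.5 TB of HBM3e across the rack (from the spec table)
GPUS_PER_DOMAIN = 72

# Assumed per-parameter storage costs, in bytes.
BYTES_WEIGHTS_ONLY_BF16 = 2
BYTES_MIXED_PRECISION_ADAM = 16  # 2 weights + 2 grads + 12 fp32 optimizer state


def hbm_per_gpu_gb(total_bytes: float, gpus: int) -> float:
    """HBM capacity per GPU, in GB."""
    return total_bytes / gpus / 1e9


def max_params(total_bytes: float, bytes_per_param: int) -> float:
    """Upper bound on parameter count that fits, ignoring activations."""
    return total_bytes / bytes_per_param


if __name__ == "__main__":
    print(f"{hbm_per_gpu_gb(HBM_TOTAL_BYTES, GPUS_PER_DOMAIN):.1f} GB per GPU")  # 187.5 GB
    print(f"{max_params(HBM_TOTAL_BYTES, BYTES_WEIGHTS_ONLY_BF16):.2e} params, weights only")
    print(f"{max_params(HBM_TOTAL_BYTES, BYTES_MIXED_PRECISION_ADAM):.2e} params, training state")
```

Under these assumptions, a single NVL72 domain could hold the weights of a model with several trillion parameters, but only the full training state of one with under a trillion. That gap is why training spills across NVLink domains well before the weights alone would require it.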

What This Means for AGI and Agent Infrastructure

The infrastructure buildout is not an end in itself. It serves a thesis about what is required to build more capable AI systems, and ultimately, artificial general intelligence.

Compute as the Rate-Limiting Constraint

The current generation of frontier AI models (GPT-4 class and successors) trains on clusters of tens of thousands of GPUs for months. Moving to hundreds of thousands of GPUs, as Stargate and Colossus enable, does not merely speed up training by a linear factor. Larger clusters enable training larger models, which exhibit qualitatively different capabilities. This is the scaling hypothesis in practice.

If compute continues to be the primary bottleneck, then the organizations controlling the largest clusters will have a structural advantage in developing the most capable systems. This is why the competition between OpenAI, xAI, Google, and Meta has shifted from purely algorithmic innovation to an infrastructure arms race.

Distributed Inference and Agent Architecture

Training is only half the equation. As AI systems become agents that perform extended sequences of actions (reading, writing, calling APIs, making decisions), the inference workload grows dramatically. A single agent interaction might require dozens of model calls, each consuming compute.

The industry is responding with new architectural approaches. The llm-d project, backed by Red Hat, Google, CoreWeave, and IBM, disaggregates LLM inference into separate prefill and decode phases, allowing each to scale independently. This is a shift from treating inference as a monolithic operation to treating it as a distributed systems problem.
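
The benefit of splitting prefill from decode shows up in a simple capacity estimate. The sketch below is a toy sizing model, not llm-d's API or scheduler: the `Workload` type, the `pool_sizes` function, and every throughput number are hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    requests_per_s: float
    prompt_tokens: int   # tokens processed in the prefill phase
    output_tokens: int   # tokens generated in the decode phase


def pool_sizes(w: Workload, prefill_tok_per_s: float, decode_tok_per_s: float) -> tuple[int, int]:
    """Workers needed in each pool when prefill and decode scale independently."""
    prefill = math.ceil(w.requests_per_s * w.prompt_tokens / prefill_tok_per_s)
    decode = math.ceil(w.requests_per_s * w.output_tokens / decode_tok_per_s)
    return prefill, decode


if __name__ == "__main__":
    # Hypothetical per-worker throughputs: prefill is parallel, decode is sequential.
    PREFILL_TPS, DECODE_TPS = 20_000.0, 1_500.0
    chat = Workload(requests_per_s=50, prompt_tokens=2_000, output_tokens=300)
    agent = Workload(requests_per_s=50, prompt_tokens=8_000, output_tokens=100)
    print("chat :", pool_sizes(chat, PREFILL_TPS, DECODE_TPS))   # (5, 10)
    print("agent:", pool_sizes(agent, PREFILL_TPS, DECODE_TPS))  # (20, 4)
```

In this toy model, a long-context agent workload flips the ratio between the two pools. A monolithic deployment would have to provision every worker for both phases, while a disaggregated one can grow only the pool that is actually saturated.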

Multi-cloud agent architecture is becoming a common pattern for enterprise AI deployment. When agents need to orchestrate calls across multiple providers, data centers, and geographic regions, the underlying infrastructure must support low-latency, high-throughput inference at scale. Stargate and similar projects are building the training side. The inference side requires an equally deliberate architectural approach.

The Path Forward

OpenAI's Stargate program represents one vision of how to get to AGI: build the largest possible compute clusters, train the largest possible models, and trust that scaling produces capability. It is a resource-intensive approach, but the historical trajectory from GPT-2 to GPT-4 gives it empirical support.

The counterarguments are worth stating. Algorithmic improvements could reduce the compute required for equivalent capability, making raw scale less decisive. Custom silicon (Google's TPUs, Amazon's Trainium, potentially Broadcom-designed accelerators for OpenAI) could shift the economics. And energy constraints, water usage, and grid capacity could slow the buildout regardless of financial commitments.

What is not in dispute is that the 2025-2027 period will see more compute deployed for AI than in all prior years combined. The organizations building that compute are making bets measured in hundreds of billions of dollars. The outcomes of those bets will shape what AI can do for the next decade.

The Stargate program, for all its scale, is one piece of a larger transformation. The age of AI is being built on physical infrastructure: fiber optic cables, cooling towers, nuclear reactors, and silicon wafers. The intelligence that emerges from this infrastructure will be shaped as much by the properties of the hardware as by the algorithms that run on it. Understanding the infrastructure is understanding the constraints within which AI capability will evolve.

Sources:

OpenAI, "Five New Stargate Sites," September 2025. openai.com/index/five-new-stargate-sites/
OpenAI, "Building the Compute Infrastructure for the Intelligence Age." openai.com/index/building-the-compute-infrastructure-for-the-intelligence-age/
Sam Altman, "Abundant Intelligence." blog.samaltman.com/abundant-intelligence
Data Center Dynamics, "OpenAI and Oracle to Deploy 450,000 GB200 GPUs at Stargate Abilene." datacenterdynamics.com
CNBC, "A Guide to $1 Trillion Worth of AI Deals Between OpenAI, Nvidia," October 2025. cnbc.com
IEA, "Data Centre Electricity Use Surged in 2025." iea.org
