The Infrastructure Buildout That Redefined Scale
In January 2025, OpenAI announced the Stargate project from the White House with a $500 billion headline figure. By September 2025, the plan had crystallized into something more concrete: six data center sites, nearly 7 gigawatts of planned power capacity, and over $400 billion in investment over three years. The numbers moved past press release territory into procurement contracts, land acquisitions, and construction schedules.
This is not one company building one data center. Stargate is a multi-site infrastructure program measured in gigawatts and hundreds of thousands of GPUs, with a headline budget that invites comparison to national projects such as the Interstate Highway System. And OpenAI is not alone. xAI is building its own competing cluster in Memphis. Microsoft committed $80 billion in capital expenditure for Azure in fiscal year 2025 alone. The combined value of AI infrastructure deals involving OpenAI crossed $1 trillion by late 2025.
The stakes are straightforward. If compute is the primary bottleneck on the path to more capable AI systems, then the organizations that control the most compute will shape what those systems can do. Stargate is OpenAI's answer to that constraint. Here is what it looks like in practice.
What Stargate Is: Architecture and Partners
Stargate was announced on January 21, 2025 at the White House, framed as a joint venture between OpenAI, SoftBank, Oracle, and MGX (an Abu Dhabi-based technology investment firm). The initial commitment was $100 billion, with a stated goal of scaling to $500 billion over four years.
The partnership structure matters for understanding who builds what:
| Partner | Role |
|---|---|
| OpenAI | Primary tenant and operator of compute |
| SoftBank | Financial backing and data center development |
| Oracle | Data center construction and cloud infrastructure |
| MGX | Strategic investment and capital allocation |
| Crusoe | Developer of the Abilene flagship site |
| CoreWeave | Additional GPU cloud infrastructure |
Oracle serves as the primary builder for four of the six planned sites. SoftBank brings both capital and its own data center development capabilities. Crusoe, a cloud computing company focused on sustainable infrastructure, is building the flagship Abilene facility. CoreWeave provides supplementary GPU capacity outside the main Stargate footprint.
By September 2025, OpenAI announced an expansion that added five new sites to the original Abilene location. The company's statement was direct: "The combined capacity from these five new sites, along with our flagship site in Abilene, Texas, and ongoing projects with CoreWeave, brings Stargate to nearly 7 gigawatts of planned capacity and over $400 billion in investment over the next three years."
The Numbers: Six Sites, 7 Gigawatts, 450,000 GPUs
The Abilene flagship site anchors the entire program. Located on 875 acres in Abilene, Texas, it contains 8 buildings totaling approximately 4 million square feet. Larry Ellison confirmed the GPU count: over 450,000 Nvidia GB200 GPUs deployed across the facility. Oracle refers to this site internally as "Zettascale10."
Abilene Flagship Specifications
| Metric | Value |
|---|---|
| Location | Abilene, Texas |
| Land area | 875 acres |
| GPU count | 450,000+ Nvidia GB200 |
| Power capacity | 1.2 GW |
| Equivalent homes powered | ~1,000,000 |
| Buildings | 8 |
| Total floor space | ~4 million sq ft |
| Developer | Crusoe |
| Oracle designation | Zettascale10 |
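The table's figures imply a rough power budget per GPU. The sketch below divides facility power by GPU count. It assumes the 1.2 GW figure covers everything on site (cooling, CPUs, networking, conversion losses), so the result is an all-in facility number, not the draw of the GPU itself. The constant names are illustrative.

```python
# Back-of-envelope arithmetic using the Abilene table's figures.
FACILITY_POWER_W = 1.2e9     # 1.2 GW power capacity
GPU_COUNT = 450_000          # "450,000+", treated here as exactly 450,000
HOMES_POWERED = 1_000_000    # "~1,000,000" equivalent homes

# All-in facility power available per GPU (includes cooling and overhead).
watts_per_gpu = FACILITY_POWER_W / GPU_COUNT

# Average household draw implied by the "homes powered" comparison.
watts_per_home = FACILITY_POWER_W / HOMES_POWERED

print(f"All-in facility power per GPU: {watts_per_gpu / 1000:.2f} kW")
print(f"Implied average draw per home: {watts_per_home / 1000:.2f} kW")
```

Roughly 2.7 kW of facility capacity per GPU is well above the chip's own power draw, which is consistent with a large share of site power going to cooling, CPUs, and networking.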
The Five New Sites (September 2025 Announcement)
In September 2025, OpenAI named five additional sites that expanded the Stargate footprint well beyond Abilene:
| Site | Location | Notes |
|---|---|---|
| Project Ludicrous | Abilene, TX | Expansion of existing Abilene campus |
| Frontier Campus | Shackelford County, TX | New construction |
| SoftBank Milam Data Center | Milam County, TX | SoftBank-led development |
| Project Jupiter | Doña Ana County, NM | New geographic region |
| Fifth site | TBD | Not yet disclosed |
Four of these sites are being built by Oracle, with an estimated 25,000 on-site construction jobs. Combined with the Abilene flagship, the total planned capacity reaches 5.5+ GW across the named sites, approaching 7 GW when including CoreWeave and other supplementary infrastructure.
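The capacity figures above can be decomposed with simple subtraction. This is a sketch under the assumption that the quoted "5.5+" and "nearly 7" values are treated as exact and that Abilene contributes the 1.2 GW listed in its specification table; actual per-site capacities are not given in the text.

```python
# Planned-capacity arithmetic from the figures quoted above (gigawatts).
ABILENE_GW = 1.2         # flagship site
NAMED_SITES_GW = 5.5     # Abilene plus the five new sites ("5.5+ GW")
PROGRAM_TOTAL_GW = 7.0   # including CoreWeave and other infrastructure
NEW_SITE_COUNT = 5

# Capacity attributable to the five new sites combined.
new_sites_gw = NAMED_SITES_GW - ABILENE_GW

# Average per new site (an average only; real sites will differ).
avg_new_site_gw = new_sites_gw / NEW_SITE_COUNT

# Capacity coming from outside the named sites.
supplementary_gw = PROGRAM_TOTAL_GW - NAMED_SITES_GW

print(f"Five new sites combined: {new_sites_gw:.1f} GW")
print(f"Average per new site:    {avg_new_site_gw:.2f} GW")
print(f"Supplementary capacity:  {supplementary_gw:.1f} GW")
```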
The geographic concentration in Texas is deliberate. Texas offers abundant land, relatively permissive permitting for large-scale power projects, and proximity to existing energy infrastructure. The New Mexico site (Project Jupiter) extends the footprint westward, likely to access different power grid interconnections.
The Competition: xAI, Microsoft, Google, Meta
OpenAI is building fast, but it is not building alone. The AI infrastructure race in 2025-2026 involves multiple players deploying at comparable scale.
xAI Colossus
Elon Musk's xAI operates the Colossus cluster in Memphis, Tennessee. The scale trajectory is aggressive:
| Timeframe | GPU Count |
|---|---|
| End of 2025 | 200,000+ GPUs |
| January 2026 | 555,000 GPUs |
| Target | 1,000,000 GPUs |
Reported investment in the Colossus site runs into the tens of billions of dollars. Musk has stated that "Colossus 2 will be the world's first gigawatt-scale AI training supercomputer." The speed of xAI's buildout has surprised industry observers: going from zero to 200,000 GPUs in under a year required bypassing conventional data center construction timelines.
Microsoft Azure
Microsoft committed $80 billion in capital expenditure for its 2025 fiscal year, with over half allocated to US-based infrastructure. OpenAI, in turn, has committed to purchasing approximately $250 billion of Azure services over seven years, from 2025 through 2031. In the UK, Microsoft partnered with Nscale to deploy what it calls the country's largest supercomputer, equipped with over 23,000 Nvidia GPUs.
Google and Meta
Google continues to expand its TPU (Tensor Processing Unit) clusters, using custom silicon rather than Nvidia GPUs. Its TPU pods, connected through proprietary interconnect fabrics, achieve training throughput that competes with Nvidia-based clusters at lower per-FLOP cost. The tradeoff is software compatibility: the TPU ecosystem lacks the breadth of libraries, frameworks, and community support that CUDA provides.
Meta has similarly invested billions in its own AI training infrastructure, combining Nvidia GPUs with custom-designed AI accelerator chips. Meta's approach differs from Google's in that it deploys Nvidia hardware for its largest frontier model training runs while using custom silicon for inference workloads where cost optimization matters more than raw capability.
The Competitive Dynamic
Taken together, the AI compute capacity planned by the major players exceeds anything the data center industry has previously attempted. The primary bottlenecks are now supply chain constraints for power, cooling equipment, and fiber optic cabling, not GPU manufacturing.
One underappreciated aspect of this competition is the timeline pressure it creates. When xAI deployed 200,000 GPUs in under a year, it demonstrated that construction speed, not just scale, is a competitive variable. OpenAI's multi-site Stargate strategy distributes construction risk across locations and partners, reducing the chance that a single permitting delay or supply chain disruption blocks the entire program. This is infrastructure portfolio management applied to AI compute.
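The portfolio argument can be made concrete with a small probability sketch. It assumes each site is delayed independently with the same probability, which is a simplification (shared suppliers and grid constraints would correlate delays), and the 20% delay probability is purely hypothetical.

```python
from math import comb


def prob_at_least_k_on_time(n_sites: int, k: int, p_delay: float) -> float:
    """P(at least k of n sites finish on time), assuming independent delays."""
    p_ok = 1.0 - p_delay
    return sum(
        comb(n_sites, i) * p_ok**i * p_delay ** (n_sites - i)
        for i in range(k, n_sites + 1)
    )


P_DELAY = 0.2  # hypothetical per-site delay probability, not a sourced figure

# One site carrying the whole program: any delay blocks everything.
single_site = prob_at_least_k_on_time(1, 1, P_DELAY)

# Six sites: chance that at least one, or at least half, arrive on time.
any_of_six = prob_at_least_k_on_time(6, 1, P_DELAY)
half_of_six = prob_at_least_k_on_time(6, 3, P_DELAY)

print(f"Single site on time:       {single_site:.4f}")
print(f"At least 1 of 6 on time:   {any_of_six:.6f}")
print(f"At least 3 of 6 on time:   {half_of_six:.4f}")
```

Under these assumptions the expected share of capacity delivered on time is the same in both cases; what the multi-site approach changes is the probability of a total stall.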
Why Raw Compute Matters: Scaling Laws and Abundant Intelligence
The infrastructure investment follows from a specific technical thesis: that scaling up compute, combined with more data and algorithmic improvements, produces more capable AI systems. This is not a new idea. The trajectory from GPT-2 to GPT-4 illustrates it clearly: roughly 4.5 to 6 orders of magnitude of effective compute growth produced qualitative leaps in capability.
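An order of magnitude is a factor of ten, so the range quoted above converts to a multiplier as follows. The release years used for the per-year rate (GPT-2 in 2019, GPT-4 in 2023) are an added assumption for illustration, not part of the original estimate.

```python
def oom_to_factor(ooms: float) -> float:
    """Convert orders of magnitude to a plain multiplier."""
    return 10.0 ** ooms


LOW_OOMS, HIGH_OOMS = 4.5, 6.0   # effective-compute growth quoted above
YEARS_ELAPSED = 2023 - 2019      # assumed GPT-2 to GPT-4 gap

low_factor = oom_to_factor(LOW_OOMS)
high_factor = oom_to_factor(HIGH_OOMS)

# Average growth rate per year, in orders of magnitude.
low_ooms_per_year = LOW_OOMS / YEARS_ELAPSED
high_ooms_per_year = HIGH_OOMS / YEARS_ELAPSED

print(f"Effective compute grew {low_factor:,.0f}x to {high_factor:,.0f}x")
print(f"That is {low_ooms_per_year:.2f} to {high_ooms_per_year:.2f} OOMs per year")
```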
The scaling hypothesis suggests that current models are still far from the performance ceilings that additional compute could unlock. Sam Altman articulated the vision in his "Abundant Intelligence" essay: "If AI stays on the trajectory that we think it will, then amazing things will be possible. Maybe with 10 gigawatts of compute, AI can figure out how to cure cancer."
This is the framing that justifies spending hundreds of billions on infrastructure. If each order of magnitude of compute produces meaningful capability improvements, and if the path to AGI-level systems requires several more orders of magnitude, then the infrastructure buildout is not speculative. It is the prerequisite.
Altman has also described the operational ambition in concrete terms: "Our vision is simple: we want to create a factory that can produce a gigawatt of new AI infrastructure every week." That statement, if taken literally, implies annual infrastructure deployment rates that would dwarf anything in human industrial history.
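Taking the gigawatt-per-week figure literally, the annual arithmetic looks like this. The GPU estimate reuses Abilene's ratio of GPUs to facility power, which is an assumption; future hardware generations would change that ratio.

```python
# What "a gigawatt of new AI infrastructure every week" implies per year.
GW_PER_WEEK = 1.0
WEEKS_PER_YEAR = 52
STARGATE_PLANNED_GW = 7.0    # "nearly 7 gigawatts" of planned capacity
ABILENE_GPUS = 450_000       # Abilene GPU count
ABILENE_GW = 1.2             # Abilene power capacity

gw_per_year = GW_PER_WEEK * WEEKS_PER_YEAR

# Weeks such a factory would need to reproduce all of planned Stargate.
weeks_to_match_stargate = STARGATE_PLANNED_GW / GW_PER_WEEK

# GPU throughput if every gigawatt matched Abilene's GPU density (assumption).
gpus_per_gw = ABILENE_GPUS / ABILENE_GW
gpus_per_year = gpus_per_gw * gw_per_year

print(f"Annual deployment: {gw_per_year:.0f} GW")
print(f"Weeks to match planned Stargate: {weeks_to_match_stargate:.0f}")
print(f"Implied GPUs per year: {gpus_per_year:,.0f}")
```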
A projection that circulated in AI research circles in 2025 suggested that by 2027, top AI labs could train a GPT-4-level model in approximately one minute. Whether or not that exact timeline materializes, the direction is clear: compute is growing faster than most people appreciate, and the capabilities that compute enables follow.
The Energy Problem: Gigawatts, Water, and Nuclear
Building data centers at this scale runs directly into energy constraints. The International Energy Agency reported in 2025 that data center electricity consumption surged significantly, with AI workloads as the primary driver.
Power Consumption Realities
Modern AI training clusters do not draw steady power. They fluctuate between maximum and minimum load almost instantaneously, depending on the training phase. This creates grid stability challenges that traditional power plants were not designed to handle. A 1.2 GW facility like Abilene cycling between 30% and 100% load creates demand spikes that require specialized grid infrastructure.
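The size of that swing follows directly from the numbers in the paragraph. The ramp time in the sketch is a hypothetical value chosen to illustrate "almost instantaneously"; the text does not give a measured figure.

```python
# Load swing for a facility cycling between 30% and 100% of peak.
PEAK_W = 1.2e9             # Abilene's 1.2 GW
MIN_LOAD_FRACTION = 0.30
MAX_LOAD_FRACTION = 1.00

swing_w = PEAK_W * (MAX_LOAD_FRACTION - MIN_LOAD_FRACTION)


def ramp_rate_mw_per_s(swing_watts: float, ramp_seconds: float) -> float:
    """Average rate of change in demand, in megawatts per second."""
    return swing_watts / 1e6 / ramp_seconds


HYPOTHETICAL_RAMP_S = 1.0  # assumed ramp duration, not a sourced value

print(f"Load swing: {swing_w / 1e6:.0f} MW")
print(f"Ramp rate over {HYPOTHETICAL_RAMP_S:.0f} s: "
      f"{ramp_rate_mw_per_s(swing_w, HYPOTHETICAL_RAMP_S):.0f} MW/s")
```

A swing of about 840 MW is comparable to the full output of a large power plant appearing and disappearing from the grid's point of view.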
The IEA projects that US data center power consumption will double by 2030, driven primarily by AI training and inference workloads. Stargate's 7 GW of planned capacity, if fully operational, would represent a significant fraction of that growth.
Water Usage
Texas data centers consumed approximately 5 billion gallons of water in recent years for cooling. Evaporative cooling, the most common method for large-scale data centers, requires continuous water supply. Adding multiple gigawatt-scale facilities in Texas raises questions about water availability that have not been fully addressed in public planning documents.
The water constraint interacts with the power constraint in ways that compound the difficulty. Liquid cooling systems (necessary for GB200-class hardware) can operate in closed-loop configurations that reduce water consumption compared to evaporative towers, but they require more energy for chillers and heat exchangers. The engineering tradeoff between water usage and power consumption becomes a facility-level optimization problem with no clean solution, only different points on a curve.
Small Modular Reactors (SMRs)
The nuclear industry has responded to AI energy demand with a surge in SMR pre-orders. Capacity booked for SMRs grew from 25 GW at the end of 2024 to 45 GW in 2025, driven almost entirely by data center operators seeking carbon-neutral baseload power. Whether SMRs can be deployed at the pace the AI industry requires remains an open question. Current regulatory timelines for nuclear construction in the United States are measured in years, not months.
The $1 Trillion Deal Landscape
OpenAI's infrastructure procurement extends well beyond the Stargate joint venture. By late 2025, CNBC reported that the combined value of AI infrastructure deals involving OpenAI exceeded $1 trillion. The breakdown reveals the scale of commitments:
| Partner | Deal Value | Details |
|---|---|---|
| AMD | $90B | 6 GW capacity agreement |
| Broadcom | $350B | 10 GW AI accelerator design partnership |
| Amazon AWS | $38B | Cloud infrastructure |
| Oracle | $300B | 5-year cloud contract |
| Microsoft | $250B | 7-year Azure commitment (2025-2031) |
| CoreWeave | $22B | GPU cloud infrastructure |
| Nvidia | $100B | GPU leasing and procurement |
The AMD and Broadcom figures are particularly significant. AMD's 6 GW capacity agreement suggests OpenAI is hedging against Nvidia supply constraints by securing alternative silicon sources. Broadcom's involvement in AI accelerator design indicates interest in custom chip development, following a path that Google (TPU) and Amazon (Trainium/Inferentia) have already taken.
The deal structure also reveals OpenAI's multi-vendor strategy. No single provider accounts for more than a fraction of total planned capacity. This diversification reduces supply chain risk but increases integration complexity.
It also raises a question about execution. Committing to infrastructure deals worth over $1 trillion across seven or more partners requires coordinating construction timelines, hardware delivery schedules, power procurement, and network interconnection across dozens of sites. The operational complexity of managing this portfolio may prove as challenging as the technical problems the compute is meant to solve. OpenAI is simultaneously an AI research lab and one of the largest infrastructure procurement organizations on the planet. Whether a single entity can excel at both remains unproven.
Technical Architecture: The Nvidia GB200 NVL72
The hardware deployed across Stargate sites is the Nvidia GB200 NVL72, a rack-scale system based on the Blackwell architecture. Understanding its specifications clarifies why power density has become such a challenge.
GB200 NVL72 Specifications
| Component | Specification |
|---|---|
| Compute per rack | 72 Blackwell GPUs + 36 Grace CPUs |
| Memory | Up to 17 TB LPDDR5X + 13.5 TB HBM3e |
| NVLink bandwidth | 130 TB/s low-latency GPU interconnect |
| Single GPU power draw | 700-1,200 watts |
| Modern AI rack power (8 GPU) | ~76.4 kW standard, 120-140 kW high-density |
The aggregate NVLink bandwidth of 130 TB/s is the critical number. AI training performance depends not just on individual GPU speed but on the bandwidth available for GPU-to-GPU communication. The NVL72 configuration creates a single NVLink domain spanning 72 GPUs, allowing large model shards to communicate without traversing slower network links.
The power density numbers explain why new data center construction is necessary rather than retrofitting existing facilities. A high-density rack drawing 140 kW requires specialized cooling (liquid cooling is essentially mandatory at these densities) and power distribution that conventional data centers simply do not support.
To put the density in perspective: a typical commercial office building consumes roughly 5-10 watts per square foot. A GB200 NVL72 rack in a high-density configuration consumes 140 kW in roughly 10 square feet of floor space. That is approximately 14,000 watts per square foot, three orders of magnitude beyond what conventional electrical infrastructure can deliver. Every element of the facility, from the transformer yard to the rack-level power distribution units, must be designed from scratch for AI workloads.
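The density comparison can be checked in a few lines. All inputs come from the paragraph above; the 10 square foot rack footprint is the text's own rough figure.

```python
import math

# Power density: high-density GB200 NVL72 rack vs. a typical office building.
RACK_POWER_W = 140_000           # 140 kW high-density rack
RACK_FOOTPRINT_SQFT = 10         # rough floor space per rack
OFFICE_W_PER_SQFT = (5, 10)      # typical commercial office range

rack_density = RACK_POWER_W / RACK_FOOTPRINT_SQFT

# Ratio against each end of the office range, and its order of magnitude.
ratios = [rack_density / office for office in OFFICE_W_PER_SQFT]
ooms = [math.log10(r) for r in ratios]

print(f"Rack density: {rack_density:,.0f} W per sq ft")
for office, ratio, oom in zip(OFFICE_W_PER_SQFT, ratios, ooms):
    print(f"vs {office} W/sq ft office: {ratio:,.0f}x ({oom:.2f} orders of magnitude)")
```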
The memory specifications also matter for practical performance. The 13.5 TB of HBM3e memory across 72 GPUs allows large model parameters to stay resident in GPU memory during training, reducing the need to move data between GPU memory and external storage. Combined with the 130 TB/s NVLink bandwidth, this allows the 72-GPU domain to operate almost as a single processor for models that fit within the aggregate memory pool. For training runs whose state exceeds 13.5 TB (and frontier models are approaching this limit), the communication overhead between NVLink domains becomes the primary performance bottleneck.
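A rough way to see what "fits within the aggregate memory pool" means is to divide the rack totals by 72 and compare against training memory needs. The 16 bytes per parameter figure below is a common rule of thumb for mixed-precision training with the Adam optimizer (weights, gradients, and optimizer state); it is an assumption here, and it ignores activation memory, so real limits are lower.

```python
# Per-GPU share of the NVL72 totals, and a crude model-fit check.
HBM_TOTAL_TB = 13.5              # aggregate HBM3e across the rack
NVLINK_TOTAL_TBPS = 130.0        # aggregate NVLink bandwidth
GPUS_PER_DOMAIN = 72
BYTES_PER_PARAM_TRAINING = 16    # assumed: mixed precision + Adam, no activations

hbm_per_gpu_gb = HBM_TOTAL_TB * 1000 / GPUS_PER_DOMAIN
nvlink_per_gpu_tbps = NVLINK_TOTAL_TBPS / GPUS_PER_DOMAIN

# Largest parameter count whose training state fits in one NVLink domain.
max_params = HBM_TOTAL_TB * 1e12 / BYTES_PER_PARAM_TRAINING


def fits_in_domain(n_params: float) -> bool:
    """True if the assumed training state fits in one rack's HBM."""
    return n_params * BYTES_PER_PARAM_TRAINING <= HBM_TOTAL_TB * 1e12


print(f"HBM per GPU: {hbm_per_gpu_gb:.1f} GB")
print(f"NVLink bandwidth per GPU: {nvlink_per_gpu_tbps:.2f} TB/s")
print(f"Max parameters in one domain: {max_params / 1e9:.0f} billion")
print(f"70B-parameter model fits: {fits_in_domain(70e9)}")
print(f"2T-parameter model fits:  {fits_in_domain(2e12)}")
```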
What This Means for AGI and Agent Infrastructure
The infrastructure buildout is not an end in itself. It serves a thesis about what is required to build more capable AI systems, and ultimately, artificial general intelligence.
Compute as the Rate-Limiting Constraint
The current generation of frontier AI models (GPT-4 class and successors) trains on clusters of tens of thousands of GPUs for months. Moving to hundreds of thousands of GPUs, as Stargate and Colossus enable, does not merely speed up training by a linear factor. Larger clusters enable training larger models, which exhibit qualitatively different capabilities. This is the scaling hypothesis in practice.
If compute continues to be the primary bottleneck, then the organizations controlling the largest clusters will have a structural advantage in developing the most capable systems. This is why the competition between OpenAI, xAI, Google, and Meta has shifted from purely algorithmic innovation to an infrastructure arms race.
Distributed Inference and Agent Architecture
Training is only half the equation. As AI systems become agents that perform extended sequences of actions (reading, writing, calling APIs, making decisions), the inference workload grows dramatically. A single agent interaction might require dozens of model calls, each consuming compute.
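A simple token count shows why agent workloads multiply inference cost. In this sketch every step re-reads the full conversation so far, with no prefix caching, which is a worst-case assumption. The function name and all token counts are hypothetical.

```python
def agent_token_cost(
    steps: int,
    base_prompt_tokens: int,
    tool_result_tokens: int,
    output_tokens: int,
) -> tuple[int, int]:
    """Return (tokens read, tokens generated) for one multi-step agent task.

    Assumes each step re-processes the whole context and that the context
    grows by the model's output plus a tool result after every step.
    """
    tokens_read = 0
    tokens_generated = 0
    context = base_prompt_tokens
    for _ in range(steps):
        tokens_read += context
        tokens_generated += output_tokens
        context += output_tokens + tool_result_tokens
    return tokens_read, tokens_generated


# Hypothetical workload: one chat-style call vs. a 24-step agent task.
single_read, single_generated = agent_token_cost(1, 2000, 500, 300)
agent_read, agent_generated = agent_token_cost(24, 2000, 500, 300)

print(f"Single call: {single_read:,} read, {single_generated:,} generated")
print(f"24-step agent: {agent_read:,} read, {agent_generated:,} generated")
print(f"Read amplification: {agent_read / single_read:.0f}x")
```

Because the context grows with each step, tokens read scale roughly with the square of the step count, while tokens generated scale linearly.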
The industry is responding with new architectural approaches. The llm-d project, backed by Red Hat, Google, CoreWeave, and IBM, disaggregates LLM inference into separate prefill and decode phases, allowing each to scale independently. This is a shift from treating inference as a monolithic operation to treating it as a distributed systems problem.
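The idea of separating prefill from decode can be illustrated with a toy scheduler. This is a conceptual sketch only: it does not use llm-d's API or reflect its implementation, the "model" is a placeholder rule, and every name in it is hypothetical. It shows a prefill stage that processes a whole prompt in one pass and hands its cache to a decode stage that produces one token per step.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: list[int]
    max_new_tokens: int  # assumed to be at least 1
    kv_cache: list[int] = field(default_factory=list)
    output: list[int] = field(default_factory=list)


def prefill(req: Request) -> Request:
    """Prefill stage: process the whole prompt once and build the cache."""
    req.kv_cache = list(req.prompt_tokens)
    return req


def decode_step(req: Request) -> bool:
    """Decode stage: emit one token; return True when the request is done."""
    next_token = (sum(req.kv_cache) + len(req.output)) % 1000  # placeholder model
    req.kv_cache.append(next_token)
    req.output.append(next_token)
    return len(req.output) >= req.max_new_tokens


def serve(requests: list[Request]) -> list[Request]:
    """Run separate prefill and decode queues until all requests finish."""
    prefill_queue = deque(requests)
    decode_queue: deque[Request] = deque()
    finished = []
    while prefill_queue or decode_queue:
        # Prefill pool: admit one new request per tick.
        if prefill_queue:
            decode_queue.append(prefill(prefill_queue.popleft()))
        # Decode pool: advance every in-flight request by one token.
        for _ in range(len(decode_queue)):
            req = decode_queue.popleft()
            if decode_step(req):
                finished.append(req)
            else:
                decode_queue.append(req)
    return finished


done = serve([Request(0, [1, 2, 3], 2), Request(1, [4, 5], 4)])
for req in done:
    print(req.request_id, len(req.output))
```

In a real deployment the two stages would run on different hardware pools sized to their different bottlenecks, with the cache transferred between them; here both are simulated in one loop to keep the example self-contained.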
Multi-cloud agent architecture is becoming the standard for enterprise AI deployment. When agents need to orchestrate calls across multiple providers, data centers, and geographic regions, the underlying infrastructure must support low-latency, high-throughput inference at scale. Stargate and similar projects are building the training side. The inference side requires an equally deliberate architectural approach.
The Path Forward
OpenAI's Stargate program represents one vision of how to get to AGI: build the largest possible compute clusters, train the largest possible models, and trust that scaling produces capability. It is a resource-intensive approach, but the historical trajectory from GPT-2 to GPT-4 gives it empirical support.
The counterarguments are worth stating. Algorithmic improvements could reduce the compute required for equivalent capability, making raw scale less decisive. Custom silicon (Google's TPUs, Amazon's Trainium, potentially Broadcom-designed accelerators for OpenAI) could shift the economics. And energy constraints, water usage, and grid capacity could slow the buildout regardless of financial commitments.
What is not in dispute is that the 2025-2027 period will see more compute deployed for AI than in all prior years combined. The organizations building that compute are making bets measured in hundreds of billions of dollars. The outcomes of those bets will shape what AI can do for the next decade.
The Stargate program, for all its scale, is one piece of a larger transformation. The age of AI is being built on physical infrastructure: fiber optic cables, cooling towers, nuclear reactors, and silicon wafers. The intelligence that emerges from this infrastructure will be shaped as much by the properties of the hardware as by the algorithms that run on it. Understanding the infrastructure is understanding the constraints within which AI capability will evolve.
Sources:
- OpenAI, "Five New Stargate Sites," September 2025. openai.com/index/five-new-stargate-sites/
- OpenAI, "Building the Compute Infrastructure for the Intelligence Age." openai.com/index/building-the-compute-infrastructure-for-the-intelligence-age/
- Sam Altman, "Abundant Intelligence." blog.samaltman.com/abundant-intelligence
- Data Center Dynamics, "OpenAI and Oracle to Deploy 450,000 GB200 GPUs at Stargate Abilene." datacenterdynamics.com
- CNBC, "A Guide to $1 Trillion Worth of AI Deals Between OpenAI, Nvidia," October 2025. cnbc.com
- IEA, "Data Centre Electricity Use Surged in 2025." iea.org