"GPT-5.5 Technical Deep Dive: How OpenAI's Latest Model Achieves New Frontiers in Coding and Reasoning"

What Is GPT-5.5?

OpenAI released GPT-5.5 on April 23, 2026, with a blog post from Greg Brockman. The internal codename is "Spud." The positioning is straightforward: this is the first fully retrained foundation model OpenAI has shipped since GPT-4.5. Every model from GPT-5.0 through GPT-5.4 shared the same pretraining base and improved through post-training iterations. GPT-5.5 was rebuilt from the pretraining layer up.

That distinction matters. Post-training can optimize what a model already knows. It cannot create fundamentally new capabilities that the base model never learned. GPT-5.5's jumps in long-context retrieval, agentic coding, and mathematical reasoning are the kinds of improvements that require a new pretraining foundation, not just more RLHF.

The pricing tells part of the story. Standard API access costs $5 per million input tokens and $30 per million output tokens. That is double GPT-5.4's $2.50/$15. Batch and Flex pricing is $2.50/$15. The Pro variant, designed for research-grade reasoning, costs $30/$180 per million tokens. GPT-5.5 is now the most expensive standard frontier model on the market.

But OpenAI claims GPT-5.5 uses approximately 40% fewer output tokens for equivalent tasks compared to GPT-5.4. If that claim holds in production, the real increase in output-token spend is closer to 20% than 100% (0.6 times the tokens at 2x the price is 1.2x the cost). Input tokens get no such offset and still cost twice as much. The model also matches GPT-5.4's per-token latency despite the larger parameter count, which suggests significant inference optimization work.

The context window is 1 million tokens for input and 128K for output. The Codex integration uses a 400K context window. Three variants are available: Standard, Thinking (with visible chain-of-thought), and Pro (for the hardest reasoning tasks). ChatGPT Plus subscribers get 160 messages per 3 hours plus 3,000 Thinking messages per week. ChatGPT Pro at $200 per month offers unlimited messages. A new $100 tier offers 5x Codex usage for developers who need sustained coding agent sessions.

Here is the complete benchmark scorecard.

The Complete Benchmark Scorecard

Coding Benchmarks

Benchmark	GPT-5.5	GPT-5.4	Delta	Notes
SWE-bench Verified	88.7%	74.9%	+13.8	Real GitHub issue resolution
SWE-bench Pro	58.6%	57.7%	+0.9	Multi-file, agentic environment
Expert-SWE (20hr tasks)	73.1%	68.5%	+4.6	Long-horizon engineering
Terminal-Bench 2.0	82.7%	75.1%	+7.6	All-time high across all models
HumanEval	~95%+	~95%+	Saturated	No longer useful for frontier comparison

Agentic Tool Use Benchmarks

Benchmark	GPT-5.5	Claude Opus 4.7	Notes
GDPval (44 occupations)	84.9%	80.3%	Knowledge work automation
OSWorld-Verified	78.7%	78.0%	GUI automation, above human baseline (72.4%)
MCP-Atlas	75.3%	79.1%	Claude leads on tool protocol standard
Tau2-bench Telecom	98.0%	—	Complex customer service workflows
Toolathlon	55.6%	—	Multi-tool coordination

Reasoning Benchmarks

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Gemini 3.1 Pro
MMLU	92.4%	—	—	—
GPQA Diamond	93.6%	—	94.2%	94.3%
FrontierMath T1-3	51.7%	52.4%	43.8%	—
FrontierMath T4	35.4%	39.6%	22.9%	16.7%
ARC-AGI-2	85.0%	—	—	77.1%
HLE (no tools)	41.4%	43.1%	46.9%	—

Long Context (MRCR v2)

Context Range	GPT-5.5	GPT-5.4	Claude Opus 4.7
4K-8K	98.1%	97.3%	—
128K-256K	87.5%	79.3%	59.2%
256K-512K	81.5%	57.5%	—
512K-1M	74.0%	36.6%	32.2%
Graphwalks BFS	45.4%	9.4%	—

Safety and Cybersecurity

Benchmark	GPT-5.5	GPT-5.4
CyberGym	81.8%	79.0%
Internal CTF	88.1%	83.7%
Preparedness Rating	High (not Critical)	High (not Critical)

Where GPT-5.5 Dominates

The numbers that should shape your planning are these:

Terminal-Bench 2.0 at 82.7%. This is the highest score ever recorded on this benchmark, which tests complex command-line workflows: file operations, script execution, debugging, and tool coordination across multiple steps. GPT-5.5 leads Claude Opus 4.7 by 13.3 points and Gemini 3.1 Pro by 14.2 points. If your use case involves terminal-based agent work, this is a decisive advantage.

SWE-bench Verified at 88.7%. The +13.8 point jump from GPT-5.4's 74.9% is the largest single-version improvement OpenAI has delivered on this benchmark. Real GitHub issue resolution is the most production-relevant coding benchmark available, and GPT-5.5 now leads every competitor.

FrontierMath Tier 4 at 35.4% (Pro: 39.6%). FrontierMath Tier 4 is the hardest mathematical reasoning benchmark in existence. GPT-5.5 Pro at 39.6% is 1.73x Claude Opus 4.7's 22.9% and 2.37x Gemini 3.1 Pro's 16.7%. This is not a post-training optimization. It is evidence that the new pretraining foundation genuinely improved mathematical reasoning at the frontier.

MRCR 512K-1M at 74.0%. The jump from GPT-5.4's 36.6% to 74.0% is a 37.4 point gain, or a 102% relative improvement. Claude Opus 4.7 scores 32.2% at the same range. GPT-5.5 is the first OpenAI model that makes a 1 million token context window practically usable for retrieval tasks. The Graphwalks BFS benchmark shows a nearly 5x improvement (45.4% vs 9.4%), which suggests the model can navigate complex information structures across very long documents.

GDPval at 84.9%. This benchmark tests real-world knowledge work across 44 occupations. GPT-5.5 leads Claude Opus 4.7 by 4.6 points. For enterprise automation use cases that involve document analysis, data processing, and multi-step reasoning across domain knowledge, this is the most representative benchmark available.
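The comparisons in this section mix three different measures: percentage-point leads, ratios, and relative improvements. The short sketch below recomputes them from the scorecard figures so the distinction is explicit. The dictionary layout and function names are illustrative choices, not part of any benchmark tooling.

```python
# Scores (percent) copied from the scorecard tables in this article.
SCORES = {
    "Terminal-Bench 2.0": {"GPT-5.5": 82.7, "Claude Opus 4.7": 69.4, "Gemini 3.1 Pro": 68.5},
    "FrontierMath T4": {"GPT-5.5 Pro": 39.6, "Claude Opus 4.7": 22.9, "Gemini 3.1 Pro": 16.7},
    "MRCR 512K-1M": {"GPT-5.5": 74.0, "GPT-5.4": 36.6, "Claude Opus 4.7": 32.2},
    "Graphwalks BFS": {"GPT-5.5": 45.4, "GPT-5.4": 9.4},
}


def point_lead(bench: str, a: str, b: str) -> float:
    """Lead of model a over model b in percentage points."""
    return round(SCORES[bench][a] - SCORES[bench][b], 1)


def ratio(bench: str, a: str, b: str) -> float:
    """Score of model a as a multiple of model b's score."""
    return round(SCORES[bench][a] / SCORES[bench][b], 2)


def relative_gain_pct(bench: str, new: str, old: str) -> float:
    """Relative improvement of new over old, in percent of the old score."""
    return round((SCORES[bench][new] / SCORES[bench][old] - 1) * 100, 1)


print(point_lead("Terminal-Bench 2.0", "GPT-5.5", "Claude Opus 4.7"))  # 13.3
print(ratio("FrontierMath T4", "GPT-5.5 Pro", "Claude Opus 4.7"))      # 1.73
print(relative_gain_pct("MRCR 512K-1M", "GPT-5.5", "GPT-5.4"))         # 102.2
print(ratio("Graphwalks BFS", "GPT-5.5", "GPT-5.4"))                   # 4.83
```

A 102% relative improvement and a 37.4 point gain describe the same MRCR result, so it is worth checking which measure a headline number uses before comparing across benchmarks.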

Where GPT-5.5 Falls Short

Honest analysis requires acknowledging the gaps.

SWE-bench Pro at 58.6% vs Claude Opus 4.7's 64.3%. Despite the massive SWE-bench Verified improvement, GPT-5.5 gains only 0.9 points on the harder Pro variant. Claude Opus 4.7 leads by 5.7 points. Claude Mythos Preview, Anthropic's restricted frontier model, scores 77.8%. OpenAI has noted that some competitors show signs of memorization on SWE-bench Pro, though it has not named specific models. The gap is real and meaningful for teams building agents that handle complex multi-file codebases.

Hallucination rate at 86%. This is the most serious concern. Artificial Analysis, an independent evaluation platform, measured an 86% hallucination rate for GPT-5.5 on its test suite, against 36% for Claude Opus 4.7 and 50% for Gemini 3.1 Pro. GPT-5.5 knows more than any competitor, but when it does not know, it fabricates answers with high confidence rather than admitting uncertainty.

This is a structural risk, not a minor issue. For legal research, medical analysis, financial reporting, or any application where factual precision is non-negotiable, an 86% hallucination rate is a deployment blocker. Teams using GPT-5.5 for knowledge work need verification pipelines that treat every factual claim as suspect until independently confirmed.

16K-64K context regression. In the 16K-64K range, GPT-5.5 scores approximately 91% on MRCR v2, slightly below GPT-5.4's approximately 93%. The model was optimized for the extremes: very short contexts (4K-8K at 98.1%) and very long contexts (512K-1M at 74.0%). The mid-range sacrificed a small amount of performance. For most applications this regression is invisible, but it is worth noting for teams with workloads concentrated in this range.

HLE (no tools) at 41.4% vs Claude Opus 4.7's 46.9%. Humanity's Last Exam (HLE), in its no-tools setting, tests reasoning without external tools. Claude Opus 4.7 leads by 5.5 points. This suggests that for pure reasoning tasks without tool access, Claude still has an edge.

The Architecture Shift: Why This Is Not Just Another Post-Training Update

GPT-5.0 through GPT-5.4 shared the same pretraining foundation. Each version improved through post-training: RLHF, instruction tuning, distillation, and reasoning optimization. This is the standard playbook for incremental model releases.

GPT-5.5 broke from that pattern. It is the first fully retrained base model since GPT-4.5, with agent-native training objectives embedded at the pretraining layer rather than added afterward.

The practical implication: post-training can optimize what a model already knows, but it cannot teach fundamentally new capabilities. The 102% relative improvement over GPT-5.4 in long-context retrieval at 512K-1M, the 7.6 point gain over GPT-5.4 on Terminal-Bench 2.0 (a 13.3 point lead over Claude Opus 4.7), and GPT-5.5 Pro's 1.73x advantage over Opus 4.7 on FrontierMath Tier 4 are the kinds of gains that suggest the model learned different patterns during pretraining. On this reading, more RLHF on a GPT-5.4 base would be unlikely to produce these results.

OpenAI also claims GPT-5.5 was trained on NVIDIA GB200/GB300 NVL72 systems, which represents a significant hardware upgrade from the training infrastructure used for GPT-5.0. The combination of new hardware, new pretraining data, and agent-native objectives explains why GPT-5.5 feels like a generational shift rather than an incremental release.

Long Context: The 1M Window Is Finally Real

OpenAI has offered 1 million token context windows since GPT-5.4. The problem was never the window size. It was the retrieval quality within that window.

GPT-5.4's MRCR score at 512K-1M was 36.6%. That means if you buried a specific fact in the middle of a 700K token document, GPT-5.4 would find it roughly one-third of the time. The window was technically open. It was not practically useful for retrieval-dependent tasks.

GPT-5.5 raises that to 74.0%. At 128K-256K, the score is 87.5%. These are the first numbers from OpenAI that make long-context retrieval genuinely viable for production use.

The Graphwalks BFS result is equally telling: 45.4% vs GPT-5.4's 9.4%. This benchmark tests whether the model can follow references and relationships across a large document structure, similar to navigating a codebase or a research paper with extensive cross-references. A nearly 5x improvement suggests GPT-5.5 can actually use the structure of long documents rather than just processing them as flat text.

For practical applications: legal document review, codebase analysis across entire repositories, research synthesis from hundreds of papers, and enterprise knowledge base queries are all use cases that were theoretically possible with GPT-5.4 but unreliable in practice. GPT-5.5 makes them viable.

The one caveat: the 16K-64K regression. If your workload is concentrated in this range, test carefully. The optimization for extremes came at a small cost in the middle.
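One way to "test carefully" is a small retrieval probe run at the context sizes your workload actually uses. The sketch below is a simplified needle-in-a-haystack check, not MRCR v2 itself. It counts words as a rough proxy for tokens, and `ask_model` is a hypothetical callable you would replace with your own API call. The `fake_model` stand-in exists only so the example runs offline.

```python
import random
from typing import Callable

FILLER = "The quick brown fox jumps over the lazy dog. "


def build_haystack(needle: str, approx_words: int, depth: float) -> str:
    """Filler text of roughly approx_words words with the needle inserted
    at a relative depth between 0.0 (start) and 1.0 (end)."""
    sentences = [FILLER] * max(1, approx_words // len(FILLER.split()))
    position = int(len(sentences) * depth)
    sentences.insert(position, needle + " ")
    return "".join(sentences)


def retrieval_accuracy(ask_model: Callable[[str], str], sizes: list[int],
                       trials: int = 5, seed: int = 0) -> dict[int, float]:
    """Fraction of trials per size in which the reply contains the hidden code."""
    rng = random.Random(seed)
    results: dict[int, float] = {}
    for size in sizes:
        hits = 0
        for _ in range(trials):
            code = str(rng.randint(100_000, 999_999))
            needle = f"The vault code is {code}."
            prompt = build_haystack(needle, size, rng.random())
            prompt += "\nWhat is the vault code?"
            if code in ask_model(prompt):
                hits += 1
        results[size] = hits / trials
    return results


def fake_model(prompt: str) -> str:
    """Offline stand-in for a real model call: it just searches the prompt."""
    marker = "The vault code is "
    start = prompt.find(marker)
    if start < 0:
        return "unknown"
    return prompt[start + len(marker): start + len(marker) + 6]


print(retrieval_accuracy(fake_model, sizes=[16_000, 32_000, 64_000]))
```

Running the same probe against both GPT-5.4 and GPT-5.5 on your own documents, rather than filler text, would give a more realistic picture of whether the mid-range regression affects you.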

GPT-5.5 vs Claude Opus 4.7: The Decision Matrix

Criterion	GPT-5.5	Claude Opus 4.7	Winner
SWE-bench Verified	88.7%	~82%	GPT-5.5
SWE-bench Pro	58.6%	64.3%	Opus 4.7
Terminal-Bench 2.0	82.7%	69.4%	GPT-5.5
FrontierMath T4	35.4%	22.9%	GPT-5.5
GDPval (44 occupations)	84.9%	80.3%	GPT-5.5
MCP-Atlas	75.3%	79.1%	Opus 4.7
HLE (no tools)	41.4%	46.9%	Opus 4.7
OSWorld-Verified	78.7%	78.0%	GPT-5.5
Hallucination rate	86%	36%	Opus 4.7
Output price ($/1M tokens)	$30	$25	Opus 4.7
Context window (input / output)	1M / 128K	200K / 128K	GPT-5.5
LMSYS Chatbot Arena	—	1504 Elo (#1)	Opus 4.7

Choose GPT-5.5 when: Your primary use case is terminal-based agent work, long-context document analysis, mathematical reasoning, or autonomous coding workflows where SWE-bench Verified is your proxy for production performance. The 1M context window and Terminal-Bench dominance make it the clear choice for agentic applications that span large codebases or documents.

Choose Claude Opus 4.7 when: You need the lowest hallucination rate for knowledge work, you are building MCP-based tool integrations, your tasks require complex multi-file codebase understanding (SWE-bench Pro), or you prioritize human preference alignment (LMSYS #1 ranking). The 36% hallucination rate vs GPT-5.5's 86% is a decisive factor for applications where factual precision matters more than raw capability.

The Terminal-Bench gap is GPT-5.5's most significant advantage: 13.3 points is not marginal. For teams building CLI-based agents, this is the differentiator. The SWE-bench Pro gap is Opus 4.7's most significant advantage: 5.7 points on the hardest coding benchmark means Claude still leads on the most complex engineering tasks.
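The matrix above can be turned into a rough, workload-specific tally. The sketch below is one simple way to do that, assuming you assign your own weight to each criterion: it sums the weights of the criteria each model wins. The weights in the usage example are hypothetical, the approximate "~82%" figure is entered as 82.0, and a win tally ignores the size of each margin, so treat the result as a starting point rather than a verdict.

```python
# (criterion, GPT-5.5, Claude Opus 4.7, higher_is_better), from the matrix above.
# The Opus 4.7 SWE-bench Verified figure is approximate ("~82%") in the source.
MATRIX = [
    ("SWE-bench Verified", 88.7, 82.0, True),
    ("SWE-bench Pro", 58.6, 64.3, True),
    ("Terminal-Bench 2.0", 82.7, 69.4, True),
    ("FrontierMath T4", 35.4, 22.9, True),
    ("GDPval", 84.9, 80.3, True),
    ("MCP-Atlas", 75.3, 79.1, True),
    ("HLE (no tools)", 41.4, 46.9, True),
    ("OSWorld-Verified", 78.7, 78.0, True),
    ("Hallucination rate", 86.0, 36.0, False),
    ("Output price", 30.0, 25.0, False),
]


def weighted_pick(weights: dict[str, float]) -> tuple[str, float, float]:
    """Sum the weights of the criteria each model wins.

    Criteria missing from weights count as 0. Returns
    (winner, gpt_total, claude_total).
    """
    gpt_total = 0.0
    claude_total = 0.0
    for name, gpt_score, claude_score, higher_is_better in MATRIX:
        weight = weights.get(name, 0.0)
        gpt_wins = (gpt_score > claude_score) == higher_is_better
        if gpt_wins:
            gpt_total += weight
        else:
            claude_total += weight
    if gpt_total > claude_total:
        winner = "GPT-5.5"
    elif claude_total > gpt_total:
        winner = "Claude Opus 4.7"
    else:
        winner = "tie"
    return winner, gpt_total, claude_total


# Hypothetical weights for a team building CLI agents.
cli_team = {"Terminal-Bench 2.0": 3, "SWE-bench Verified": 2, "Hallucination rate": 1}
# Hypothetical weights for a knowledge-work deployment.
knowledge_team = {"Hallucination rate": 3, "GDPval": 1, "MCP-Atlas": 1}

print(weighted_pick(cli_team))
print(weighted_pick(knowledge_team))
```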

GPT-5.5 vs Gemini 3.1 Pro

Criterion	GPT-5.5	Gemini 3.1 Pro
SWE-bench Verified	88.7%	78.8%
Terminal-Bench 2.0	82.7%	68.5%
FrontierMath T4	35.4%	16.7%
MRCR 512K-1M	74.0%	—
Output price ($/1M tokens)	$30	$12
Context window (input)	1M	2M
Multimodal (audio/video)	No	Yes (native)

Gemini 3.1 Pro is significantly cheaper ($12 vs $30 per million output tokens) and offers a 2M token context window with native audio and video support. GPT-5.5 dominates on coding, agentic, and reasoning benchmarks. The choice depends on whether your application needs Gemini's multimodal capabilities or GPT-5.5's agentic performance.

Pricing and Token Efficiency

The headline price increase is 2x. On output tokens, the practical cost increase is closer to 20%. Input tokens have no efficiency offset and still cost double.

OpenAI's claim: GPT-5.5 uses approximately 40% fewer output tokens for equivalent tasks. If a GPT-5.4 task required 1,000 output tokens at $15 per million, the cost was $0.015. The same task on GPT-5.5 requires 600 tokens at $30 per million, costing $0.018. The increase is 20%, not 100%.
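The arithmetic above covers output tokens only. The sketch below extends it to include input tokens, using the per-million prices quoted in this article. The 40% output reduction is OpenAI's claim and is treated here as an adjustable assumption, and the 10,000-token input size in the second example is an arbitrary illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD, as quoted in this article."""
    input_per_m: float
    output_per_m: float


GPT_5_4 = Pricing(input_per_m=2.50, output_per_m=15.00)
GPT_5_5 = Pricing(input_per_m=5.00, output_per_m=30.00)


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request."""
    total = input_tokens * pricing.input_per_m + output_tokens * pricing.output_per_m
    return total / 1_000_000


def cost_increase(input_tokens: int, old_output_tokens: int,
                  output_reduction: float = 0.40) -> float:
    """Fractional cost change when moving one task from GPT-5.4 to GPT-5.5.

    output_reduction is the claimed share of output tokens saved. It is an
    assumption to be measured on your own workload, not a guaranteed figure.
    """
    new_output_tokens = round(old_output_tokens * (1 - output_reduction))
    old = task_cost(GPT_5_4, input_tokens, old_output_tokens)
    new = task_cost(GPT_5_5, input_tokens, new_output_tokens)
    return new / old - 1


output_only = cost_increase(input_tokens=0, old_output_tokens=1_000)
input_heavy = cost_increase(input_tokens=10_000, old_output_tokens=1_000)
print(f"output only: {output_only:+.0%}")           # +20%
print(f"10K input, 1K output: {input_heavy:+.0%}")  # +70%
```

The 20% figure holds when output tokens dominate the bill. For prompts with large inputs, such as long-context retrieval, the effective increase moves back toward 2x because input pricing doubled with no matching token savings.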

This efficiency gain comes from better reasoning compression: GPT-5.5 produces more concise, structured outputs that require fewer tokens to express the same content. For coding tasks specifically, the model generates more targeted code with less explanatory fluff.

However, this efficiency claim needs validation in your specific use case. Token efficiency varies by task type. For tasks that require extensive reasoning traces or verbose explanations, the savings may be smaller. For tasks where conciseness is valued, the savings may be larger.

The Pro variant at $30/$180 is priced for research-grade workloads where the FrontierMath T4 advantage (39.6% vs 35.4% for standard) justifies the premium. For most production applications, the standard variant is the rational choice.

What the System Card Reveals

OpenAI published a System Card alongside GPT-5.5, and it is worth reading in full. The key findings:

Preparedness Framework ratings: High across all categories. Biological, chemical, and cybersecurity capabilities all rate "High," which is below the "Critical" threshold that would trigger additional deployment restrictions. GPT-5.5 does not possess the capability to develop functional zero-day vulnerabilities without human intervention.

Cybersecurity improvement over GPT-5.4. CyberGym scores improved from 79.0% to 81.8%. Internal CTF tasks improved from 83.7% to 88.1%. The model is better at security tasks than its predecessor, but not dangerously so.

~200 early access partners provided real-world feedback before general release. This is a larger early access program than OpenAI has run for previous models, suggesting more cautious deployment practices.

HealthBench evaluation tested medical performance and safety. The results are not publicly disclosed in detail, but the inclusion of this benchmark signals that OpenAI is taking medical use cases seriously in their safety evaluation.

The System Card is notable for what it does not claim. There is no assertion that GPT-5.5 represents a qualitative shift in dangerous capabilities. The ratings are consistent with incremental improvement within the "High" band, not a leap to "Critical."

FAQ

What is GPT-5.5?

GPT-5.5 is OpenAI's latest foundation model, released April 23, 2026. It is the first fully retrained base model since GPT-4.5, with agent-native training objectives embedded at the pretraining layer. It comes in three variants: Standard, Thinking (with visible chain-of-thought), and Pro (for research-grade reasoning).

How much does GPT-5.5 cost?

Standard API: $5 per million input tokens, $30 per million output tokens. Batch/Flex: $2.50/$15. Pro: $30/$180. ChatGPT Plus ($20/month) includes 160 messages per 3 hours plus 3,000 Thinking messages per week. ChatGPT Pro ($200/month) offers unlimited messages. A new $100 tier offers 5x Codex usage.

How does GPT-5.5 compare to Claude Opus 4.7?

GPT-5.5 leads on Terminal-Bench 2.0 (+13.3 points), SWE-bench Verified (+6.7 points), FrontierMath T4 (+12.5 points), and long-context retrieval (+41.8 points at 512K-1M). Claude Opus 4.7 leads on SWE-bench Pro (+5.7 points), MCP-Atlas (+3.8 points), HLE (+5.5 points), hallucination rate (36% vs 86%), and output price ($25 vs $30). The choice depends on whether your priority is agentic performance (GPT-5.5) or factual precision (Opus 4.7).

Is GPT-5.5 worth the price increase?

If your use case benefits from the capabilities where GPT-5.5 dominates (terminal-based agents, long-context analysis, or mathematical reasoning), the roughly 20% effective increase in output cost after token efficiency is justified by the capability gains. If your use case is standard text generation or simple coding assistance, GPT-5.4 at half the price is the rational choice.

What about the hallucination rate?

The 86% hallucination rate from Artificial Analysis is a serious concern. It means GPT-5.5 fabricates answers confidently when uncertain, rather than admitting ignorance. For applications requiring factual precision (legal, medical, financial), this is a deployment risk that requires verification pipelines. Claude Opus 4.7's 36% hallucination rate is significantly safer for knowledge work.

Does GPT-5.5 support audio and video?

No. GPT-5.5 accepts text and image input and produces text output only. Audio and video capabilities exist in the broader ChatGPT product but are not native to the model. Gemini 3.1 Pro is the only major frontier model with native audio and video support.

What is the context window?

1 million tokens for input, 128K for output. The Codex integration uses 400K. MRCR scores confirm that retrieval quality at 512K-1M is now practically usable (74.0%), compared to GPT-5.4's 36.6%.

Should I upgrade from GPT-5.4?

Yes, if you are building terminal-based agents, analyzing long documents, or doing mathematical reasoning. The Terminal-Bench, long-context, and FrontierMath improvements are generational. No, if you are doing simple text generation or basic coding where GPT-5.4 is already sufficient. The price increase is modest after token efficiency, but it is still an increase.

How does GPT-5.5 integrate with Codex?

GPT-5.5 powers the next generation of Codex, OpenAI's coding agent. Approximately 4 million developers use Codex weekly. GPT-5.5 uses approximately 40% fewer tokens for equivalent Codex tasks. The integration supports the full agent loop: plan, edit code, run tools, observe results, repair failures, update docs, repeat.
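For readers unfamiliar with the pattern, the plan, edit, run, observe, repair cycle can be sketched generically as below. This is an illustration of the loop shape only, not Codex's implementation. Every name here (`run_agent_loop`, `StepResult`, the toy callables) is hypothetical, and the toy functions stand in for a model call, a file edit, and a test run.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StepResult:
    ok: bool   # did the checks (tests, linters, builds) pass?
    log: str   # failure output fed back into the next plan


@dataclass
class AgentRun:
    history: list[str] = field(default_factory=list)
    succeeded: bool = False


def run_agent_loop(plan: Callable[[str], str],
                   apply_edit: Callable[[str], None],
                   run_checks: Callable[[], StepResult],
                   max_iterations: int = 5) -> AgentRun:
    """Repeat plan -> edit -> run -> observe until checks pass or budget runs out."""
    run = AgentRun()
    feedback = ""
    for iteration in range(1, max_iterations + 1):
        step = plan(feedback)       # plan, or re-plan from the last failure
        apply_edit(step)            # edit code
        result = run_checks()       # run tools and observe results
        status = "pass" if result.ok else "fail"
        run.history.append(f"{iteration}: {step} -> {status}")
        if result.ok:
            run.succeeded = True
            break
        feedback = result.log       # repair on the next iteration
    return run


# Toy stand-ins so the sketch runs without a model or a repository.
state = {"edits": 0}


def toy_plan(feedback: str) -> str:
    return "initial patch" if not feedback else f"fix: {feedback}"


def toy_apply(step: str) -> None:
    state["edits"] += 1


def toy_checks() -> StepResult:
    ok = state["edits"] >= 2
    return StepResult(ok=ok, log="" if ok else "test_parse failed")


run = run_agent_loop(toy_plan, toy_apply, toy_checks)
print(run.succeeded, run.history)
```

The `max_iterations` cap matters in practice: an agent that never converges should stop and hand off rather than keep spending tokens.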

Recommendation

Upgrade now if: You are building terminal-based coding agents, analyzing documents beyond 256K tokens, or doing mathematical reasoning at the frontier. The Terminal-Bench 2.0 score (82.7%), the 1M context retrieval quality (74.0%), and the FrontierMath T4 advantage (35.4% vs 22.9%) collectively represent capabilities that did not exist in any previous OpenAI model. These are not incremental improvements. They are new categories of viable use cases.

Wait if: You are using GPT-5.4 for simple text generation, basic coding assistance, or applications where the 86% hallucination rate is a blocker. GPT-5.4 remains capable for these use cases at half the price. Upgrade when your next project needs something GPT-5.4 cannot handle.

Architecturally: Build verification pipelines for any GPT-5.5 deployment that handles factual claims. The hallucination rate is not a minor issue. It is a structural characteristic of the model that requires compensating architecture: fact-checking layers, source attribution requirements, and human review for high-stakes outputs. Do not deploy GPT-5.5 for knowledge work without these safeguards.
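The three safeguards named above (fact-checking, source attribution, human review) can be combined into a simple routing gate. The sketch below shows one possible policy under stated assumptions: the `Claim` fields, the `Route` values, and the ordering of the rules are all hypothetical, and the `verified` flag is assumed to come from a separate fact-checking step that is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"


@dataclass(frozen=True)
class Claim:
    text: str
    sources: tuple[str, ...]   # citations attached to the claim
    verified: bool             # outcome of an independent fact-check (not shown)
    high_stakes: bool = False  # e.g. legal, medical, or financial output


def route_claim(claim: Claim) -> Route:
    """Treat every claim as suspect until independently confirmed."""
    if not claim.sources:
        return Route.REJECT          # source attribution requirement
    if not claim.verified:
        return Route.HUMAN_REVIEW    # fact-check did not confirm the claim
    if claim.high_stakes:
        return Route.HUMAN_REVIEW    # confirmed, but still needs sign-off
    return Route.AUTO_APPROVE


claims = [
    Claim("Revenue grew 12% year over year.", ("10-K filing",), verified=True),
    Claim("The statute was amended in 2019.", ("case summary",), verified=True,
          high_stakes=True),
    Claim("The drug has no known interactions.", (), verified=False,
          high_stakes=True),
]
for claim in claims:
    print(route_claim(claim).value, "-", claim.text)
```

The example claims are invented placeholders. In a real deployment the thresholds and routes would depend on your domain's tolerance for error.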

The benchmark data makes the case clearly. GPT-5.5 is a generational improvement in the specific capabilities that matter for agentic AI: terminal-based workflows, long-context retrieval, and mathematical reasoning. It is also a model that requires more careful deployment than its predecessor due to the hallucination rate. The question is not whether GPT-5.5 is better than GPT-5.4. It is whether your application can benefit from its strengths while mitigating its weaknesses.

For additional context on Claude Opus 4.7's capabilities and how it compares, see my Claude Opus 4.7 Deep Dive. For the mid-tier model that may be the better choice for cost-sensitive workloads, see my Claude Sonnet 4.6 analysis.
