"Claude Opus 4.7 Deep Dive: How Anthropic's Latest Flagship Outperforms in Coding, Agents, and Vision"

What Is Claude Opus 4.7?

Anthropic's Claude Opus 4.7 landed on April 16, 2026, and it arrives with a positioning statement that should make every engineering team pay attention: it is the most capable General Availability model Anthropic has released, sitting one tier below the restricted Mythos Preview access that requires Glasswing partner status.

Let that sink in for a moment. A GA model that Anthropic is willing to call its most capable GA release is not a incremental improvement. It is a signal about where the actual frontier has moved.

The pricing tells the same story. Opus 4.7 maintains the $5/$25 per million tokens input/output structure that Opus 4.6 established. No price change despite substantial capability gains. If you were on the fence about Opus, the cost-to-capability ratio just shifted significantly in favor of deployment.

The technical foundation is unchanged in some respects: 1 million token context window, 128K maximum output, training cutoff in January 2026. But the changes that actually matter are not in the headline specs. They are in four behavioral dimensions that reshape how Opus 4.7 approaches real engineering work.

Before getting into those details, here is the benchmark scorecard that establishes the foundation for everything else.

The Complete Benchmark Scorecard

These numbers come from Anthropic's official announcement. Every figure is verified. Some of them are startling.

Coding Benchmarks

Benchmark	Opus 4.7	Opus 4.6	Delta	Notes
SWE-bench Verified	87.6%	80.8%	+6.8	Most critical coding benchmark
SWE-bench Pro	64.3%	53.4%	+10.9	Full agentic environment
Terminal-Bench 2.0	69.4%	65.4%	+4.0	GPT-5.4 leads at 75.1%

Agentic Tool Use Benchmarks

Benchmark	Opus 4.7	Opus 4.6	Delta	Notes
MCP-Atlas	77.3%	62.7%	+14.6	Largest single gain
OSWorld-Verified	78.0%	72.7%	+5.3	Computer use environment
Finance Agent v1.1	64.4%	60.7%	+3.7	Specialized domain agent
HLE (w/ tools)	54.7%	53.1%	+1.6	Incremental
BrowseComp	79.3%	83.7%	-4.4	Regression

Vision Benchmarks

Benchmark	Opus 4.7	Opus 4.6	Delta	Notes
CharXiv-R (w/ tools)	91.0%	77.4%	+13.6	Document understanding
CharXiv-R (no tools)	82.1%	68.7%	+13.4	Pure visual reasoning
XBOW Visual Acuity	98.5%	54.5%	+44.0	Most dramatic jump

Reasoning Benchmarks

Benchmark	Opus 4.7	Opus 4.6	Delta	Notes
GPQA Diamond	94.2%	91.3%	+2.9	Expert-level reasoning

Cyber Security

Benchmark	Opus 4.7	Opus 4.6	Delta	Notes
CyberGym	73.1%	73.8%	-0.7	Negligible

Where Opus 4.7 Dominates

The three numbers that should dominate your planning conversations are these:

SWE-bench Pro at +10.9 points. SWE-bench Pro is not the toy version. It runs Claude in a full agentic environment where the model must plan, execute, use tools, handle errors, and verify results across multi-step software engineering tasks. A 10.9 point jump in this environment means Opus 4.7 is substantially better at being an autonomous coding agent. If you are building Claude-powered development tools, this is the number that justifies upgrading.

MCP-Atlas at +14.6 points. This is the largest single benchmark jump in the release. MCP-Atlas tests model capability with Model Context Protocol tools, which is the tool-calling standard that the entire Claude ecosystem uses. The jump from 62.7% to 77.3% means Opus 4.7 handles tool-based workflows dramatically better than its predecessor. If your Claude integration uses tools, MCP performance matters directly.

XBOW Visual Acuity at +44.0 points. This is not a typo. XBOW jumped from 54.5% to 98.5%, a 44 point improvement. XBOW tests visual acuity using board and card game screenshots. The jump signals that something fundamental changed in how Opus 4.7 processes visual information. This has direct implications for any Claude Code Computer Use work, any visual document understanding pipeline, and any workflow that involves screenshots, diagrams, or visual artifacts.

Where Opus 4.7 Regresses

One benchmark went backward, and it deserves an honest explanation rather than dismissal.

BrowseComp dropped 4.4 points, from 83.7% to 79.3%. BrowseComp tests web search and information retrieval performance. The drop is real.

Why did this happen? The most plausible explanation is that Opus 4.7's improved instruction-following behavior makes it less likely to shortcut through cached or guessed information. Where Opus 4.6 might have extrapolated plausibly from limited context, Opus 4.7 is more literal about verifying through actual search. On benchmarks that reward confident incorrect answers, literal verification can look like regression.

The mitigation is straightforward: if BrowseComp performance matters for your specific application, test Opus 4.7 against your actual use case before migrating. The regression on a synthetic benchmark may not translate to your production environment. Many real-world search workloads will benefit from the improved instruction-following behavior.

The Four Changes That Actually Matter

Benchmark numbers tell you what changed. Behavioral changes tell you why.

Self-Verification Behavior

The most practically significant change in Opus 4.7 is what Anthropic describes as enhanced self-verification. The model now implements a Plan → Execute → Verify → Report loop that was less consistent in Opus 4.6.

Concretely, this means that when Opus 4.7 produces code, it is more likely to check that code against the requirements it was given before presenting the result. When it uses tools, it is more likely to validate tool outputs before passing them to the next step. When it completes a multi-step reasoning task, it is more likely to sanity-check the final conclusion against the problem statement.

This is not dramatic behavior change visible in single prompts. It shows up in multi-step agentic workflows where Opus 4.6 would occasionally hand off partially-complete work or miss error conditions. Opus 4.7 closes those gaps more consistently.

For Claude Code users specifically, this means fewer instances where the agent produces something that looks reasonable but misses a key requirement. The self-verification loop does not eliminate all errors, but it reduces the category of errors that come from insufficient cross-checking.

Literal Instruction Following

Opus 4.7 takes instructions more literally than Opus 4.6. This sounds like an obvious improvement, but it has a nuanced implication: if you gave Opus 4.6 instructions that it generalized beyond their literal meaning, Opus 4.7 may not apply that same generalization.

For example, if a system prompt said "summarize the key points" and Opus 4.6 sometimes anticipated what you meant by "key" based on context, Opus 4.7 is more likely to apply a more literal interpretation of "key points." The model is more predictably obedient but less predictably mind-reading.

This is a significant consideration for prompt engineers who have built complex instruction sets. Audit your system prompts for implicit assumptions about how the model will generalize your instructions. Literal instruction following rewards precise language and penalizes clever shorthand.

High-Resolution Vision

The vision system in Opus 4.7 received a substantial hardware-level upgrade. The maximum resolution increased from 1,568 pixels to 2,576 pixels, which works out to approximately 3.3 times more pixels in the same input.

The practical implication is not just that Opus 4.7 can see bigger images. It is that the pixel-to-coordinate mapping is now 1:1, meaning a pixel in the input maps directly to a pixel in the model's internal representation. This eliminates the aliasing and information loss that occurred when high-resolution images were downscaled to fit the previous limit.

For Claude Code Computer Use workflows, this matters directly. Screenshots of modern high-DPI displays that would have lost detail in the previous model are now processed at full resolution. For document understanding tasks involving charts, diagrams, or dense text layouts, the 1:1 mapping preserves information that was previously lost in downscaling.

The CharXiv-R benchmark gains (13.4 to 13.6 points depending on tool configuration) confirm that document understanding improved substantially. The XBOW gain confirms that visual acuity for game states and similar detailed visual parsing improved even more dramatically.

xhigh Effort Level and Task Budgets

A new concept appeared in the Opus 4.7 documentation: the xhigh effort level, which is now the default for Claude Code. This sits above the previous high effort level and comes with a token budget advisory system.

The practical meaning: when you give Claude Code a task, the model now has an explicit budget for how much work it will invest in that task, and it manages that budget against the complexity it detects. Tasks that look simple get efficient treatment. Tasks that look complex get more thorough investigation.

Claude Code's Auto mode, which was introduced alongside Opus 4.7, takes this further by automatically selecting effort levels based on task complexity. This is the direction that agent systems are heading: not just executing instructions but reasoning about the appropriate level of investment for each instruction.

The /ultrareview command, also new in Claude Code, leverages the xhigh effort level to conduct more thorough code reviews. If you are using Claude Code for development workflows, the combination of xhigh default and /ultrareview means your baseline code review quality has improved without any configuration changes.

The Tokenizer Catch: Same Price, Different Bill

Here is the detail that will catch many teams off guard: Opus 4.7 uses a different tokenizer than Opus 4.6, and the tokenization ratios have changed.

The headline price per million tokens is unchanged. But depending on what you are processing, your actual token consumption will be different. Here are the practical ratios:

Content Type	Token Multiplier vs Opus 4.6	Practical Impact
English prose	~1.05x	Modest increase
Code	~1.10x	Meaningful for code-heavy workflows
CJK (Chinese, Japanese, Korean)	~1.30x	Significant for multilingual content
JSON / structured data	~1.20x	Important for API-heavy applications

For a typical English prose workload, the increase is about 5%. For code-heavy development work, expect roughly 10% more tokens for the same content. If your application processes significant amounts of JSON or structured data, plan for a 20% increase.

This matters for cost planning. If you were budgeting based on Opus 4.6 token counts, your actual spend with Opus 4.7 will be higher. The price per token is the same, but the number of tokens per unit of content has increased.

The CJK impact is particularly important for teams building multilingual applications. Chinese text in particular tokenizes less efficiently, resulting in approximately 30% more tokens for the same content. This is not a bug. It reflects the fact that the Opus 4.7 tokenizer was trained on a broader corpus that includes more CJK characters as distinct tokens. But it will surprise teams that assumed token counts would be roughly equivalent across Claude versions.

Breaking API Changes

Opus 4.7 introduces API changes that will break existing code. Here is what you need to know.

Temperature, top_p, and top_k: 400 Error

If you are passing temperature, top_p, or top_k as parameters to the Messages API, you will now receive a 400 error. These parameters are no longer accepted.

The replacement is a different sampling mechanism that Anthropic has not fully documented yet. If you have code that sets temperature explicitly, you need to remove those parameters before migrating to Opus 4.7.

This is an aggressive breaking change. Most codebases that do anything with sampling have at least one place where temperature is set. Audit your API call sites before upgrading.

Adaptive Thinking Replaces Manual Budgets

The previous thinking budget parameters (max_tokens, thinking) are deprecated in favor of adaptive thinking. Instead of specifying how much thinking the model should do, you now specify what you want the model to think about, and the model allocates its own thinking resources based on task complexity.

This is a conceptual shift from resource specification to intent specification. You describe the problem; the model decides how hard to think about it.

For most use cases, this will produce better results with less configuration. For cases where you need predictable thinking behavior (for example, in latency-sensitive applications where you need consistent response times), you may need to add explicit complexity indicators in your prompts to guide the model's thinking allocation.

Thinking Hidden by Default

In Opus 4.6, the thinking block was visible in API responses by default. In Opus 4.7, thinking is hidden by default. If you want to see the model's reasoning trace, you need to request it explicitly.

This is the correct default for production applications where you do not want to expose internal reasoning to end users. But it means that any code that relies on parsing the thinking block from responses will break. Check your response parsing logic.

Prefill Removed

The prefill mechanism that allowed you to seed the model's response with partial text has been removed. If your application uses prefill to guide response format or to prime the model with partial answers, that code will need to be rewritten.

Migration Guide: What Breaks and How to Fix It

Here is the actionable checklist for migrating from Opus 4.6 to Opus 4.7.

Issue	Detection	Fix
Temperature/top_p/top_k parameters	400 error on API calls	Remove all sampling parameters from API calls
Thinking block parsing	Empty responses in code that reads thinking	Add `thinking: { type: "enabled" }` to request
Prefill usage	400 error on prefill requests	Remove prefill from request body
Token count estimates	Budget miscalculations	Recalculate token budgets with new ratios (prose ~1.05x, code ~1.10x, JSON ~1.20x, CJK ~1.30x)
Implicit instruction generalization	Behavioral changes in edge cases	Audit system prompts for non-literal instruction patterns
BrowseComp dependency	Potential performance regression	Test against your specific search use case before migrating

Test every breaking change in a staging environment before pushing to production. The token count issue in particular will not show up as an error. It will show up as unexpectedly higher bills.

Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: When to Choose What

Here is the decision matrix based on published benchmark data and what we know about each model's positioning.

Criterion	Opus 4.7	GPT-5.4	Gemini 3.1 Pro
SWE-bench Verified	87.6%	Unknown	Unknown
SWE-bench Pro	64.3%	Unknown	Unknown
Terminal-Bench 2.0	69.4%	75.1%	Unknown
MCP-Atlas	77.3%	Unknown	Unknown
OSWorld	78.0%	Unknown	Unknown
Vision	Exceptional (XBOW 98.5%)	Strong	Strong
Pricing	$5/$25	Higher	Lower
Context	1M	1M	1M
Tool use	Excellent (+14.6 MCP)	Good	Good
API flexibility	High	Moderate	High
Claude Code native	Yes	No	No

Choose Opus 4.7 when: You are building Claude Code integrations, you prioritize tool-use performance, you need the best coding agent benchmark scores, you want Claude Code's native experience, or you are already all-in on the Anthropic ecosystem.

Choose GPT-5.4 when: Terminal-Bench 2.0 performance is your primary benchmark (where GPT-5.4 leads at 75.1% vs Opus 4.7's 69.4%), you are building in the Microsoft ecosystem with deep Azure OpenAI integration, or your team has existing GPT-5.x prompt libraries that would be costly to migrate.

Choose Gemini 3.1 Pro when: Cost is the primary constraint, you are building on Google Cloud with Vertex AI requirements, or you need the Google ecosystem integrations for Workspace, Search, or other Google services.

The Terminal-Bench gap is real and worth acknowledging directly. GPT-5.4 leads Opus 4.7 by 5.7 points on Terminal-Bench 2.0. If your primary use case is terminal-based agent work and that benchmark is your proxy for production performance, GPT-5.4 has the edge there. For everything else that matters in practical agent development, Opus 4.7's MCP-Atlas lead (+14.6) and SWE-bench Pro lead (+10.9) are more representative of real-world coding agent scenarios.

The Mythos Preview Question: Wait or Deploy Now?

Anthropic has been clear that Mythos Preview is their actual frontier model, and it scores higher than Opus 4.7 on every benchmark where we have comparison data. But Mythos Preview is restricted to Glasswing partners, which is a small set of approved enterprise customers.

The practical question for most teams is not "Mythos or Opus 4.7?" It is "Opus 4.7 or the competition?"

The honest answer: if you need the best model Anthropic offers and you are not a Glasswing partner, Opus 4.7 is what you deploy. The benchmark gains over Opus 4.6 are substantial enough that upgrading from 4.6 to 4.7 is clearly worthwhile. The gap between Opus 4.7 and Mythos Preview is a future problem, not a present one.

If you are on Opus 4.6 today, the upgrade calculus is simple. The SWE-bench Pro improvement (+10.9 points) alone justifies migration for any team building coding agents. The MCP-Atlas improvement (+14.6 points) confirms that tool-based workflows have gotten dramatically better. The tokenization change is a modest cost increase that most use cases will absorb without complaint.

Multi-Model Architecture Patterns

The emerging pattern in production Claude systems is not "pick one model." It is "route tasks to the right model." Here is the architecture pattern that makes sense based on Opus 4.7's positioning.

Opus 4.7 for planning. Complex multi-step reasoning, architecture decisions, security-sensitive operations, and initial task decomposition benefit from Opus 4.7's self-verification behavior and instruction-following precision. Use Opus 4.7 when you need the model to think carefully before acting.

Sonnet 4.6 for execution. Once a plan exists, Sonnet 4.6 handles the execution work at 1/5th the cost of Opus with near-parity on coding and computer use benchmarks. Sonnet 4.6 scores 79.6% on SWE-bench Verified, 59.1% on Terminal-Bench, 61.3% on MCP-Atlas, 72.5% on OSWorld, and 74.1% on GPQA. For everything that is "execute the plan," Sonnet 4.6 is the rational choice.

Haiku 4.5 for review. High-throughput, low-cost review tasks like initial triage, basic validation, and pattern matching against known patterns work well with Haiku's speed and cost profile. Use Haiku when you need to process many items quickly and can accept lower per-item reasoning depth.

The routing criteria that determine which model handles which step:

Complexity threshold: If the task requires more than N tool-calling steps or involves ambiguous requirements, route to Opus 4.7. If it is straightforward execution, route to Sonnet 4.6.
Cost threshold: If the task volume is high and the per-item value is low, Haiku for triage with escalation criteria to Sonnet or Opus.
Risk threshold: If incorrect output could cause harm (security, financial, medical), route to Opus 4.7 for verification even if Sonnet could handle the primary task.
Latency threshold: If the response must be real-time, Sonnet 4.6 or Haiku 4.5. If async is acceptable, Opus 4.7's thinking time is worth it.

This multi-model routing approach is how production systems achieve both quality and cost efficiency. You do not need to choose between capability and cost when you can route appropriately.

For a deeper analysis of Sonnet 4.6's capabilities and when it is the better choice over Opus, see my Claude Sonnet 4.6 Deep Dive.

What Changed in Claude Code

If you use Claude Code, Opus 4.7 arrives with several changes that affect your daily workflow.

xhigh is now the default effort level. This means Claude Code invests more effort by default in understanding your requirements and verifying its outputs. For most tasks, this is a pure improvement. You get better results without changing how you write prompts.

The /ultrareview command leverages the xhigh effort level for code reviews that go beyond surface-level checking. If you have been using Claude Code for reviews, the quality of feedback should improve noticeably.

Auto mode for Max subscribers automatically selects effort levels based on detected task complexity. This is the direction Claude Code is heading: not just executing instructions but reasoning about how much effort each instruction deserves.

The practical summary: if you were happy with Claude Code on Sonnet 4.6, the upgrade to Opus 4.7 as the default for complex tasks is transparent. You get better results without changing your workflow.

The Design Personality Gotcha

Here is the detail that frontend teams need to know before deploying any Claude-powered feature that involves design output.

Claude Design, released April 17, 2026 (the day after Opus 4.7), uses Opus 4.7 and has a persistent default aesthetic: warm cream backgrounds, serif typography, and terracotta accent colors. This is not configurable at the API level. It is the default design personality that Claude Design produces when you ask it to create visual artifacts.

If you are building a product that uses Claude Design or any Opus 4.7-powered feature that generates design output, the default aesthetic will bleed through unless you are explicit about overriding it. This matters for product teams that have established brand guidelines.

The gotcha is this: the default design personality is persistent across sessions. If your application relies on Claude-generated design artifacts and you do not explicitly specify design constraints, the warm cream/serif/terracotta aesthetic will appear. For some products this is fine. For others it will conflict with brand requirements.

Specify design constraints explicitly in every prompt that involves visual output. Do not rely on the model to infer your aesthetic preferences.

For a full analysis of Anthropic's design product strategy, see my Anthropic Full Product Stack 2026 analysis.

Claude Sonnet 4.6 Cross-Reference: When Sonnet Is the Better Choice

The temptation with a new flagship model is to route everything to it. That is the wrong instinct.

Sonnet 4.6 at $3/$15 per million tokens is 1/5th the cost of Opus 4.7 at $5/$25. For most production workloads, the cost-capability balance favors Sonnet 4.6.

Specifically, Sonnet 4.6 is the better choice when:

Daily driver development. For the routine coding work that makes up most of a developer's day, Sonnet 4.6's 79.6% on SWE-bench Verified is within striking distance of Opus 4.7's 87.6%. The 8-point gap matters for the hardest 10% of tasks, but it does not justify routing all work to Opus.

High-volume tool use. Sonnet 4.6's 61.3% on MCP-Atlas is close enough to Opus 4.7's 77.3% for most tool-use scenarios. The gap is significant in absolute terms, but Sonnet's cost advantage is larger.

Computer use tasks. Sonnet 4.6's 72.5% on OSWorld versus Opus 4.7's 78.0% is a 5.5-point gap. For routine browser automation and document processing, that gap is acceptable given Sonnet's cost advantage.

When you need speed. Sonnet 4.6 has lower latency than Opus 4.7 for equivalent prompt lengths. If your application has real-time requirements, Sonnet is the default.

Reserve Opus 4.7 for the tasks where Sonnet 4.6 genuinely struggles: complex multi-step reasoning, security-sensitive operations, and the hardest coding problems where the self-verification loop makes a measurable difference.

FAQ

How much does Claude Opus 4.7 cost?

Opus 4.7 is priced at $5 per million tokens for input and $25 per million tokens for output through the Anthropic API. It is also available on AWS Bedrock and Google Vertex AI with the same pricing structure. Consumer access is included in Pro ($20/month) and Max ($100-200/month) plans.

What is the context window for Opus 4.7?

Opus 4.7 supports a 1 million token context window, the same as Opus 4.6. The maximum output is 128K tokens per request.

How does Opus 4.7 compare to Opus 4.6?

Opus 4.7 outperforms Opus 4.6 on almost every benchmark. Key gains: SWE-bench Verified +6.8 points (87.6% vs 80.8%), SWE-bench Pro +10.9 points (64.3% vs 53.4%), MCP-Atlas +14.6 points (77.3% vs 62.7%), and XBOW Vision +44 points (98.5% vs 54.5%). The only regression is BrowseComp at -4.4 points.

What is Mythos Preview?

Mythos Preview is Anthropic's actual frontier model that sits above Opus 4.7 in capability. It is restricted to Glasswing partner customers and is not available through standard API access. Published benchmark scores show Mythos Preview outperforming Opus 4.7 on all benchmarks where comparison data exists.

Should I upgrade from Opus 4.6 to Opus 4.7?

Yes, if you use Claude for coding agent work, tool-based workflows, or any vision-related tasks. The SWE-bench Pro gain (+10.9), MCP-Atlas gain (+14.6), and XBOW gain (+44) collectively represent substantial improvements in the capabilities that matter most for production agent systems. The tokenization increase (~5-10% depending on content type) is a modest cost to pay for these gains.

What breaking API changes does Opus 4.7 introduce?

Three significant breaking changes: temperature/top_p/top_k parameters now return 400 errors and must be removed; thinking is hidden by default and must be explicitly requested; prefill has been removed entirely. There are also new tokenization ratios that will increase token counts by 5-30% depending on content type, affecting cost calculations.

Does the BrowseComp regression mean Opus 4.7 is worse at search?

Not necessarily for your use case. The BrowseComp regression (-4.4 points) likely reflects Opus 4.7's more literal instruction following and reduced shortcut-taking on synthetic benchmarks. Real-world search workloads may perform differently. Test against your specific application before assuming the regression applies to your use case.

How does Opus 4.7 handle vision tasks?

Opus 4.7 processes images at up to 2,576 pixels compared to Opus 4.6's 1,568 pixel limit, with 1:1 pixel-to-coordinate mapping that eliminates aliasing. This produces dramatic improvements on vision benchmarks including XBOW (98.5%, up from 54.5%), CharXiv-R with tools (91.0%, up from 77.4%), and CharXiv-R without tools (82.1%, up from 68.7%).

Recommendation

Upgrade now if: You are on Opus 4.6 and building coding agents, tool-based workflows, or any application that uses vision. The benchmark gains are real and substantial. The API migration requires removing sampling parameters, adjusting token budgets, and updating response parsing. All of these are one-time fixes that will pay off immediately.

Wait if: You are on Opus 4.6 and primarily doing simple text generation with no agentic components, no tool use, and no vision. The upgrade gains will be less visible in that context, and the migration effort may not be worth it yet. But watch for the point where your next project needs something Opus 4.6 struggles with, and upgrade at that decision point.

Architecturally: Build multi-model routing into your system from the start. Opus 4.7 for planning and verification, Sonnet 4.6 for execution, Haiku 4.5 for triage. The cost savings from not routing everything to the flagship model are significant, and the capability differences between tiers are narrow enough that routing based on complexity produces better cost-capability ratios than defaulting to the most capable model.

The benchmark data makes the case clearly. Opus 4.7 is not a marginal improvement. It is a substantial step forward in the capabilities that matter for production AI engineering work. The question is not whether to upgrade but how quickly you can migrate safely.

For additional context on Anthropic's full product strategy, see Anthropic's Full Product Stack in 2026. For the Trust But Canary Meta Configuration Safety patterns relevant to deploying these models at scale, see my analysis of Meta's approach to configuration safety.

Share