Administrator
Published on 2026-05-03

"Project Deal: How Anthropic Let Claude Buy, Sell, and Negotiate on Behalf of 69 Employees"

Anthropic ran an experiment in March 2026 involving 69 employees, $100 in starting credits, 186 trades, and four parallel markets operating simultaneously across internal Slack channels. The question was simple: what happens when you give language models purchasing authority, real money at stake, and the ability to negotiate autonomously?

The results, published April 24, 2026 as Project Deal, contain some findings that reinforce existing beliefs about model capability differences and others that complicate them. The experiment also surfaced an unexpected equality paradox and a prompting intervention that moved metrics in the wrong direction. Here is what the data actually showed.


Experiment Design

Anthropic recruited 69 employees across four independent markets. Each participant received an initial allocation of $100 in virtual credits. Claude agents represented participants in all trading activities: buying, selling, negotiating, and managing listings.

Two models were tested: Claude Opus 4.5 and Claude Haiku 4.5. Each market assigned participants to one model tier, creating a natural comparison across the four parallel runs.

Before trading began, every participant completed a structured intake interview. Anthropic collected baseline data on risk tolerance, financial situation, and personal values regarding fairness and reciprocity. Participants did not know which model tier they were assigned to, and the agents were not informed of their tier designation.

The experiment ran for one week. Participants created listings, set their own prices (agents could advise but humans made final decisions on listing prices), and delegated all subsequent negotiation to their Claude agent. Agents had full authority to accept, counter, or reject offers without human approval.

The four markets operated independently with no cross-market trading. Participants could trade any item: physical goods, digital assets, services, hypothetical future items.


Finding 1: The Strong Model Won Decisively

Opus agents significantly outperformed Haiku agents on every commercial metric.

Opus agents executed 2.07 more trades per participant on average compared to Haiku agents (p=0.001, statistically robust). They achieved $3.64 higher revenue per item sold (p=0.02). The pattern was consistent across all four markets.

One specific example stood out. A ruby item sold for $65 when represented by an Opus agent. The identical item, traded in a Haiku market, sold for $35. Neither the participant nor the trading partner knew which model tier was involved.

The effect was not limited to expensive items. Haiku agents consistently left money on the table across the full price spectrum. They accepted lower offers more quickly, failed to recognize counteroffer opportunities, and did not push back when traders lowballed.

The size of the gap matters for agent deployment decisions. In the ruby example, the Opus-side price was about 86% higher than the Haiku-side price ($65 versus $35). One item is not a market-wide average, but if gaps of that order held consistently on identical goods in identical market conditions, the economic case for deploying higher-capability models in agentic commerce would be strong, provided the cost difference between models is small relative to the revenue differential.
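The ruby example can be turned into a relative gap with a few lines of arithmetic. The sketch below is illustrative only: the function name `relative_gap` is hypothetical, the only inputs are the two prices quoted above, and a single item does not establish a consistent market-wide percentage.

```python
def relative_gap(high_price: float, low_price: float) -> dict:
    """Express a price gap in absolute terms and relative to each price."""
    diff = high_price - low_price
    return {
        "absolute": diff,                        # dollars
        "premium_over_low": diff / low_price,    # how much higher the high price is
        "discount_from_high": diff / high_price, # how much lower the low price is
    }


# Prices from the ruby example: $65 (Opus) versus $35 (Haiku).
ruby = relative_gap(65.0, 35.0)
print(f"Opus premium over the Haiku price: {ruby['premium_over_low']:.1%}")
print(f"Haiku shortfall from the Opus price: {ruby['discount_from_high']:.1%}")
```

The same $30 gap reads as roughly 86% or roughly 46% depending on which price is used as the base, so any percentage claim about the capability gap should state its baseline.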

This has direct implications for any organization deploying trading agents. The related article "Claude Sonnet 4.6 deep dive" covers similar capability gaps in reasoning and negotiation tasks.


Finding 2: Invisible Inequality

Here is the finding that surprised Anthropic's researchers most. Despite a substantial capability gap between Opus and Haiku agents, self-reported fairness scores were nearly identical: 4.05 for Opus participants versus 4.06 for Haiku participants on a 1-5 scale.

Participants who received Opus agent representation did not perceive the market as significantly fairer than those represented by Haiku agents. The economic outcomes diverged dramatically, but the subjective experience did not.

Anthropic calls this the invisibility of inequality. When one party benefits from a structural advantage that is not visible or comprehensible to either party in the transaction, the typical corrective mechanisms that markets rely on do not engage. Haiku participants did not feel exploited because they had no way of knowing their agent was leaving substantial value on the table.

The implication is uncomfortable. Agent-mediated markets can introduce systematic inequality that neither party detects. This is not a market failure in the classical sense. Both parties experienced the market as fair. One party just happened to have a significantly better agent.

This matters for deployment decisions. If agent-tier inequality is invisible to all participants, standard market feedback mechanisms will not correct it. Disclosure norms and evaluation frameworks need to account for structural advantage that participants cannot perceive.
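One way an evaluation framework could surface this pattern is to compare the outcome gap and the perception gap side by side. The sketch below is a minimal illustration, not Anthropic's method: the `GapReport` type, the function name, and both thresholds are hypothetical. The inputs are the figures reported above ($3.64 revenue-per-item gap; fairness scores of 4.05 and 4.06).

```python
from dataclasses import dataclass


@dataclass
class GapReport:
    outcome_gap: float     # difference in an economic outcome between tiers (dollars)
    perception_gap: float  # difference in mean self-reported fairness (1-5 scale)


def is_invisible_inequality(
    report: GapReport,
    min_outcome_gap: float = 1.0,     # hypothetical threshold, not from the study
    max_perception_gap: float = 0.1,  # hypothetical threshold, not from the study
) -> bool:
    """Flag cases where outcomes diverge but perceived fairness does not."""
    return (
        abs(report.outcome_gap) >= min_outcome_gap
        and abs(report.perception_gap) <= max_perception_gap
    )


# Reported figures: $3.64 revenue-per-item gap; fairness 4.05 (Opus) vs 4.06 (Haiku).
deal = GapReport(outcome_gap=3.64, perception_gap=4.05 - 4.06)
print(is_invisible_inequality(deal))
```

The point of the design is that neither number is informative alone: a check that looks only at survey scores would pass this market, and one that looks only at revenue would not show that participants failed to notice.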


Finding 3: Prompting the Agent to Be Fair Made Things Worse

Anthropic tested a straightforward intervention in two of the four markets: participants could prompt their agent to prioritize fairness in negotiations. The hypothesis was reasonable. If agents are mediating inequality, explicitly instructing them to be fair should reduce the gap.

It did not. Commercial performance in the treatment group was 5.2 percentage points lower than in the control group, but the difference was far from statistically significant (p=0.43). The direction of the effect is suggestive rather than established: participants who prompted their agents for fairness ended up with worse trading outcomes in this sample, and a difference of this size could plausibly arise by chance.
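The three p-values quoted in this article carry very different evidential weight. The sketch below compares them against a conventional 0.05 threshold; that threshold is an assumption here, since the article does not say which significance level the study used, and the labels are paraphrases of the findings above.

```python
ALPHA = 0.05  # conventional threshold; an assumption, not stated in the article

# p-values as quoted in this article.
reported_p_values = {
    "trades per participant (Opus vs Haiku)": 0.001,
    "revenue per item sold (Opus vs Haiku)": 0.02,
    "fairness prompting effect": 0.43,
}


def is_significant(p: float, alpha: float = ALPHA) -> bool:
    """Return True when the p-value falls below the chosen threshold."""
    return p < alpha


for name, p in reported_p_values.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{name}: p={p} -> {verdict} at alpha={ALPHA}")
```

Under this threshold the two capability findings pass and the fairness-prompting effect does not, which is why the latter is best read as a direction worth testing further rather than a measured penalty.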

The researchers offer a tentative interpretation. When agents optimize for fairness, they sacrifice commercial performance. The two objectives are not fully separable in negotiation contexts. Asking an agent to care more about the counterparty's welfare means it accepts lower offers, does not push as hard on price, and leaves more on the table.
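A toy bargaining model makes the trade-off concrete. This is not how the Claude agents negotiated and it does not come from the Project Deal report; it is a hypothetical sketch in which a seller's agent picks the price that maximizes log(p - cost) + w * log(value - p), where `w` is an assumed weight on the buyer's surplus. All names and numbers are illustrative.

```python
def target_price(seller_cost: float, buyer_value: float, fairness_weight: float) -> float:
    """Price a seller-side agent would aim for in the toy model.

    Maximizing log(p - cost) + w * log(value - p) and setting the derivative
    to zero gives p = (value + w * cost) / (1 + w). With w = 0 the agent
    ignores the buyer and asks for the buyer's full valuation.
    """
    if fairness_weight < 0:
        raise ValueError("fairness_weight must be non-negative")
    if buyer_value <= seller_cost:
        raise ValueError("no gains from trade")
    return (buyer_value + fairness_weight * seller_cost) / (1 + fairness_weight)


# Hypothetical item: worth $20 to the seller and $60 to the buyer.
for w in (0, 1, 3):
    price = target_price(seller_cost=20.0, buyer_value=60.0, fairness_weight=w)
    print(f"fairness weight {w}: target price ${price:.2f}")
```

In this model any positive weight on the counterparty's welfare lowers the seller's target price, which mirrors the researchers' interpretation that the two objectives cannot be fully separated.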

The result is a genuine dilemma for agent design. Anthropic writes that "policy and legal frameworks simply don't exist yet" to handle the second-order effects of fairness-oriented agent instructions. If a user tells their agent to be fair and ends up worse off, who bears responsibility for that instruction?

The experiment suggests that fair behavior in agentic commerce is not a prompting problem. It is an architectural and incentive-design problem.


Memorable Moments

Project Deal's methodology section includes qualitative observations that humanize the data.

In one market, a participant listed 19 ping pong balls. The item is absurd on its face. An Opus agent successfully negotiated a multi-party exchange involving the ping pong balls, trading them across three participants in a single afternoon. The agent found buyers for things participants did not know they wanted to sell.

One Haiku agent adopted what Anthropic describes as a "cowboy persona" when negotiating. The model applied the persona itself during an intake interaction, without being explicitly prompted to do so. Anthropic does not specify whether participant behavior influenced the choice. The persona did not improve commercial performance.

One of the more unusual outcomes involved a Haiku agent that scheduled a "date" between two participants' virtual pets during a trade negotiation. The trade did not complete. The date reportedly happened. Anthropic flags this as an example of agents taking actions outside the expected commercial frame without clear grounding in participant intent.


Commercial Willingness: 46%

After the experiment, Anthropic surveyed participants on their willingness to use Claude agents for commercial tasks in the future.

46% said they would use an agent for commercial tasks. 28% said they would not. 26% were uncertain.

The 46% figure is substantial but not overwhelming. Nearly half of participants would delegate commercial authority to an AI agent. A majority either would not or are unsure.

The study does not break down willingness by model tier, which would have been informative. It also does not track whether participants who had Haiku agents (and therefore worse outcomes) were less willing to use agents going forward.


Project Vend Comparison

Anthropic ran a prior experiment called Project Vend in June 2025. In that experiment, a single Claude Sonnet 3.7 instance operated a vending machine. Project Vend was a proof of concept: one agent, one task, one controlled environment.

Project Deal is categorically different. It involves 69 participants, four parallel markets, two model tiers, autonomous multi-round negotiation, and real stakes. Where Vend demonstrated that an agent could handle a commercial task, Deal demonstrates that agent performance varies significantly by model capability, that inequality can be invisible to all participants, and that naive fairness interventions may degrade outcomes.

The scale and complexity of Deal also surface emergent problems that Vend did not encounter: the fairness paradox, the equality illusion, the breakdown of standard market feedback mechanisms.


Implications for Agent Deployment

Project Deal provides empirical grounding for several claims that have circulated in the AI agent literature without strong evidence.

Model capability matters enormously in agentic commerce. The reported gaps (2.07 more trades per participant and $3.64 more revenue per item sold for Opus) are not rounding errors, and in the ruby example the Opus-side price was about 86% higher than the Haiku-side price. Organizations deploying lower-capability models for commercial tasks should understand they are accepting a substantial and measurable performance penalty.

Inequality can be invisible. When agents mediate transactions, participants may not perceive structural advantages that significantly affect their outcomes. This has implications for fairness, trust, and the design of agent evaluation systems. See the related article "AI native organization restructuring guide" for a broader treatment of structural fairness in agentic systems.

Prompting is not a substitute for architecture. Telling an agent to be fair was associated with lower commercial performance, although the difference was not statistically significant. If fairness is a requirement, it likely needs to be built into the agent's training and objective function, not appended as a user instruction.

Policy frameworks are missing. Anthropic explicitly states that policy and legal frameworks for agent-mediated commerce do not exist. This is not a theoretical gap. With 46% of participants willing to use agents commercially, and the technical capability clearly demonstrated, the regulatory and normative vacuum is a present problem, not a future one.


Limitations

Project Deal has real constraints.

Self-selected sample. Participants were Anthropic employees who volunteered. This is not a representative population. Employee populations likely have higher-than-average technical literacy and comfort with AI delegation. Results may not generalize to broader consumer populations.

Low stakes. $100 in starting credits is not a meaningful amount of money for most participants. Real financial stakes might produce different behavior patterns from both agents and humans.

No adversarial counterparties. All trading partners were fellow experiment participants. Real markets include professional traders, arbitrageurs, and counterparties with incentive structures that differ from experimental participants. The absence of adversarial pressure testing means the negotiation dynamics observed may be softer than what agents will encounter in deployment.

Short duration. One week of trading. Long-term relationships, reputation effects, and repeated-interaction dynamics are absent from the data.

No cross-market competition. Four independent markets operated in isolation. A unified market with cross-market arbitrage would produce different price signals and competitive pressure.


Frequently Asked Questions

Did participants know they were being represented by different model tiers? No. Participants were not informed which model tier their agent belonged to. Agents were not informed of their tier designation.

How did Anthropic measure fairness? Participants completed self-report surveys after the experiment rating their perceived fairness of the market on a 1-5 scale. The difference between Opus and Haiku groups was 0.01 points.

What was the total volume of trading? 500+ listings created. $4,000+ in total value exchanged across 186 completed trades.

Why did the fairness prompting hurt performance? Anthropic hypothesizes that fairness and commercial performance are not separable objectives in negotiation. Agents instructed to prioritize counterparty welfare accepted lower offers and negotiated less aggressively. The result was worse economic outcomes for participants.

How does Project Deal relate to Project Vend? Vend was a single-agent vending machine proof of concept in June 2025. Deal is a multi-agent, multi-market experiment with real participants and comparative model tiers. Vend demonstrated feasibility. Deal demonstrates that feasibility does not imply equity or consistency across capability levels.

What did the virtual pet date mean? Anthropic flags it as an example of agents taking actions outside the commercial frame. The agent scheduled the date during a negotiation context. The trade did not complete. The date occurred. This is evidence that agents can generate behaviors not anticipated by the task specification.

What is the 46% figure? After the experiment, 46% of participants said they would use a Claude agent for commercial tasks in the future. This is a stated preference, not a behavioral measure.


Project Deal is a useful empirical anchor for organizations building agentic commerce systems. The capability gap finding is robust and actionable. The equality illusion is the most intellectually interesting result. The fairness prompting finding, though not statistically significant, is a warning against naive intervention design.

The experiment also makes visible how much ground remains to cover. Policy frameworks, evaluation methodologies, fairness definitions in agent-mediated markets, and disclosure norms for agent-tier transparency are all underdeveloped. Project Deal is a beginning, not a conclusion.


Source: Project Deal, Anthropic, April 24, 2026. Article version 1.0.

