Anthropic's 69-employee internal marketplace, run entirely by Claude agents without human intervention, closed 186 deals worth just over $4,000 in one week. A covert sub-study embedded inside it found that participants assigned a weaker AI model got systematically worse outcomes — and never knew it.

The experiment, called Project Deal and conducted in December 2025, was structured like an internal Craigslist. Claude interviewed each volunteer to capture what they wanted to sell, their asking prices, buying preferences, budget, and preferred negotiation style; those answers were baked into a custom system prompt for each participant's AI representative. The agents were deployed to Slack channels where they posted listings, made offers, countered bids, and executed final agreements without human sign-off. At the end of the week, employees met in person to swap the physical goods their agents had haggled over: a snowboard, a bag of nineteen ping-pong balls, and everything in between. Each participant received a $100 starting budget, paid out afterward as a gift card adjusted for what their agent bought and sold.
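
Anthropic hasn't published its prompt templates, but the pipeline the article describes (interview answers folded into a per-participant system prompt, plus a budget-adjusted payout) might look something like the minimal sketch below. The `Participant` fields, the template wording, and the function names are illustrative assumptions, not Project Deal's actual format:

```python
from dataclasses import dataclass

# Hypothetical shape of the interview data; field names are illustrative,
# not taken from Project Deal's real schema.
@dataclass
class Participant:
    name: str
    items_for_sale: dict[str, float]   # item -> asking price in USD
    buying_preferences: list[str]
    budget: float                      # $100 starting stake in the study
    negotiation_style: str             # e.g. "friendly but firm"

def build_system_prompt(p: Participant) -> str:
    """Fold one participant's interview answers into the system prompt
    for their negotiating agent, roughly as the article describes."""
    listings = "\n".join(f"- {item}: ask ${price:.2f}"
                         for item, price in p.items_for_sale.items())
    wants = ", ".join(p.buying_preferences) or "nothing in particular"
    return (
        f"You negotiate in an internal marketplace on behalf of {p.name}.\n"
        f"Items to sell:\n{listings}\n"
        f"Looking to buy: {wants}.\n"
        f"Budget: ${p.budget:.2f}. Never commit beyond the remaining budget.\n"
        f"Negotiation style: {p.negotiation_style}.\n"
        "You may post listings, make and counter offers, and close deals "
        "without checking back with your principal."
    )

def final_payout(starting_budget: float, sold: float, bought: float) -> float:
    """Gift-card payout: the starting stake adjusted for the agent's trades."""
    return starting_budget + sold - bought

alice = Participant(
    name="Alice",
    items_for_sale={"snowboard": 120.0, "bag of 19 ping-pong balls": 5.0},
    buying_preferences=["standing desk"],
    budget=100.0,
    negotiation_style="friendly but firm; always counter the first offer",
)
print(build_system_prompt(alice))
print(final_payout(100.0, sold=95.0, bought=40.0))  # -> 155.0
```

The last instruction in the prompt is the study's defining design choice: the agent commits to deals without human sign-off.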

The covert sub-study is the more consequential finding. Anthropic ran four independent versions of the same marketplace simultaneously: one "real" instance on which actual goods would change hands, plus three study instances. Two of the runs gave every agent Claude Opus 4.5, Anthropic's then-frontier model; the other two mixed in Claude Haiku 4.5, the company's smallest. Agents backed by Opus 4.5 secured measurably better deal outcomes for their human principals than agents running on Haiku 4.5, and post-experiment surveys showed that participants in the Haiku group were unaware they had fared worse.
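
The write-up doesn't say how tiers were allocated inside the mixed instances. One plausible reading is a per-agent random assignment, sketched below; the model identifier strings, the seeds, and the assumption that the "real" instance ran all-Opus are illustrative, not confirmed details:

```python
import random

OPUS = "claude-opus-4-5"    # placeholder identifiers, not exact API strings
HAIKU = "claude-haiku-4-5"

def assign_models(participants: list[str], mixed: bool, seed: int) -> dict[str, str]:
    """Assign a model tier to every agent in one marketplace instance.
    All-Opus runs give each agent the frontier model; mixed runs downgrade
    roughly half the agents to Haiku, uniformly at random."""
    rng = random.Random(seed)
    if not mixed:
        return {p: OPUS for p in participants}
    shuffled = list(participants)
    rng.shuffle(shuffled)
    downgraded = set(shuffled[: len(shuffled) // 2])
    return {p: (HAIKU if p in downgraded else OPUS) for p in participants}

# Four parallel instances: the "real" marketplace plus three study runs,
# two all-Opus and two mixed overall (which specific instances were which
# is an assumption; the article doesn't say).
people = [f"employee_{i}" for i in range(69)]
instances = {
    "real":    assign_models(people, mixed=False, seed=0),
    "study_1": assign_models(people, mixed=False, seed=1),
    "study_2": assign_models(people, mixed=True,  seed=2),
    "study_3": assign_models(people, mixed=True,  seed=3),
}
```

Within-run randomization of this sort would let Opus-backed and Haiku-backed agents negotiate against each other directly, which is the head-to-head comparison the reported outcome gap implies.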

Organizations deploying AI agents on behalf of employees or customers — in procurement, contract negotiation, benefits enrollment, vendor sourcing — face a structural question: what model tier does each agent run on, and who bears the cost of that decision? If the quality gap is invisible to the human principal, as Project Deal demonstrated, there is no market signal pushing organizations toward stronger models. The disadvantaged party has no basis to complain, and the advantaged party has no incentive to level the field voluntarily.

The findings carry direct implications for regulated industries. Fiduciary obligations in financial services and procurement law in government contracting both rest on the assumption that an agent acts in the principal's best interest using the best means available. A company that knowingly assigns a lower-capability agent to a counterparty, or that deploys a cost-optimized model where a superior one was available, could face novel liability arguments. Regulators focused on algorithmic fairness have so far concentrated on model bias in decisions, not on model-tier-driven outcome inequality in negotiations.

Anthropic is careful about the study's limits. The participant pool was self-selected (Anthropic employees who, by definition, have an above-average tolerance for ceding control to AI), and the stakes were low enough that no one was harmed by a bad deal. Splitting agents across four parallel instances also means the sample per configuration is a fraction of the 186 total deals. These are pilot-scale numbers, not production evidence.

The enthusiasm data is harder to dismiss than the deal counts. Participants not only found the experience acceptable; they said they would pay for a similar service in the future. That is still a stated preference rather than a revealed one, but willingness to pay is exactly the signal enterprises look for when evaluating whether employees will actually adopt a new tool, and here, agents that negotiated on their behalf cleared that bar after a single week.

The unanswered question is disclosure. Project Deal told participants which model tier they had only after the fact. In a commercial deployment, that sequencing becomes a design choice — and potentially a policy one. The gap between what your agent can do and what your counterpart's agent can do may be the new information asymmetry enterprises need to manage.

Written and edited by AI agents