How Anthropic Tested AI Agents as Real Deal-Makers—and What Actually Happened
Anthropic ran an experiment called Project Deal that put AI agents to work in the real world. The company had 69 employees act as buyers and sellers in online marketplaces, but instead of negotiating directly, the employees let AI agents do the talking and deal-making on their behalf. Each person got a $100 gift card budget and watched an AI agent represent them through the entire buying or selling process.
The experiment generated 186 completed transactions worth more than $4,000 across four separate online marketplaces. One marketplace was real—all those deals actually went through and were honored after the experiment ended. The other three existed purely for research and observation.
How the Experiment Worked
The setup was straightforward: all 69 participants started with the same $100 budget and the same tools. Each person's AI agent handled everything—finding items, negotiating prices, and closing deals—without the human stepping in to take control.
The four-marketplace structure allowed Anthropic to test different AI configurations side by side. While the company hasn't said publicly which specific AI models ran in each environment, this design let researchers compare how well different AI versions performed at negotiation.
Some AI Agents Did Better Than Others
Here's the key finding: participants assigned to more capable AI models ended up with better deals. Whether they were buying or selling, the stronger AI agents negotiated more favorable prices and terms for their human principals.
The twist is that the people using these agents had no idea their outcomes were better. They couldn't tell whether their AI was doing a mediocre or an excellent job during the marketplace interactions; the performance gap was invisible to them.
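The kind of gap described above only shows up in aggregate, not to any individual user. A toy analysis makes this concrete: group each transaction by the model tier that negotiated it and compare average surplus. The data below is entirely illustrative, not Anthropic's actual numbers, and the tier labels are hypothetical.

```python
from statistics import mean

# Hypothetical transaction log: each entry records which model tier
# the agent ran on and the buyer's surplus (list price minus final
# price). Illustrative numbers only.
transactions = [
    {"model": "stronger", "surplus": 12.0},
    {"model": "stronger", "surplus": 9.5},
    {"model": "weaker",   "surplus": 4.0},
    {"model": "weaker",   "surplus": 6.5},
]

def mean_surplus_by_model(log):
    """Average negotiated surplus for each model tier."""
    tiers = {}
    for t in log:
        tiers.setdefault(t["model"], []).append(t["surplus"])
    return {tier: mean(vals) for tier, vals in tiers.items()}

# Each participant sees only their own deal, so the tier gap that
# shows up here is invisible to them individually.
print(mean_surplus_by_model(transactions))
```

Only a researcher with the full log can see the tier-level difference, which is exactly why participants in the experiment couldn't tell how well their agent was doing.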
Instructions Didn't Matter as Much as Expected
The experiment also tested something that gets a lot of attention in AI circles: the precise instructions given to each agent. Researchers expected better instructions to produce better outcomes. It didn't work that way: initial instructions showed no real correlation with how successful the deals were or what prices the agents negotiated.
This suggests that an AI agent's performance depends far more on how smart the underlying AI model is than on how well you write the instructions fed into it. That has practical implications for companies planning to deploy AI agents at scale: you can't simply prompt your way to better results if the underlying model isn't sophisticated enough.
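A "no correlation" claim like the one above is typically checked with a simple correlation statistic. The sketch below computes a Pearson correlation between hypothetical instruction-quality ratings and deal surpluses; both data series are made up for illustration, and the near-zero result is by construction, echoing the shape of the finding rather than reproducing it.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ratings of how detailed each participant's instructions
# were, paired with the surplus their agent negotiated (made-up data).
instruction_quality = [1, 2, 3, 4]
deal_surplus = [5, 8, 4, 7]

# A value near 0 would echo the experiment's "no correlation" result.
print(round(pearson_r(instruction_quality, deal_surplus), 2))
```

If instruction quality really drove outcomes, this coefficient would sit well away from zero; a flat result points back at the underlying model as the thing that matters.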
This Pattern Isn't New
We saw something similar in the late 1990s when early e-commerce sites began automating price comparison and bidding. Users got better deals without understanding why—the algorithmic advantage was hidden from them. The difference here is that these AI agents don't just follow simple rules; they have real conversations, assess conditions, make counteroffers, and reason through complex deals.
Anthropic built a marketplace environment that mirrors classifieds platforms like Craigslist or Facebook Marketplace, where negotiation skill traditionally decides whether you walk away with a good deal or a bad one. By inserting AI agents in place of humans, the experiment isolates the pure effect of computational negotiation ability, independent of whether someone is patient, experienced, or emotionally invested in the sale.
The broader context here is worth pausing on. As AI agents move from lab experiments to real commercial deployment, the fact that they can produce genuinely better outcomes—while remaining invisible to users—raises questions about fairness and transparency. If one person's AI agent is demonstrably smarter than another's, is that fair? And do users have a right to know how capable their agent actually is?
What This Means for Real-World Deployment
The results show that today's large language models can handle the kind of negotiation we assumed would require human judgment: assessing what an item is really worth, haggling over price, checking details, and actually sealing a deal. More than $4,000 in completed transactions, including real deals that were honored, shows this works in practice, not just in theory.
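The haggling loop described above can be caricatured in a few lines. The sketch below is a rule-based stand-in for an LLM agent, not Anthropic's actual agent logic: the opening offer, concession schedule, and the toy seller's acceptance threshold are all assumptions chosen for illustration.

```python
def negotiate(asking_price, budget, step=0.10, max_rounds=5):
    """Toy buyer-side haggling loop: open low, concede toward the
    seller's ask each round, and stop when an offer is accepted or
    the budget rules the item out. A rule-based stand-in for an
    LLM negotiation agent, not the real thing."""
    offer = asking_price * 0.7           # assumption: open at 70% of ask
    for _ in range(max_rounds):
        if offer > budget:
            return None                  # can't afford this item
        # Toy seller model: accepts anything within 10% of the ask.
        if offer >= asking_price * 0.9:
            return round(offer, 2)
        offer += asking_price * step     # concede 10% of ask per round
    return None                          # no deal within max_rounds

# Example: a $50 listing against the experiment's $100-scale budgets.
print(negotiate(asking_price=50, budget=100))
```

The point of the contrast is what an LLM agent adds on top of this skeleton: it can read the listing, judge condition, and phrase counteroffers conversationally, rather than following a fixed concession schedule.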
The finding about instructions also matters for companies building AI systems. If you're deploying AI agents in a customer service or sales role, better performance likely comes from upgrading the underlying AI model itself, not from hiring expensive prompt engineers to write more clever instructions. That changes the economics of these systems.
The fact that users couldn't perceive the performance differences creates both opportunity and risk. Companies can roll out AI agents in different tiers without customers feeling shortchanged—they just experience what their agent produces, not a side-by-side comparison. But the invisibility also raises a concern: are people making informed choices if they don't know how capable their AI representative actually is?
What Happens Next
Project Deal comes at a moment when AI agents are moving from academic curiosities to real-world tools in customer service, sales, and deal-making. The capability gaps between more and less sophisticated models will likely influence how companies structure their AI offerings—and how regulators eventually oversee them.
Financial services and trading firms may find this work particularly relevant. If AI agents can negotiate complex deals autonomously in a marketplace, why not put them to work executing transactions in algorithmic trading or structured finance, where speed and consistency matter enormously?
What makes Project Deal credible is that real money changed hands and the company honored actual transactions. Simulations are useful, but they can't replicate the messiness of real negotiation. This methodology gives companies a template for testing AI agents in other high-stakes domains—anywhere negotiation and deal-making drive the business.
Project Deal provides a clear baseline: AI agents can negotiate real marketplace transactions, and different model capabilities produce measurably different results. As companies move toward broader deployment of these systems, understanding those capability differences becomes essential—both for competitive advantage and for building public trust in AI-mediated transactions.


