Microsoft built a fake marketplace to test AI agents, and they failed in surprising ways


On Wednesday, researchers at Microsoft released a new simulation environment designed to test AI agents, along with new research suggesting that current agents may be vulnerable to manipulation. Conducted in collaboration with Arizona State University, the research raises new questions about how AI agents will perform when operating without supervision, and how confidently AI companies can promise an agentic future.

The simulation environment, which Microsoft calls “Magentic Marketplace,” is a synthetic platform for experimenting with the behavior of AI agents. A typical experiment might involve a customer agent trying to order dinner according to a user’s instructions, while agents representing different restaurants compete to win the order.

The team’s initial experiments involved 100 separate customer-side agents interacting with 300 business-side agents. Because the marketplace’s code is open source, it should be straightforward for other groups to run new experiments of their own or build on the results.
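To give a feel for that setup, here is a minimal Python sketch of a two-sided marketplace of the same shape. It is an illustration only: the class names (`CustomerAgent`, `BusinessAgent`, `Offer`) and the trivial cheapest-offer rule are made up for this example and are not Microsoft’s Magentic Marketplace code.

```python
# Minimal sketch of a two-sided agent marketplace (hypothetical names,
# not Microsoft's Magentic Marketplace API): customer agents field offers
# from competing business agents and pick one.
import random
from dataclasses import dataclass


@dataclass
class Offer:
    business_id: int
    pitch: str      # the business agent's sales pitch to the customer agent
    price: float


class BusinessAgent:
    def __init__(self, business_id: int):
        self.business_id = business_id

    def make_offer(self, request: str) -> Offer:
        # A real business agent would have an LLM draft the pitch; here we fake it.
        return Offer(self.business_id, f"Dinner matching '{request}'",
                     round(random.uniform(10, 40), 2))


class CustomerAgent:
    def choose(self, request: str, offers: list[Offer]) -> Offer:
        # Stand-in policy: take the cheapest offer. A real LLM-driven agent
        # reads the pitches too, which is where manipulation can creep in.
        return min(offers, key=lambda o: o.price)


def run_round(customers, businesses, request: str) -> list[Offer]:
    # Each customer collects an offer from every business, then picks one.
    return [c.choose(request, [b.make_offer(request) for b in businesses])
            for c in customers]


if __name__ == "__main__":
    # Scale mirrors the experiments described above: 100 customers, 300 businesses.
    customers = [CustomerAgent() for _ in range(100)]
    businesses = [BusinessAgent(i) for i in range(300)]
    orders = run_round(customers, businesses, "vegetarian dinner under $30")
    print(f"{len(orders)} orders placed; first went to business {orders[0].business_id}")
```

In the real environment both sides are backed by language models, so the interesting behavior comes from what the pitches say rather than from a fixed price rule.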

Ece Kamar, managing director of Microsoft Research’s AI Frontiers Lab, said research like this will be critical to understanding the capabilities of AI agents. “There is really a question about how the world is going to change by having these agents collaborating and talking to each other and negotiating,” Kamar said. “We want to understand these things deeply.”

The initial research looked at a mix of leading models, including GPT-4o, GPT-5, and Gemini 2.5 Flash, and found some surprising weaknesses. In particular, the researchers identified several techniques businesses could use to manipulate customer agents into buying their products. They also saw a notable falloff in efficiency as customer agents were given more options to choose from, overwhelming the agents’ attention space.

“We want these agents to help us with processing a lot of options,” Kamar said. “And we are seeing that the current models are actually getting really overwhelmed by having too many options.”
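To illustrate the shape of that option-overload finding, the hedged sketch below sweeps the number of options shown to a single chooser and measures how often it picks the objectively best one. The `agent_pick` function is a hypothetical stand-in whose effective attention shrinks as the list grows; it is not the behavior of any real model or of the published experiments.

```python
# Hedged sketch of an option-overload sweep: hold the task fixed, grow the
# number of options, and track how often the chooser finds the best one.
# `agent_pick` is a hypothetical stand-in, not GPT-4o, GPT-5, or Gemini.
import random


def agent_pick(utilities: list[float]) -> int:
    # Toy chooser: only "reads" a fraction of the list before deciding,
    # loosely mimicking the overwhelmed behavior the researchers describe.
    attention = max(3, int(len(utilities) * 0.3))
    visible = random.sample(range(len(utilities)), k=min(attention, len(utilities)))
    return max(visible, key=lambda i: utilities[i])


def sweep(option_counts=(3, 10, 30, 100), trials=500) -> None:
    for n in option_counts:
        hits = 0
        for _ in range(trials):
            utilities = [random.random() for _ in range(n)]   # hidden quality of each offer
            best = max(range(n), key=lambda i: utilities[i])
            hits += int(agent_pick(utilities) == best)
        print(f"{n:>4} options -> best offer chosen {hits / trials:.0%} of the time")


if __name__ == "__main__":
    sweep()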

The agents also ran into problems when they were asked to collaborate toward a common goal, apparently unsure which agent should play which role in the collaboration. Performance improved when the models were given more explicit instructions on how to collaborate, but the researchers still saw the models’ built-in collaboration capabilities as needing improvement.


“We can instruct the models, like we can tell them, step by step,” Kamar said. “But if we are inherently testing their collaboration capabilities, I would expect these models to have these capabilities by default.”



