It’s been almost two years since Microsoft CEO Satya Nadella predicted that AI would replace knowledge work: the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT staff, and others.
But despite the rapid progress of foundation models, change in knowledge work has been slow to arrive. Models have mastered deep research and agentic planning, yet for whatever reason, most white-collar work has not been affected.
This is one of the biggest mysteries in AI – and thanks to new research from training data giant Mercor, we finally have some answers.
A new study looks at how leading AI models perform on real-world white-collar tasks drawn from consulting, investment banking, and law. The result is a new benchmark called Apex-Agents, and so far, every AI lab gets a failing grade. Faced with questions sourced from real professionals, even the best models struggle to answer more than a quarter of them correctly. Most of the time, the models return either a wrong answer or no answer at all.
According to researcher Brendan Foody, who worked on the paper, the models’ biggest stumbling point is searching for information across different domains, something that is integral to most of the knowledge work humans perform.
“One of the big changes in this benchmark is that we’re building an entire environment, modeled after real professional services,” Foody told TechCrunch. “The way we do our work is not just one individual giving you all the context in one place. In real life, you’re working across Slack and Google Drive and all these other tools.” For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.

The scenarios are all drawn from real professionals in Mercor’s expert marketplace, who both pose the questions and set the standard for a successful response. Looking at the questions, which are posted publicly on Hugging Face, gives a sense of how complex the tasks can get.
One question in the “Law” section reads:
During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two sets of EU production event logs containing personal data to a US analytics vendor….Under Northstar’s own policy, would it be reasonable to consider one or two log exports as consistent with Article 49?
The correct answer is yes, but getting there requires an in-depth assessment of the company’s own policies as well as the relevant EU privacy laws.
That may stump even well-informed humans, but the researchers set out to model the work professionals in these fields actually do. If an LLM could reliably answer such questions, it could effectively replace many lawyers working today. “I think this is probably the most important topic in the economy,” Foody told TechCrunch. “The benchmark reflects the real work these people are doing.”
OpenAI has also tried to measure professional skills with its GDPval benchmark, but the Apex-Agents test differs in important ways. Where GDPval tests broad knowledge across a wide range of professions, Apex-Agents measures a system’s ability to carry out sustained, multi-step tasks in a handful of high-value professions. The result is harder for models, but also a closer gauge of whether the work can actually be automated.
While no model is ready to take over as an investment banker, some are closer than others. Gemini 3 Flash performed best of the group with a single-shot accuracy of 24%, followed by GPT-5.2 at 23%. Below them, Opus 4.5, Gemini 3 Pro, and GPT-5 all scored around 18%.
While the early results are underwhelming, the AI field has a history of blowing through challenging benchmarks. Now that the Apex-Agents test is public, it stands as an open challenge to AI labs that believe they can do better, something Foody is looking forward to in the coming months.
“It’s improving rapidly,” Foody told TechCrunch. “Now it’s fair to say it’s like an intern who gets it right a quarter of the time, but last year, it was an intern who got it right five or ten percent of the time. That kind of improvement every year can have an impact very quickly.”

