An AI agent struggling to navigate professional software environments

The AI Intern Problem: Apex-Agents AI benchmark 2026 Exposes White-Collar Limits


The Apex-Agents AI benchmark 2026 shows that even GPT-5.2 and Gemini 3 Flash fail to exceed 25% accuracy on real-world professional tasks.

Was Satya Nadella's prediction premature? Two years after the Microsoft CEO claimed AI would replace knowledge work, the office desks of lawyers and bankers remain firmly human-occupied. New research from training-data giant Mercor reveals a startling gap between AI's potential and its professional performance.

Apex-Agents AI benchmark 2026: A failing grade for top labs

According to the newly released Apex-Agents benchmark, even the most advanced models fail at real-world professional tasks. The tasks, drawn from consulting, law, and investment banking, proved too hard for every model tested: none correctly answered more than a quarter of the queries. Most often, the AI provided incorrect answers or simply gave up.

Model           | Apex-Agents Accuracy (One-shot)
Gemini 3 Flash  | 24%
GPT-5.2         | 23%
Opus 4.5        | 18%
Gemini 3 Pro    | 18%
GPT-5           | 18%

The hurdle of multi-domain reasoning

The primary stumbling block isn't raw intelligence, but environment. Researcher Brendan Foody noted that real-world work happens across Slack, Google Drive, and proprietary tools. Current agentic AI models still struggle with this kind of multi-domain reasoning, which is essential for tasks like assessing EU privacy law compliance against internal logs.
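To make the failure mode concrete, here is a minimal toy sketch (not Mercor's actual harness; all data, tool names, and the hard-coded parsing are hypothetical) of a task that can only be solved by cross-referencing two separate workplace sources, the way a compliance check spans a policy document and engineering chatter:

```python
# Hypothetical mini-environment: two isolated "tools" an agent must combine.
SLACK_MESSAGES = [
    "legal: reminder, EU user logs must be purged promptly",
    "eng: analytics pipeline retains raw logs for 90 days",
]
DRIVE_DOCS = {
    "retention_policy.txt": "Policy: EU personal data retention limit is 30 days.",
}

def search_slack(keyword):
    """Return Slack messages containing the keyword."""
    return [m for m in SLACK_MESSAGES if keyword in m]

def read_doc(name):
    """Return the contents of a stored document."""
    return DRIVE_DOCS[name]

def check_compliance():
    """Cross-reference stated policy with actual engineering practice.

    A real agent would have to parse both free-text sources itself;
    here the extracted numbers are hard-coded for illustration.
    """
    policy_text = read_doc("retention_policy.txt")      # source 1: Drive
    practice_msgs = search_slack("retains")             # source 2: Slack
    limit_days = 30    # parsed from policy_text in a real agent
    actual_days = 90   # parsed from practice_msgs in a real agent
    return actual_days <= limit_days

print(check_compliance())  # → False: practice violates the stated policy
```

Neither source alone reveals the violation; an answer based on only the policy document or only the chat log is wrong, which is the kind of cross-tool reasoning the benchmark reportedly stresses.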

This content is AI-generated based on source articles. While we strive for accuracy, errors may occur. We recommend verifying with the original source.
