An AI agent struggling to navigate professional software environments

The AI Intern Problem: Apex-Agents AI benchmark 2026 Exposes White-Collar Limits


The Apex-Agents AI benchmark 2026 shows that even GPT-5.2 and Gemini 3 Flash fail to exceed 25% accuracy on real-world professional tasks.

Was Satya Nadella's prediction premature? Two years after the Microsoft CEO claimed AI would replace knowledge work, the office desks of lawyers and bankers remain firmly human-occupied. New research from training-data giant Mercor reveals a startling gap between AI's potential and its professional performance.

Apex-Agents AI benchmark 2026: A failing grade for top labs

According to the newly released Apex-Agents benchmark, even the most advanced models fail at real-world professional tasks. The tasks, drawn from consulting, law, and investment banking, proved too hard for every model tested: none correctly answered more than a quarter of the queries. Most often, the AI provided incorrect answers or simply gave up.

Model           | Apex-Agents Accuracy (One-shot)
Gemini 3 Flash  | 24%
GPT-5.2         | 23%
Opus 4.5        | 18%
Gemini 3 Pro    | 18%
GPT-5           | 18%

The hurdle of multi-domain reasoning

The primary stumbling block isn't raw intelligence, but environment. Researcher Brendan Foody noted that real-world work happens across Slack, Google Drive, and proprietary tools. Current agentic AI models still struggle with this kind of multi-domain reasoning, which is essential for tasks like assessing EU privacy law compliance against internal logs.
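To make the failure mode concrete, here is a minimal toy sketch (not Mercor's actual harness; all data, tool names, and the hard-coded parsing are hypothetical) of a task that can only be solved by cross-referencing two separate workplace sources, the way a compliance check spans a policy document and engineering chatter:

```python
# Hypothetical mini-environment: two isolated "tools" an agent must combine.
SLACK_MESSAGES = [
    "legal: reminder, EU user logs must be purged promptly",
    "eng: analytics pipeline retains raw logs for 90 days",
]
DRIVE_DOCS = {
    "retention_policy.txt": "Policy: EU personal data retention limit is 30 days.",
}

def search_slack(keyword):
    """Return Slack messages containing the keyword."""
    return [m for m in SLACK_MESSAGES if keyword in m]

def read_doc(name):
    """Return the contents of a stored document."""
    return DRIVE_DOCS[name]

def check_compliance():
    """Cross-reference stated policy with actual engineering practice.

    A real agent would have to parse both free-text sources itself;
    here the extracted numbers are hard-coded for illustration.
    """
    policy_text = read_doc("retention_policy.txt")      # source 1: Drive
    practice_msgs = search_slack("retains")             # source 2: Slack
    limit_days = 30    # parsed from policy_text in a real agent
    actual_days = 90   # parsed from practice_msgs in a real agent
    return actual_days <= limit_days

print(check_compliance())  # → False: practice violates the stated policy
```

Neither source alone reveals the violation; an answer based on only the policy document or only the chat log is wrong, which is the kind of cross-tool reasoning the benchmark reportedly stresses.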

This content is AI-generated based on source articles. While we strive for accuracy, errors may occur. We recommend verifying with the original source.
