The AI Intern Problem: Apex-Agents AI benchmark 2026 Exposes White-Collar Limits
The Apex-Agents AI benchmark 2026 shows that even GPT-5.2 and Gemini 3 Flash fail to exceed 25% accuracy in real-world professional work tasks.
Was Satya Nadella's prediction premature? Two years after the Microsoft CEO claimed AI would replace knowledge work, the office desks of lawyers and bankers remain firmly human-occupied. New research from training-data giant Mercor reveals a startling gap between AI's potential and its professional performance.
Apex-Agents AI benchmark 2026: A failing grade for top labs
According to the newly released Apex-Agents benchmark, even the most advanced models are failing to tackle real-world professional tasks. Drawn from consulting, law, and investment banking, the test results show that no model could correctly answer more than a quarter of the queries. Most often, the AI provided incorrect answers or simply gave up.
| Model | Apex-Agents Accuracy (One-shot) |
|---|---|
| Gemini 3 Flash | 24% |
| GPT-5.2 | 23% |
| Opus 4.5 | 18% |
| Gemini 3 Pro | 18% |
| GPT-5 | 18% |
The hurdle of multi-domain reasoning
The primary stumbling block isn't raw intelligence, but environment. Researcher Brendan Foody noted that real-world work happens across Slack, Google Drive, and proprietary tools. Current agentic AI models still struggle with this kind of multi-domain reasoning, which is essential for tasks like assessing EU privacy law compliance against internal logs.
This content is AI-generated based on source articles. While we strive for accuracy, errors may occur. We recommend verifying with the original source.
Related Articles
Motorola unveils a minimalist wearable AI pendant at CES 2026. Featuring the Qira AI assistant, this proof-of-concept device explores the future of agentic technology.
CES 2026 shifts the spotlight from flashy hardware to 'EV realism' and Agentic AI. Explore the latest in automotive technology trends, including Sony-Honda's AFEELA 1.
Explore NVIDIA's major announcements at Microsoft Ignite 2025, including Blackwell GPU expansion on Azure, Agent 365 integration, and SQL Server 2025 AI capabilities.
OpenAI is using automated red teaming with reinforcement learning to strengthen ChatGPT Atlas against prompt injection attacks, creating a proactive loop to discover and patch exploits early.