Artificial Analysis Intelligence Index v4.0: Benchmarks Shift to Economic Utility
Artificial Analysis has released Intelligence Index v4.0, shifting AI benchmarking toward economic utility and real-world tasks. GPT-5.2 and Claude 4.5 take the lead.
The era of judging AI by its ability to pass multiple-choice tests is ending. Artificial Analysis, a leading independent benchmarking body, has launched its Intelligence Index v4.0, fundamentally resetting how the industry measures progress. The update addresses a 'saturation' problem in which frontier models were consistently maxing out traditional tests, making it nearly impossible for enterprises to differentiate between them.
Intelligence Index v4.0: Measuring Action Over Recall
The new index incorporates 10 evaluations covering agents, coding, and scientific reasoning. In a bold move, the organization scrapped long-standing benchmarks like MMLU-Pro and AIME 2025. In their place stands GDPval-AA, a test based on real-world tasks across 44 occupations. This shift represents a transition from measuring general knowledge to evaluating economically valuable actions—the kind of work people actually get paid to do.
Under the recalibration, the grading curve has become significantly steeper: top-tier models that previously scored 73 on the v3 scale now struggle to reach 50. Currently, OpenAI's GPT-5.2 with extended reasoning leads the pack with an Elo score of 1442, followed by Anthropic's Claude 4.5 Opus at 1403.
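For readers unfamiliar with Elo-style leaderboard scores, the update rule can be sketched in a few lines. This is a generic Elo sketch with an assumed K-factor of 32; it is not Artificial Analysis's actual scoring methodology, and the head-to-head matchup is purely illustrative.

```python
# Generic Elo rating sketch (illustrative; not the index's actual method).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, expected: float, actual: float, k: float = 32) -> float:
    """Move the rating toward the observed result (win=1, loss=0)."""
    return rating + k * (actual - expected)

# Using the reported ratings, GPT-5.2 (1442) vs Claude 4.5 Opus (1403):
p_a = expected_score(1442, 1403)  # the higher-rated model is a mild favorite
new_a = update(1442, p_a, 1.0)    # small rating gain for an expected win
```

Because the two ratings sit only 39 points apart, the model predicts a near coin-flip, which is why head-to-head evaluations need many matchups before the leaderboard stabilizes.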
The Hallucination Paradox and Scientific Limits
While productivity scores are rising, true reasoning remains a challenge. The CritPT evaluation, which uses graduate-level physics research problems, shows that even the most advanced systems are far from scientific discovery: GPT-5.2 topped this leaderboard with a score of just 11.5%.
Reliability also varies widely. The AA-Omniscience Index revealed that high accuracy doesn't always mean low hallucination. Google's Gemini 3 Pro led in knowledge accuracy at 54% but showed a higher hallucination rate than Anthropic's Claude 4.5 Sonnet, which posted a rate of 48%. For enterprise buyers, this distinction is critical for deployments in regulated industries.
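The accuracy-versus-hallucination split can be made concrete with toy numbers. The definitions and figures below are illustrative assumptions, not the AA-Omniscience formulas: accuracy is measured over all questions, while the hallucination rate counts wrong answers only among questions the model chose to answer.

```python
# Illustrative metrics (assumed definitions, not AA-Omniscience's formulas):
# a model that never abstains can score higher on accuracy while
# hallucinating more often than a model that declines when unsure.

def accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Correct answers over all questions asked."""
    return correct / (correct + incorrect + abstained)

def hallucination_rate(correct: int, incorrect: int) -> float:
    """Wrong answers among questions the model actually attempted."""
    return incorrect / (correct + incorrect)

# Model A answers everything; Model B abstains on 35 of 100 questions.
a_acc, a_hall = accuracy(54, 46, 0), hallucination_rate(54, 46)
b_acc, b_hall = accuracy(40, 25, 35), hallucination_rate(40, 25)
```

Here Model A wins on accuracy (54% vs 40%) yet hallucinates more often (46% vs roughly 38%), which is the pattern the paragraph above describes: in regulated deployments, a model that abstains can be the safer choice despite a lower headline score.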