Frontier models show 5× variance in coding-to-reasoning efficiency across labs

A new arXiv preprint analyzes 34 frontier models from 10 labs and finds that coding and reasoning capabilities now correlate at r=+0.72, while per-lab conversion efficiency varies fivefold. The authors read this as a sign of leaderboard saturation and of the need for new benchmarks.

May 21, 2026

Leaderboards rank frontier models on independent axes, but they do not reveal whether capabilities reinforce or trade off across releases. At the frontier, that interaction is the more informative signal. A preprint released this week decomposes paired SWE-bench and GPQA Diamond scores into a population-level coupling trend and a per-release residual. The residual diagnoses each release's capability emphasis and identifies which measurement or stress test would be most informative next. Across 34 models from 10 labs released between 2024 and 2026, the two capabilities cooperate at r=+0.72 (p<10⁻⁶), but the degree of cooperation varies by lab and over time: DeepSeek reversed from reasoning-rich to coding-first (h-field swing from +11.2 to −4.7, a 15.9-percentage-point shift); Google maintains a consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery.
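To make the decomposition concrete, here is a minimal sketch of one way such a trend-plus-residual split could be computed. It assumes an ordinary least-squares line with reasoning score regressed on coding score, so that a positive residual means "more reasoning than the trend predicts" (matching the sign of the reasoning-rich reading above). The regression direction, the function names, and the sample scores are all illustrative assumptions; the preprint's actual procedure may differ, and the numbers below are placeholders rather than data from the paper.

```python
from statistics import mean


def fit_trend(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def residuals(xs, ys):
    """Per-point deviation of y from the fitted population trend."""
    a, b = fit_trend(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Placeholder scores in percent. These are NOT figures from the preprint.
coding = [30.0, 45.0, 60.0, 70.0, 80.0]      # hypothetical SWE-bench scores
reasoning = [50.0, 55.0, 75.0, 72.0, 90.0]   # hypothetical GPQA Diamond scores

# Positive entries: reasoning-rich relative to trend; negative: coding-first.
h = residuals(coding, reasoning)
for c, r, res in zip(coding, reasoning, h):
    print(f"coding={c:5.1f} reasoning={r:5.1f} residual={res:+.2f}")
```

Under this reading, a lab whose successive releases move from a positive to a negative residual has shifted emphasis from reasoning toward coding, which is the kind of swing reported for DeepSeek.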

Six open-weight architectures confirm a second capability transition at 30–72B parameters. SWE-bench is now saturating, while HLE (Humanity's Last Exam) and instruction-following benchmarks retain discriminatory spread, which signals the next axis rotation. Per-lab coupling slopes vary fivefold (Google 1.15 vs. DeepSeek 0.23), quantifying how efficiently each lab's recipe converts coding gains into reasoning gains. Five April 2026 releases confirm the diagnostic out of sample, lifting the correlation from +0.72 to +0.75. The authors provide a three-level playbook (locate, diagnose, rotate), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases, tracked on an interactive dashboard.
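The fivefold figure follows directly from the two reported slopes. The short sketch below checks the ratio and shows what each slope would imply for a hypothetical 10-point coding gain. It assumes the slope is read as reasoning points gained per coding point gained, and that the relationship is linear over that range. Both are interpretive assumptions, not statements from the preprint.

```python
# Per-lab coupling slopes as reported in the article.
slopes = {"Google": 1.15, "DeepSeek": 0.23}

# Ratio of the steepest to the shallowest slope.
ratio = slopes["Google"] / slopes["DeepSeek"]
print(f"slope ratio: {ratio:.2f}")

# Hypothetical coding gain in percentage points (illustrative only).
coding_gain = 10.0

# Implied reasoning gain under the assumed linear reading of the slope.
implied = {lab: slope * coding_gain for lab, slope in slopes.items()}
for lab, gain in implied.items():
    print(f"{lab}: +{gain:.1f} reasoning points per +{coding_gain:.0f} coding points")
```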
