Stereological theory reveals LLM benchmark blind spots 100× larger than score gaps
A new arXiv preprint argues that structural uncertainty in LLM benchmarks dwarfs statistical noise: in simulations across three major leaderboards, the top-ranked model changed in 92% of trials.

A stereological theory of benchmark coverage posted to arXiv this week quantifies a fundamental blind spot in how LLM leaderboards rank models. The paper proves that, for any evaluation suite, the hidden distance between two capability profiles that produce identical scores grows exponentially with benchmark dimensionality. Empirically, that invisible gap is orders of magnitude larger than the score differences separating top-ranked models.
Analyzing three leaderboards (Open LLM v2, a 12-benchmark extended suite, and LiveBench), the authors found effective dimensionality between 2.86 and 4.80 on competitive frontiers. The structural blind spot exceeded the first- vs. second-place score gap by two orders of magnitude and dominated statistical noise by 52–127×. Under six different hidden-capability priors, random visible/held-out splits swapped the top-ranked model in 92% of trials, and on average 2.83 of the top 5 models changed position. In other words, two models with identical published scores could have radically different true capabilities, and a different held-out test set would likely rank them differently.
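To make the two quantities above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's procedure: effective dimensionality is computed as the participation ratio of the benchmark covariance matrix (one common definition; the preprint may use another), and the split test simply compares the best model on a random visible subset of benchmarks with the best model on the remaining ones, without the hidden-capability priors the authors use. The function names `participation_ratio` and `top_swap_rate` are hypothetical.

```python
import random


def participation_ratio(scores):
    """Effective dimensionality (tr C)^2 / tr(C^2) of the benchmark covariance C.

    `scores` is a list of rows, one per model, with one column per benchmark.
    Assumption: the participation ratio stands in for the paper's definition.
    """
    n, d = len(scores), len(scores[0])
    if n < 2:
        raise ValueError("need at least two models")
    means = [sum(row[j] for row in scores) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in scores]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))
    # For a symmetric matrix, tr(C^2) equals the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    if trace_sq == 0:
        raise ValueError("scores have no variance")
    return trace ** 2 / trace_sq


def top_swap_rate(scores, n_visible, trials=1000, seed=0):
    """Fraction of random visible/held-out splits whose top model differs."""
    d = len(scores[0])
    if not 0 < n_visible < d:
        raise ValueError("n_visible must leave at least one held-out benchmark")
    rng = random.Random(seed)
    models = range(len(scores))
    swaps = 0
    for _ in range(trials):
        cols = list(range(d))
        rng.shuffle(cols)
        visible, held_out = cols[:n_visible], cols[n_visible:]
        # Ties are broken in favour of the lower model index on both sides.
        best_visible = max(models, key=lambda m: sum(scores[m][j] for j in visible))
        best_held = max(models, key=lambda m: sum(scores[m][j] for j in held_out))
        swaps += best_visible != best_held
    return swaps / trials
```

A score table whose benchmarks all move together yields a participation ratio of 1 and a swap rate of 0; the further the ratio rises above 1, the more room there is for a held-out split to disagree with the visible ranking.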
A greedy algorithm identified a stable core of four benchmarks whose predictive power held across quarters (93–97% retention); seven of the twelve benchmarks suffice for 90% coverage. Counterfactual validation confirmed that the eigenstructure predicts which evaluations are irreplaceable (ρ = −0.69, p = 0.013) and which external evaluations add new information (ρ = +0.38). The paper also resolves a 31-year-old open problem in optimal recovery theory, establishing the minimax rate for support functions in general dimension.
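The preprint's greedy algorithm is not reproduced here, but the general idea of picking a small benchmark core can be sketched with standard greedy column subset selection. In this sketch, "coverage" is assumed to mean the fraction of total score variance that is linearly explained by the chosen benchmarks, which may differ from the paper's definition; the function name `greedy_core` and the `target` parameter are hypothetical.

```python
def greedy_core(scores, target=0.90):
    """Greedily pick benchmark columns until they explain `target` of the variance.

    `scores` is a list of rows, one per model, with one column per benchmark.
    Returns (chosen column indices in pick order, coverage reached).
    Assumption: coverage = share of total variance linearly explained.
    """
    n, d = len(scores), len(scores[0])

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Centre each column so squared norms are proportional to variances.
    cols = []
    for j in range(d):
        mean = sum(row[j] for row in scores) / n
        cols.append([row[j] - mean for row in scores])
    total = sum(dot(c, c) for c in cols)
    if total == 0:
        raise ValueError("scores have no variance")

    chosen, coverage = [], 0.0
    while coverage < target and len(chosen) < d:
        best, best_gain = None, 0.0
        for j in range(d):
            norm = dot(cols[j], cols[j])
            if j in chosen or norm <= 1e-12 * total:
                continue  # already picked, or fully explained by earlier picks
            # Variance across all residual columns that column j would explain.
            gain = sum(dot(c, cols[j]) ** 2 for c in cols) / norm
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break
        pivot = cols[best][:]
        norm = dot(pivot, pivot)
        # Project the chosen column out of every residual column.
        deflated = []
        for c in cols:
            coef = dot(c, pivot) / norm
            deflated.append([x - coef * p for x, p in zip(c, pivot)])
        cols = deflated
        chosen.append(best)
        coverage = 1.0 - sum(dot(c, c) for c in cols) / total
    return chosen, coverage
```

Called as `greedy_core(scores, target=0.90)` on a models-by-benchmarks table, it returns the picked columns in order. Redundant benchmarks drop out early because their residual variance vanishes once a correlated benchmark has been selected.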



