Visual Aesthetic Benchmark exposes 42-point gap between frontier models and human judges
A new expert-labeled dataset reveals that even the strongest multimodal model correctly identifies both the best and worst image in a set only 26.5% of the time, compared with 68.9% for human judges.
Frontier multimodal models pitched as judges of visual aesthetics are missing the mark by a wide margin, according to a new benchmark that pits them against expert human judges on comparative selection tasks.
The Visual Aesthetic Benchmark (VAB), introduced in a preprint released May 14, contains 400 tasks and 1,195 images spanning fine art, photography, and illustration. Each task asks the evaluator to pick the best and worst image from a candidate set with matched subject matter. Ten independent expert judges labeled each task, and the authors used consensus labels as ground truth. When researchers tested 20 frontier MLLMs and six dedicated visual-quality reward models, the strongest system correctly identified both the best and worst image across three random permutations of the candidate order in only 26.5% of tasks. Human experts achieved 68.9% on the same metric.
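For readers who want to reproduce that headline number on their own model, the scoring rule is strict: a task counts as correct only if both picks match the expert consensus under every random reordering of the candidates. Below is a minimal Python sketch of that rule, assuming a hypothetical judge_best_worst callable and task fields named images, best, and worst; the released evaluation code may structure this differently.

```python
import random

def permutation_consistent_accuracy(tasks, judge_best_worst, n_perms=3, seed=0):
    """Count a task as correct only if the model picks both the consensus-best
    and consensus-worst image under every random ordering of the candidates."""
    rng = random.Random(seed)
    correct = 0
    for task in tasks:
        candidates = list(task["images"])            # candidate image IDs
        ok = True
        for _ in range(n_perms):
            rng.shuffle(candidates)                  # re-order candidates each pass
            best, worst = judge_best_worst(candidates)  # model's two picks
            if best != task["best"] or worst != task["worst"]:
                ok = False
                break
        correct += ok
    return correct / len(tasks)
```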
The gap stems partly from how models are typically trained. Most existing aesthetic-evaluation systems reduce judgment to a scalar score for a single image. The authors ran a controlled study with eight expert annotators and found that score-derived rankings align poorly with the same annotators' direct pairwise comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. That mismatch suggests scalar-score training doesn't capture the comparative signal that humans use when they actually judge aesthetics.
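The two measurements behind that finding are simple to state: how often the winner implied by per-image scores matches a direct pairwise pick, and how often annotators agree on a task's best and worst image. The sketch below illustrates both, using made-up record layouts (field names like "direct_pick" are assumptions, not the paper's annotation schema).

```python
from itertools import combinations

def score_vs_direct_agreement(pairs):
    """Fraction of pairwise judgments where the winner implied by scalar scores
    matches the same annotator's direct pick ("a" or "b")."""
    agree = sum(
        ("a" if p["score_a"] > p["score_b"] else "b") == p["direct_pick"]
        for p in pairs
    )
    return agree / len(pairs)

def best_worst_agreement(labels_by_annotator):
    """Fraction of annotator pairs that agree on both the best and worst image
    of a single task; each label is a (best_id, worst_id) tuple."""
    annotator_pairs = list(combinations(labels_by_annotator, 2))
    return sum(a == b for a, b in annotator_pairs) / len(annotator_pairs)
```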
Fine-tuning a 35B-parameter model on 2,000 expert examples from VAB brought its accuracy close to that of a 397B-parameter open-weight model, suggesting the comparative signal is transferable and that the benchmark can guide future training. The dataset and evaluation code are available on HuggingFace.
