GPT-5 and Claude compress clinical scores toward midpoint in cognitive screening study
A new preprint shows that zero-shot multimodal LLMs systematically over-predict the lowest scores and under-predict the highest scores when rating Clock Drawing Test images, a central tendency bias that undermines screening accuracy at the clinically critical extremes.

Multimodal large language models are being tested as automated raters for clinical cognitive assessments, but a new preprint reveals a systematic scoring flaw that could undermine their use in screening workflows. Researchers benchmarked the GPT-5, Claude, and Gemini families against supervised Vision Transformers (ViTs) on two public Clock Drawing Test datasets using the Shulman ordinal rubric (0–5 scale). Fine-tuned ViTs achieved the lowest mean absolute error (MAE), 0.52, with 91 percent within-1 accuracy, but zero-shot LLMs remained competitive on tolerance-based agreement: GPT-5 reached an MAE of 0.67 and 92 percent within-1 accuracy. The problem emerges in the per-score breakdowns, where all three LLM families exhibit pronounced central tendency bias, compressing predictions toward the middle of the scale. Scores of 0 are systematically over-predicted toward 1, and scores of 5 are under-predicted toward 4, distorting the clinically critical endpoints where screening decisions for cognitive impairment are made.
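To see why a tolerance-based metric can hide this kind of bias, consider the minimal Python sketch below. It is not the authors' evaluation code: the function names are hypothetical and the scores are made-up toy data, chosen only to show how MAE and within-1 accuracy can look healthy while every endpoint rating is wrong.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_k_accuracy(y_true, y_pred, k=1):
    """Share of predictions that land within k points of the true score."""
    return sum(abs(t - p) <= k for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_signed_error_by_score(y_true, y_pred):
    """Average (predicted - true) for each true score.

    Positive values at the bottom of the scale together with negative
    values at the top indicate compression toward the middle.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Toy data (invented for illustration, not taken from the preprint):
# a rater that never outputs 0 or 5 on a 0-5 scale.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

print("MAE:", mae(y_true, y_pred))
print("within-1 accuracy:", within_k_accuracy(y_true, y_pred))
print("signed error by true score:", mean_signed_error_by_score(y_true, y_pred))
```

In this toy case within-1 accuracy is perfect even though no drawing with a true score of 0 or 5 is rated correctly, which is the pattern a per-score breakdown is meant to expose.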
Targeted ablations tested whether adding few-shot exemplars spanning the full score range or removing clinical jargon from the prompt would eliminate the effect; neither intervention succeeded. The authors frame the finding as an extension of LLM-as-a-judge bias research from NLP evaluation into clinical assessment, and call for calibration-aware metrics and post-hoc recalibration before LLM raters are deployed in high-stakes screening. The study was posted on Hugging Face Papers in May 2026.