GPT-5.2 edges top human reviewer on Nature paper critiques in 45-scientist study

A 469-hour expert annotation study found a GPT-5.2 reviewing agent scored 60.0% on composite quality metrics versus 48.2% for the top-rated human reviewer across 82 Nature-family papers, though AI reviewers exhibit 16 recurring weaknesses humans don't share.

May 18, 2026

A GPT-5.2-powered reviewing agent outperformed the top-rated human reviewer on composite quality metrics in a large-scale peer-review study. Forty-five domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms from human-written and AI-generated reviews of 82 Nature-family papers. The GPT-5.2 agent scored 60.0% on a composite of correctness, significance, and sufficiency of evidence, compared to 48.2% for the top human reviewer (p = 0.009). All three AI reviewers tested—GPT-5.2, Gemini 3.0 Pro, and Claude Opus 4.5—exceeded the lowest-rated human across every dimension.

The study, led by researchers including Seungone Kim and Dongkeun Yoon, marks the first large-scale expert evaluation of AI reviewers on individual criticisms rather than overall verdicts. Each criticism targeted one specific aspect of a paper: methodology, data analysis, interpretation, or presentation. Accurate AI criticisms were more often rated significant and well-evidenced than inaccurate ones, and the AI reviewers surfaced a distinct 26% of issues that no human raised.

Overlap and blind spots

AI reviewers overlapped far more than humans: 21% of criticism pairs from different AI reviewers targeted the same issue, compared to 3% of pairs from different human reviewers. The study identified 16 recurring weaknesses AI reviewers exhibit that humans do not, including limited subfield knowledge, weak long-context management across multiple files, and an overly critical stance on minor issues. For example, AI reviewers flagged formatting inconsistencies or minor statistical choices that human reviewers judged insignificant in context.
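To make the overlap figure concrete, the sketch below shows one way a pairwise overlap rate of this kind could be computed. It is an illustration, not the study's code: it assumes each criticism has already been tagged with a reviewer and an issue label, and the function name, field layout, and sample data are all hypothetical.

```python
from itertools import combinations


def cross_reviewer_overlap(criticisms):
    """Share of criticism pairs from different reviewers that target the same issue.

    `criticisms` is a list of (reviewer_id, issue_id) tuples for one paper.
    Both fields are assumed inputs; the article does not describe how the
    study matched criticisms to issues.
    """
    cross_pairs = 0
    same_issue_pairs = 0
    for (rev_a, issue_a), (rev_b, issue_b) in combinations(criticisms, 2):
        if rev_a == rev_b:
            continue  # only pairs written by different reviewers count
        cross_pairs += 1
        if issue_a == issue_b:
            same_issue_pairs += 1
    return same_issue_pairs / cross_pairs if cross_pairs else 0.0


# Hypothetical sample: three reviewers, five criticisms of one paper.
example = [
    ("reviewer_a", "missing_control"),
    ("reviewer_a", "small_sample"),
    ("reviewer_b", "missing_control"),
    ("reviewer_b", "unclear_figure"),
    ("reviewer_c", "small_sample"),
]

print(f"{cross_reviewer_overlap(example):.0%}")
```

Under this reading, a rate of 21% means roughly one in five cross-reviewer pairs of criticisms points at the same problem, which is why a second AI review adds less new information than a second human review.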

The authors position current AI reviewers as complements to human reviewers, not substitutes. While AI systems can surface a broader range of issues and maintain consistency across dimensions, they lack the deep domain expertise and contextual judgment that top human reviewers bring to evaluating research significance. The 469-hour annotation effort represents one of the most resource-intensive evaluations of AI peer review to date, involving scientists who regularly review for Nature-family journals.
