PlantMarkerBench: 5,550 annotations test AI's grasp of plant gene evidence in papers
A new benchmark evaluates how well language models extract and classify marker-gene evidence from plant biology papers across Arabidopsis, maize, rice, and tomato.
PlantMarkerBench, a multi-species benchmark from researchers at Shanghai Jiao Tong University and the University of Georgia, tests whether language models can correctly interpret marker-gene evidence from full-text biological papers. The dataset covers Arabidopsis, maize, rice, and tomato — four widely studied plant species — and includes 5,550 sentence-level annotations drawn from scientific literature. Each instance is labeled for whether the sentence provides valid marker evidence for a specific gene–cell-type pair, the type of evidence (expression, localization, function, indirect, or negative), and the strength of support.
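To make that schema concrete, here is a minimal sketch in Python of what a single annotation record might look like. The field names and label values are assumptions inferred from the description above, not the dataset's actual column names, and the WOX5 example sentence is invented for illustration.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record structure inferred from the benchmark's description;
# the actual PlantMarkerBench fields may be named differently.
@dataclass
class MarkerAnnotation:
    species: str                  # one of the four covered species
    gene: str                     # gene identifier for the marker claim
    cell_type: str                # cell type the gene putatively marks
    sentence: str                 # candidate sentence from the full text
    is_valid_evidence: bool       # Task 1: does the sentence support the claim?
    evidence_type: Literal[       # Task 2: five-way evidence category
        "expression", "localization", "function", "indirect", "negative"
    ]
    support_strength: str         # strength of support, e.g. "strong" / "weak"

# Illustrative instance (not drawn from the dataset itself).
example = MarkerAnnotation(
    species="Arabidopsis thaliana",
    gene="WOX5",
    cell_type="quiescent center",
    sentence="WOX5 transcripts were detected specifically in the quiescent center.",
    is_valid_evidence=True,
    evidence_type="expression",
    support_strength="strong",
)
```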
The benchmark defines two core tasks: determining whether a candidate sentence actually supports a marker claim, and classifying the evidence into one of five categories. The authors built the dataset using a modular pipeline that combines large-scale literature retrieval, hybrid search, species-aware biological grounding, structured extraction, and targeted human review. The result is a reproducible framework for evaluating how well models handle the nuanced, context-dependent reasoning required to validate biological claims from primary literature.
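A toy scoring harness, written against the hypothetical record structure sketched above, shows how the two tasks decompose into a binary check and a five-way classification. This is an illustrative sketch, not the authors' official evaluation code.

```python
from collections import Counter

EVIDENCE_TYPES = ["expression", "localization", "function", "indirect", "negative"]

def score_predictions(records, predictions):
    """Score a model on the benchmark's two tasks.

    `records` are gold annotations and `predictions` are model outputs,
    both dicts with 'is_valid_evidence' (bool) and 'evidence_type' (str).
    Illustrative harness only; field names are assumptions.
    """
    task1_correct = task2_correct = 0
    confusions = Counter()
    for gold, pred in zip(records, predictions):
        # Task 1: binary evidence validation.
        task1_correct += gold["is_valid_evidence"] == pred["is_valid_evidence"]
        # Task 2: five-way evidence-type classification.
        if gold["evidence_type"] == pred["evidence_type"]:
            task2_correct += 1
        else:
            confusions[(gold["evidence_type"], pred["evidence_type"])] += 1
    n = len(records)
    return {
        "task1_accuracy": task1_correct / n,
        "task2_accuracy": task2_correct / n,
        "top_confusions": confusions.most_common(5),  # e.g. function -> expression
    }
```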
Performance gaps
Frontier closed-source models perform reasonably well on direct expression evidence — sentences that explicitly state a gene is expressed in a cell type — but struggle with functional, indirect, and weak-support evidence. Evidence-type confusion is the dominant failure mode: models frequently misclassify functional evidence as expression evidence, or indirect evidence as direct. Open-weight models show elevated false-positive rates when biological context is ambiguous, often endorsing sentences that mention a gene and a cell type in the same paragraph without establishing a causal or spatial relationship.
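The false-positive failure mode can be quantified with a small helper over the same hypothetical record format. The gene name and sentence in the docstring are invented for illustration, not taken from the benchmark.

```python
def false_positive_rate(records, predictions):
    """Fraction of gold-negative sentences the model wrongly endorses.

    Captures the co-mention trap described above: a sentence like
    "SHR and endodermal cells were both observed in the elongation zone"
    names a gene and a cell type without establishing expression,
    localization, or function, and should be rejected as evidence.
    Illustrative helper, not part of the benchmark's official tooling.
    """
    negatives = [(g, p) for g, p in zip(records, predictions)
                 if not g["is_valid_evidence"]]
    if not negatives:
        return 0.0
    return sum(p["is_valid_evidence"] for _, p in negatives) / len(negatives)
```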
The benchmark is available on HuggingFace. The authors frame it as a test bed for literature-grounded biological evidence attribution and a resource for improving AI-assisted plant biology workflows. The dataset's multi-species design and fine-grained evidence taxonomy make it a more challenging evaluation than existing marker-gene databases, which typically rely on curated annotations without modeling the underlying textual support.
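Loading the data would presumably follow the standard HuggingFace `datasets` pattern; note that the repository ID below is a placeholder, since the exact identifier is not stated here.

```python
from datasets import load_dataset

# Placeholder repo ID: substitute the actual PlantMarkerBench identifier
# from its HuggingFace page.
ds = load_dataset("org/PlantMarkerBench")

# Inspect one record (field names are assumptions, per the schema sketch above).
print(ds["train"][0])
```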
