SciDraw-Bench: domain-specific AI outperforms general models on scientific diagrams
New arXiv benchmark tests text-to-image models on 32 scientific-figure tasks across eight diagram types, revealing that domain-specific systems outperform general-purpose models on label fidelity and convention adherence.

Researchers have introduced SciDraw-Bench, a benchmark designed to evaluate whether generative models can produce usable scientific figures — mechanism diagrams, experimental schematics, conceptual frameworks, and graphical abstracts. The preprint, posted to arXiv on June 30, argues that existing image-generation benchmarks measure photorealism and object counting but ignore what makes a scientific figure work: correct text labels, faithful entity relationships, coherent structure, and adherence to disciplinary drawing conventions.
The benchmark comprises 32 structured tasks spanning eight figure types and ten disciplines. Each task pairs a natural-language prompt with a machine-checkable specification listing required labels, relations, components, conventions, and negative constraints. The evaluation protocol scores four dimensions: Text Fidelity (OCR-based label recall and character error rate), Semantic Correctness (vision-language-model judging against the specification), Structural Quality, and Convention Adherence.
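To make the Text Fidelity dimension concrete, here is a minimal Python sketch of how OCR-based label recall and character error rate could be computed against a task specification. It is an illustration under stated assumptions, not the benchmark's actual implementation: the `TaskSpec` fields, the function names, the exact-match rule for recall, and the lowercase/whitespace normalization are all hypothetical choices, and the OCR step is assumed to have already produced a list of text strings.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task specification; field names are illustrative only."""
    prompt: str
    required_labels: list[str]
    negative_constraints: list[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the normalized reference."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def label_recall(spec: TaskSpec, ocr_strings: list[str]) -> float:
    """Fraction of required labels found verbatim (after normalization) in OCR output."""
    if not spec.required_labels:
        return 1.0
    found = {normalize(s) for s in ocr_strings}
    hits = sum(1 for label in spec.required_labels if normalize(label) in found)
    return hits / len(spec.required_labels)


# Example with made-up data: one of three labels is misspelled in the image.
spec = TaskSpec(
    prompt="Diagram of ATP synthesis in a mitochondrion",
    required_labels=["ATP", "Mitochondrion", "Electron transport chain"],
)
ocr_output = ["ATP", "Mitochondrion", "Electron transprt chain"]
recall = label_recall(spec, ocr_output)
cer = char_error_rate("Electron transport chain", "Electron transprt chain")
```

In this made-up example, two of the three required labels are matched exactly, and the misspelled label differs from its reference by a single character. A real scorer would likely need fuzzy matching between OCR strings and labels, which this sketch leaves out.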
Pilot results
In the pilot evaluation, a domain-specific system called SciDraw AI substantially outperformed representative general-purpose text-to-image models on every dimension and across all eight figure types. The largest gaps appeared in semantic correctness and convention adherence, the latter being the ability to follow disciplinary norms for arrows, labels, and layout. Text fidelity remained the hardest dimension for all systems; even the best performer struggled with OCR-verifiable label accuracy.
The authors outline a code-to-figure baseline as a planned extension, suggesting that programmatic generation may offer a floor for correctness that pixel-space diffusion models have yet to match. A meta-evaluation protocol and preliminary inter-judge reliability analysis accompany the benchmark; human-rating validation is ongoing.




