CiteVQA exposes 53.5-point gap between open and closed models on document citation accuracy
New benchmark forces multimodal LLMs to cite bounding-box evidence for every answer, revealing that even top models land on correct answers while pointing to the wrong source text.

CiteVQA, a document-intelligence benchmark released this week, requires multimodal large language models (MLLMs) to return element-level bounding-box citations alongside each answer. The dataset comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, with documents averaging 40.6 pages. Ground-truth citations are generated by an automated masking-ablation pipeline and validated through expert review.
Current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage. In high-stakes domains like law, finance, and medicine, every conclusion must be traceable to a specific source region. A contract-review system that extracts the right termination date but cites the wrong clause creates liability rather than value.
The benchmark introduces Strict Attributed Accuracy (SAA), a metric that credits a prediction only when both the answer and the cited region are correct. Testing 20 MLLMs revealed what the authors call Attribution Hallucination: models frequently produce the right answer while citing the wrong passage. Gemini-3.1-Pro-Preview, the strongest closed system, achieves an SAA of 76.0 percent, meaning roughly one in four of its predictions misses on the answer, the cited region, or both. The strongest open-source MLLM reaches just 22.5 percent, a 53.5-point gap that underscores how far open models lag on this dimension. The benchmark and evaluation code are available at github.com/opendatalab/CiteVQA.
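To make the scoring rule concrete, here is a minimal Python sketch of an SAA-style metric. It is not the benchmark's evaluation code. The `Example` record, the page-plus-box citation format, and the IoU threshold of 0.5 used to decide whether a cited region is "correct" are all assumptions made for illustration; the article does not say how CiteVQA matches predicted regions to ground truth.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Example:
    """One scored question (illustrative structure, not CiteVQA's schema)."""
    answer_correct: bool  # did the model's answer match the reference?
    pred_page: int        # page the model cited
    pred_box: Box         # region the model cited
    gold_page: int        # ground-truth page
    gold_box: Box         # ground-truth region


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def citation_correct(ex: Example, iou_threshold: float = 0.5) -> bool:
    """Assumed matching rule: same page and sufficient box overlap."""
    return ex.pred_page == ex.gold_page and iou(ex.pred_box, ex.gold_box) >= iou_threshold


def answer_accuracy(examples: List[Example]) -> float:
    """Answer-only score: the citation is ignored."""
    return sum(ex.answer_correct for ex in examples) / len(examples)


def strict_attributed_accuracy(examples: List[Example], iou_threshold: float = 0.5) -> float:
    """Credit a prediction only when the answer AND the cited region are both correct."""
    hits = sum(ex.answer_correct and citation_correct(ex, iou_threshold) for ex in examples)
    return hits / len(examples)


examples = [
    Example(True, 3, (0, 0, 10, 10), 3, (0, 0, 10, 10)),   # right answer, right region
    Example(True, 7, (0, 0, 10, 10), 3, (0, 0, 10, 10)),   # right answer, wrong page
    Example(False, 3, (0, 0, 10, 10), 3, (0, 0, 10, 10)),  # wrong answer, right region
    Example(True, 3, (8, 8, 18, 18), 3, (0, 0, 10, 10)),   # right answer, barely overlapping box
]

print(answer_accuracy(examples))             # 0.75
print(strict_attributed_accuracy(examples))  # 0.25
```

In this toy set, three of four answers are right but only one is also grounded in the right region, so the answer-only score is 0.75 while the strict score is 0.25. That difference is the kind of failure an answer-only evaluation cannot see.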