Scientific documents contain complex multimodal structures, which makes evidence localization and scientific reasoning in Document Visual Question Answering particularly challenging. However, most existing benchmarks evaluate models only at the page level without explicitly annotating the evidence regions that support the answer, which limits both interpretability and the reliability of evaluation. To address this limitation, we introduce SciEGQA, a scientific document question answering and reasoning dataset with semantic evidence grounding, where supporting evidence is represented as semantically coherent document regions annotated with bounding boxes. SciEGQA consists of two components: a **human-annotated fine-grained benchmark** containing 1,623 high-quality question-answer pairs, and a **large-scale automatically constructed training set** with over 30K QA pairs generated through an automated data construction pipeline. Extensive experiments on a wide range of Vision-Language Models (VLMs) show that existing models still struggle with evidence localization and evidence-based question answering in scientific documents. Training on the proposed dataset significantly improves the scientific reasoning capabilities of VLMs. The project page is available at https://yuwenhan07.github.io/SciEGQA-project/.
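As an illustration of what evidence grounding with bounding boxes can look like in practice, the following is a minimal sketch, not the SciEGQA format or evaluation protocol. The abstract does not specify a record schema or a localization metric, so the field names (`question`, `answer`, `page`, `evidence`), the sample values, and the use of intersection-over-union (IoU) to compare a predicted region with an annotated one are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: (x0, y0) is the top-left corner, (x1, y1) the bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter = Box(max(a.x0, b.x0), max(a.y0, b.y0),
                min(a.x1, b.x1), min(a.y1, b.y1)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


# Hypothetical QA record with one annotated evidence region (assumed schema).
record = {
    "question": "Which method has the highest accuracy in the results table?",
    "answer": "Method B",
    "page": 4,
    "evidence": Box(100, 200, 500, 400),
}

# Hypothetical model prediction for the supporting region.
predicted = Box(120, 210, 500, 420)
score = iou(record["evidence"], predicted)
print(f"IoU between annotated and predicted evidence: {score:.3f}")
```

A prediction could then be counted as correctly localized when its IoU with the annotated region exceeds a chosen threshold; the threshold itself is a design choice that would need to come from the paper.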