VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering

Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), the task of answering questions about figures from scientific papers. A key bottleneck is the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in the resulting QA pairs, stemming from LVLMs' inherent limitations and from the information asymmetry between figures and text. To address these challenges, we propose a cross-modal verification framework that generates questions and answers purely from figure-citing paragraphs and then verifies them against the figures themselves, leveraging the inherent text-figure alignment in scientific papers to filter out erroneous QA pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,272 QA pairs spanning 20 scientific domains and 12 figure types. Difficulty assessment reveals a notable accuracy gap between the best open-source model (65%) and the best proprietary model (80.5%), indicating room for improvement. Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with gains that scale with data size and surpass those of models trained on existing datasets. Human evaluation further validates the improved quality of VeriSciQA. These results demonstrate that continued data expansion via our scalable framework can further advance SVQA capability in the open-source community. Our dataset is publicly available at https://huggingface.co/datasets/datajuicer/VeriSciQA.
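
The generate-then-verify idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `QAPair`, `cross_modal_filter`, `toy_generate`, and `toy_verify` are hypothetical, and the two toy functions stand in for what would in practice be model calls (a text-only generator and a figure-aware verifier).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class QAPair:
    figure_id: str  # hypothetical identifier of the figure the paragraph cites
    question: str
    answer: str


# Hypothetical interfaces; real versions would wrap model calls.
Generator = Callable[[str, str], List[QAPair]]  # (figure_id, paragraph) -> candidates
Verifier = Callable[[QAPair], bool]             # True if the figure supports the pair


def cross_modal_filter(
    paragraphs: Iterable[Tuple[str, str]],
    generate: Generator,
    verify: Verifier,
) -> List[QAPair]:
    """Generate candidates from text only, then keep those the figure check accepts."""
    kept: List[QAPair] = []
    for figure_id, paragraph in paragraphs:
        for pair in generate(figure_id, paragraph):
            if verify(pair):
                kept.append(pair)
    return kept


# Toy stand-ins (assumptions for illustration only).
FIGURE_FACTS: Dict[str, str] = {"fig1": "increasing", "fig2": "flat"}


def toy_generate(figure_id: str, paragraph: str) -> List[QAPair]:
    # Treats the paragraph text itself as the candidate answer.
    return [QAPair(figure_id, f"What trend does {figure_id} show?", paragraph)]


def toy_verify(pair: QAPair) -> bool:
    # Accepts a pair only if the stored figure content agrees with the answer.
    return FIGURE_FACTS.get(pair.figure_id) == pair.answer


if __name__ == "__main__":
    cited = [("fig1", "increasing"), ("fig2", "decreasing")]
    for pair in cross_modal_filter(cited, toy_generate, toy_verify):
        print(pair.figure_id, pair.answer)
```

Keeping generation and verification as separate callables mirrors the separation of modalities in the description: the generator never sees the figure, so the verifier acts as an independent check rather than a restatement of the same evidence.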
