Scientific figure interpretation is a crucial capability for AI-driven scientific assistants built on advanced Large Vision-Language Models. However, current datasets and benchmarks focus primarily on simple charts or other relatively straightforward figures from a limited range of scientific domains. To address this gap, we present a comprehensive dataset compiled from peer-reviewed Nature Communications articles spanning 72 scientific fields, encompassing complex visualizations such as schematic diagrams, microscopic images, and experimental data that require graduate-level expertise to interpret. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Our analysis reveals significant task difficulty and notable performance gaps among models. Beyond serving as a benchmark, the dataset is also a valuable resource for large-scale training: fine-tuning Qwen2-VL-7B on our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations, and continued pre-training on our interleaved article and figure data substantially enhanced the model's downstream task performance in materials science. We have released our dataset to support further research.