Scientific figure interpretation is a crucial capability for AI-driven scientific assistants built on advanced Large Vision-Language Models. However, current datasets and benchmarks focus primarily on simple charts or other relatively straightforward figures from a limited range of scientific domains. To address this gap, we present a comprehensive dataset compiled from peer-reviewed Nature Communications articles spanning 72 scientific fields, encompassing complex visualizations such as schematic diagrams, microscopic images, and experimental data that require graduate-level expertise to interpret. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Our analysis reveals significant task difficulty and notable performance gaps among models. Beyond serving as a benchmark, the dataset is also a valuable resource for large-scale training: fine-tuning Qwen2-VL-7B on our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations, and continued pre-training on our interleaved article and figure data substantially enhanced the model's downstream task performance in materials science. We have released our dataset to support further research.