We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) along three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: using complex images with an average of 14.3 panels per sample to evaluate MLLMs' ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessing qualitative and quantitative reasoning across five experimental paradigms to determine whether models can infer conclusions from evidence as human experts do. A comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.