The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application to scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark that thoroughly assesses LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of the knowledge required to solve it, i.e., Knowledge-free, Knowledge-lite, and Knowledge-rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and Vision-only, progressively shifting question information from the text into diagrams. By comparing results across versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy that conducts a step-wise assessment of knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights for future development. Project page: https://sciverse-cuhk.github.io