Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark. Our main benchmark consists of a 1000-question gold set of multiple-choice questions split between two tasks across 12 categories. The questions are curated from CS arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 26 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.