Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark consisting of 2,000 questions split across two tasks and eight categories. The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.
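The adversarial filtering step mentioned above can be sketched as follows. This is a minimal illustration under the assumption that hard negatives are mined by ranking candidate captions by embedding similarity to the figure; the function name and toy embeddings are hypothetical, not the paper's exact pipeline:

```python
import numpy as np

def select_hard_negatives(query_emb, candidate_embs, k=4):
    """Rank candidate caption embeddings by cosine similarity to the
    figure (query) embedding; the top-k most similar non-matching
    captions serve as hard negatives (hypothetical sketch)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity of each candidate to the query
    return np.argsort(sims)[::-1][:k]  # indices, most similar first

# Toy example with 2-D embeddings:
query = np.array([1.0, 0.0])
candidates = np.array([
    [0.9, 0.1],   # very similar -> hardest negative
    [0.0, 1.0],   # orthogonal  -> easy negative
    [0.7, 0.3],   # fairly similar
])
print(select_hard_negatives(query, candidates, k=2))  # -> [0 2]
```

Mining negatives this way makes the multiple-choice distractors semantically close to the true caption, so the question cannot be answered from surface cues alone.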