Multi-modal information retrieval (MMIR) is a rapidly evolving field in which significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, existing benchmarks for evaluating image-text MMIR leave a notable gap in the scientific domain: chart and table images described in scholarly language rarely feature in them. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. The benchmark comprises 530K meticulously curated image-text pairs extracted from figures and tables with detailed captions in scientific documents. We further label each pair with a two-level subset-subcategory hierarchy to enable a more comprehensive evaluation of the baselines. We conduct zero-shot and fine-tuned evaluations of prominent multi-modal image-captioning and vision-language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at https://github.com/Wusiwei0410/SciMMIR.
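As a rough illustration of the zero-shot setting mentioned above, the sketch below ranks candidate captions for a single figure image by CLIP image-text similarity. This is a minimal sketch, not the paper's evaluation pipeline: the checkpoint name (`openai/clip-vit-base-patch32`) and the rank-by-similarity protocol are illustrative assumptions.

```python
# Minimal sketch: zero-shot image-to-text retrieval scoring with CLIP.
# Assumptions (not from the paper): the HuggingFace checkpoint below and
# ranking candidates by the model's image-text similarity logits.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_captions(image: Image.Image, captions: list[str]) -> list[int]:
    """Return caption indices sorted by descending image-text similarity."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_captions): one score per candidate.
    scores = outputs.logits_per_image.squeeze(0)
    return scores.argsort(descending=True).tolist()

# Hypothetical usage: rank candidate scientific captions for one figure.
# image = Image.open("figure.png")
# order = rank_captions(image, ["Caption A ...", "Caption B ..."])
```

The same scoring loop, run over a full caption pool per image (and symmetrically over an image pool per caption), yields standard retrieval metrics such as recall@k for both retrieval directions.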