This paper proposes the Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents in order to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps to alleviate the modality gap between images and texts. Specifically, we enable image understanding in a well-trained dense retriever, T5-ANCE, by feeding the image features encoded by the visual module into it as inputs. To facilitate multi-modal retrieval tasks, we build the ClueWeb22-MM dataset on top of the ClueWeb22 dataset: it treats anchor texts as queries and extracts the related text and image documents from the web pages that the anchors link to. Our experiments show that MARVEL significantly outperforms the state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. Our further analyses show that the visual module plugin method is well suited to adding image understanding to an existing dense retrieval model. In addition, we show that the language model is able to extract image semantics from image encoders and to adapt the image features to the input space of language models. All code is available at https://github.com/OpenMatch/MARVEL.
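As an illustration of the plugin idea described in the abstract, the following is a minimal sketch rather than the released implementation. It assumes that the image features are mapped into the text encoder's input space by a single linear projection and that the encoder can be approximated by mean pooling; both are simplifications made here, and the names `linear`, `mean_pool`, `encode`, and `cosine` are hypothetical. In MARVEL itself the encoder is T5-ANCE and the image features come from a visual module.

```python
import math
import random


def linear(vec, weight):
    """Project a vector of length d_in with a (d_in x d_out) weight matrix."""
    d_out = len(weight[0])
    return [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(d_out)]


def mean_pool(rows):
    """Average a list of equal-length vectors (toy stand-in for the encoder)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def encode(text_embeddings, image_features=None, projection=None):
    """Encode text, optionally with image features, into one embedding.

    Assumption: image features are projected into the text embedding space
    and prepended to the text token embeddings before encoding.
    """
    rows = list(text_embeddings)
    if image_features is not None:
        rows = [linear(patch, projection) for patch in image_features] + rows
    return mean_pool(rows)


# Toy usage with random vectors; the dimensions are arbitrary.
rng = random.Random(0)
D_IMG, D_TXT = 4, 3
projection = [[rng.uniform(-1, 1) for _ in range(D_TXT)] for _ in range(D_IMG)]
query_tokens = [[rng.uniform(-1, 1) for _ in range(D_TXT)] for _ in range(2)]
caption_tokens = [[rng.uniform(-1, 1) for _ in range(D_TXT)] for _ in range(3)]
image_patches = [[rng.uniform(-1, 1) for _ in range(D_IMG)] for _ in range(2)]

query_vec = encode(query_tokens)
doc_vec = encode(caption_tokens, image_patches, projection)
score = cosine(query_vec, doc_vec)
```

Because queries and documents pass through the same `encode` function, text-only and image-bearing inputs land in one embedding space, which is the property the unified encoder is meant to provide.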