This paper proposes the Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents in order to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps to alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of a well-trained dense retriever, T5-ANCE, by incorporating the image features encoded by the visual module as its inputs. To facilitate multi-modal retrieval tasks, we build the ClueWeb22-MM dataset on top of the ClueWeb22 dataset; it treats anchor texts as queries and extracts the related text and image documents from the web pages that the anchors link to. Our experiments show that MARVEL significantly outperforms state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. Our further analyses show that the visual module plugin method is tailored to enabling image understanding in an existing dense retrieval model. We also show that the language model is able to extract image semantics from image encoders and adapt the image features to the input space of language models. All code is available at https://github.com/OpenMatch/MARVEL.
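To make the plugin idea concrete, the following is a minimal sketch of one way image features could be fed to a text encoder as additional inputs. It is not taken from the MARVEL repository. The linear projection, the choice to prepend image features to the text token embeddings, the mean pooling that stands in for the encoder, the dot-product score, and all names and dimensions (`build_encoder_input`, `proj_weight`, `d_img`, `d_txt`) are assumptions made for illustration only.

```python
import random


def linear(x, weight):
    """Multiply row vectors x (n x d_in) by weight (d_in x d_out)."""
    return [
        [sum(xi * w for xi, w in zip(row, col)) for col in zip(*weight)]
        for row in x
    ]


def build_encoder_input(text_embs, image_feats, proj_weight):
    """Project image features into the text embedding space and prepend
    them to the text token embeddings (assumed ordering), so that a single
    encoder sees one sequence covering both modalities."""
    projected = linear(image_feats, proj_weight)
    return projected + text_embs


def mean_pool(seq):
    """Stand-in for the encoder: average the sequence into one vector.
    A real dense retriever would run a Transformer here instead."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


def dot(u, v):
    """Dot-product relevance score between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


# Toy dimensions, chosen only for the demo (hypothetical values).
rng = random.Random(0)
d_img, d_txt = 6, 4


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


proj_weight = rand_matrix(d_img, d_txt)  # would be learned in practice
image_feats = rand_matrix(3, d_img)      # 3 image patch features
caption_embs = rand_matrix(5, d_txt)     # 5 text token embeddings
query_embs = rand_matrix(2, d_txt)       # 2 query token embeddings

doc_input = build_encoder_input(caption_embs, image_feats, proj_weight)
score = dot(mean_pool(query_embs), mean_pool(doc_input))
print(len(doc_input), "input positions; score =", score)
```

In this sketch the only new parameters are those of the projection, which mirrors the idea of plugging a visual module into an existing retriever; how MARVEL actually trains and combines the components is described in the paper and its code release.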