This paper proposes the Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns a shared embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of a well-trained dense retriever, T5-ANCE, by incorporating the visual module's encoded image features as its inputs. To facilitate multi-modal retrieval research, we build the ClueWeb22-MM dataset based on the ClueWeb22 dataset, which regards anchor texts as queries and extracts the related text and image documents from anchor-linked web pages. Our experiments show that MARVEL significantly outperforms state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. MARVEL provides an opportunity to extend the advantages of text retrieval to the multi-modal scenario. In addition, we illustrate that the language model has the ability to extract image semantics and partly map the image features into its input word embedding space. All code is available at https://github.com/OpenMatch/MARVEL.