MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin

This paper proposes the Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps alleviate the modality gap between images and texts. Specifically, we enable image understanding in the well-trained dense retriever T5-ANCE by feeding it the image features encoded by the visual module as inputs. To facilitate multi-modal retrieval tasks, we build the ClueWeb22-MM dataset on top of ClueWeb22; it treats anchor texts as queries and extracts the related text and image documents from anchor-linked web pages. Our experiments show that MARVEL significantly outperforms state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. MARVEL provides an opportunity to extend the advantages of text retrieval to the multi-modal scenario. We also show that the language model can extract image semantics and partly map image features into the input word embedding space. All code is available at https://github.com/OpenMatch/MARVEL.
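The plugin idea described in the abstract can be illustrated with a toy sketch. It is not the released MARVEL code: the dimensions, the random projection, and the mean-pooling `encode` function are all assumptions standing in for the visual module, the trained projection, and the T5-ANCE encoder. It only shows the data flow: image features are projected into the word-embedding space, concatenated with text token embeddings, and passed through the same encoder that handles queries.

```python
import random

VISUAL_DIM = 8   # hypothetical size of a visual-module feature vector
EMBED_DIM = 4    # hypothetical size of the retriever's word embeddings


def make_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or extracted features."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Map each visual feature (VISUAL_DIM) into the word-embedding space (EMBED_DIM)."""
    out_dim = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(out_dim)]
        for f in features
    ]


def build_document_input(image_features, caption_embeddings, projection):
    """Prepend projected image features to the text token embeddings."""
    return project(image_features, projection) + caption_embeddings


def encode(token_embeddings):
    """Stand-in for the shared encoder: mean-pool the inputs (an assumption)."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[j] for tok in token_embeddings) / n for j in range(dim)]


def score(query_vec, doc_vec):
    """Dot-product relevance between a query and a document embedding."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


rng = random.Random(0)
projection = make_matrix(VISUAL_DIM, EMBED_DIM, rng)       # plugin projection
query_embeddings = make_matrix(3, EMBED_DIM, rng)          # 3 query tokens
image_features = make_matrix(5, VISUAL_DIM, rng)           # 5 image patches
caption_embeddings = make_matrix(2, EMBED_DIM, rng)        # 2 caption tokens

doc_input = build_document_input(image_features, caption_embeddings, projection)
relevance = score(encode(query_embeddings), encode(doc_input))
print(len(doc_input), relevance)
```

A text-only document follows the same path with an empty list of image features, which is the sense in which one encoder serves both modalities in this sketch.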

