Unlock Multi-Modal Capability of Dense Retrieval via Visual Module Plugin

This paper proposes the Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps to alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of a well-trained dense retriever, T5-ANCE, by incorporating the image features encoded by the visual module as its inputs. To facilitate multi-modal retrieval tasks, we build the ClueWeb22-MM dataset on top of the ClueWeb22 dataset; it regards anchor texts as queries and extracts the related text and image documents from the anchor-linked web pages. Our experiments show that MARVEL significantly outperforms the state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. Our further analyses show that the visual module plugin method is tailored to enable the image understanding ability of an existing dense retrieval model. We also show that the language model is able to extract image semantics from image encoders and adapt the image features in the input space of language models. All code is available at https://github.com/OpenMatch/MARVEL.
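
The plugin idea described above can be illustrated with a minimal sketch. It is not the MARVEL implementation: the dimensions, the random weights, the choice to place image features before the text tokens, and the mean-pooling `encode` function (a stand-in for the shared T5-ANCE encoder) are all assumptions made for illustration. It only shows the general pattern of projecting visual features into the text embedding space, feeding them to the same encoder as the text, and scoring queries against documents in one embedding space.

```python
import math
import random


def linear(vec, weight):
    """Apply a linear projection; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def build_input_sequence(text_embs, image_feats, proj_weight):
    """Project image features to the text embedding size and join them with
    the text token embeddings (image-first order is an assumption)."""
    projected = [linear(feat, proj_weight) for feat in image_feats]
    return projected + text_embs


def encode(sequence):
    """Hypothetical stand-in for the shared encoder: mean-pool, then L2-normalize."""
    dim = len(sequence[0])
    pooled = [sum(tok[i] for tok in sequence) / len(sequence) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def score(query_vec, doc_vec):
    """Dot product of normalized vectors, i.e. cosine similarity."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def random_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
TEXT_DIM, IMAGE_DIM = 8, 4  # toy sizes, not the real model dimensions

# Projection from the visual feature space to the text embedding space.
proj_weight = random_vectors(rng, TEXT_DIM, IMAGE_DIM)

query_tokens = random_vectors(rng, 3, TEXT_DIM)    # embedded query text
caption_tokens = random_vectors(rng, 4, TEXT_DIM)  # embedded document text
patch_feats = random_vectors(rng, 2, IMAGE_DIM)    # visual module output

doc_sequence = build_input_sequence(caption_tokens, patch_feats, proj_weight)
query_vec = encode(query_tokens)
doc_vec = encode(doc_sequence)
similarity = score(query_vec, doc_vec)
print(f"query-document similarity: {similarity:.4f}")
```

Because queries and multi-modal documents pass through the same `encode` function, text-only and image-bearing documents end up in one embedding space and can be ranked together, which is the property the abstract attributes to the unified encoder.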

