Existing multimodal retrieval systems often rely on disjointed models for image comprehension, such as object detectors and caption generators, leading to cumbersome implementations and training processes. To overcome this limitation, we propose an end-to-end retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries via dynamic modality interaction. Ret-XKnow leverages a partial convolution mechanism to focus on visual information relevant to the given textual query, thereby enhancing multimodal query representations. To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval (ViD2R) dataset automatically constructed from visual dialogue datasets. Our dataset construction process ensures that the dialogues are transformed into suitable information retrieval tasks using a text retriever. We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios. Our code is publicly available: https://github.com/yeongjoonJu/Ret_XKnow.
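The abstract mentions a partial convolution mechanism that attends only to visual features deemed relevant to the textual query. The paper itself defines the exact formulation; as a rough illustration only, the sketch below shows the generic partial-convolution idea (convolve over masked-in positions and renormalize by the number of valid entries under the kernel) applied to a 1-D sequence of visual tokens. The function name `partial_conv1d`, the binary relevance mask, and all shapes are assumptions for illustration, not Ret-XKnow's actual implementation.

```python
import numpy as np

def partial_conv1d(features: np.ndarray, mask: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Illustrative partial convolution over a token sequence.

    features: (T, D) visual token features.
    mask:     (T,) binary relevance mask (1 = relevant to the text query).
    kernel:   (k,) 1-D convolution weights, shared across feature dims.
    """
    T, D = features.shape
    k = len(kernel)
    pad = k // 2
    # Zero out irrelevant tokens, then pad sequence and mask for "same" output length.
    f = np.pad(features * mask[:, None], ((pad, pad), (0, 0)))
    m = np.pad(mask.astype(float), (pad, pad))
    out = np.zeros((T, D))
    for t in range(T):
        window_f = f[t:t + k]                 # (k, D) masked features in the window
        window_m = m[t:t + k]                 # (k,) validity of each window position
        valid = float((kernel * window_m).sum())
        if valid > 0:
            # Renormalize by valid mass so output scale is independent of
            # how many relevant tokens fall under the kernel.
            out[t] = (kernel[:, None] * window_f).sum(axis=0) * (kernel.sum() / valid)
    return out
```

With a constant input, the renormalization keeps the response uniform regardless of which tokens the mask drops, which is the property that lets the convolution ignore query-irrelevant regions without distorting the remaining representations.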