Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing multimodal retrieval studies focus on capturing the information in multimodal data that is similar to the paired text, but often ignore the complementary information that multimodal data contains. In this study, we propose CIEA, a novel multimodal retrieval approach based on Complementary Information Extraction and Alignment, which maps both the text and the images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve the differences carried by the image representations. We optimize CIEA with two complementary contrastive losses that preserve semantic integrity while effectively capturing the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at https://github.com/zengdlong/CIEA.
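To make the training objective concrete, below is a minimal sketch of how two complementary contrastive losses of the kind described above could be combined. This is an illustrative assumption, not CIEA's actual implementation: the helper `info_nce`, the embeddings `text_emb`, `image_emb`, `comp_emb`, and the weight `lambda_comp` are all hypothetical names introduced here for exposition.

```python
# Illustrative sketch (PyTorch): one loss aligns text and image semantics,
# a second loss keeps the extracted "complementary" representation grounded
# in the image. All names and the weighting scheme are assumptions.
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """In-batch InfoNCE: the i-th query should match the i-th key."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy batch of embeddings in a shared latent space.
B, d = 8, 256
text_emb = torch.randn(B, d)    # document text representations
image_emb = torch.randn(B, d)   # document image representations
comp_emb = torch.randn(B, d)    # hypothetical complementary information extracted from images

align_loss = info_nce(text_emb, image_emb)  # keep text and image semantics aligned
comp_loss = info_nce(comp_emb, image_emb)   # keep extracted differences tied to the image
lambda_comp = 0.5                           # assumed trade-off weight between the two losses
total_loss = align_loss + lambda_comp * comp_loss
print(float(total_loss))
```

The two terms pull in complementary directions: the alignment loss enforces semantic consistency across modalities, while the second loss prevents the extracted image-specific signal from drifting away from the image content; how CIEA actually formulates and weights these objectives is detailed in the paper and the released code.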