Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarity. Prior work usually focuses on pairwise relations (i.e., whether one data sample matches another) but ignores higher-order neighbor relations (i.e., the matching structure among multiple data samples). Re-ranking, a popular post-processing practice, has shown the benefit of capturing neighbor relations in single-modality retrieval tasks. However, directly extending existing re-ranking algorithms to image-text retrieval is ineffective. In this paper, we analyze the reasons from four perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and propose a novel learnable pillar-based re-ranking paradigm. Concretely, we first select top-ranked intra- and inter-modal neighbors as pillars, and then reconstruct each data sample from the neighbor relations between it and the pillars. In this way, each sample can be mapped into a multimodal pillar space using only similarities, which ensures generalization. We then design a neighbor-aware graph reasoning module to flexibly exploit these relations and excavate the sparse positive items within a neighborhood. We also present a structure alignment constraint to promote cross-modal collaboration and align the asymmetric modalities. We carry out extensive experiments with various backbones on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, generalization, and transferability of the proposed re-ranking paradigm.
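To make the pillar idea concrete, the following is a minimal sketch of the pillar selection and encoding step for an image query, under our own assumptions rather than as the paper's implementation. The function names `top_k` and `pillar_encode`, the argument layout, and the choice of the same `k` for both modalities are hypothetical. The learnable graph reasoning module and the structure alignment constraint are not shown.

```python
from typing import List, Sequence, Tuple


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first (ties keep original order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def pillar_encode(
    q_img: Sequence[float],            # query-to-image similarities (intra-modal)
    q_txt: Sequence[float],            # query-to-text similarities (inter-modal)
    img_txt: Sequence[Sequence[float]],  # image-to-text similarities, [image][text]
    txt_txt: Sequence[Sequence[float]],  # text-to-text similarities, [text][text]
    k: int,
) -> Tuple[List[int], List[int], List[float], List[List[float]]]:
    """Map an image query and every candidate text into a shared pillar space.

    Assumption: the pillars are simply the top-k images and top-k texts
    ranked by their similarity to the query.
    """
    img_pillars = top_k(q_img, k)  # intra-modal pillars
    txt_pillars = top_k(q_txt, k)  # inter-modal pillars

    # The query is described by its similarities to the pillars.
    query_vec = [q_img[i] for i in img_pillars] + [q_txt[j] for j in txt_pillars]

    # Each candidate text is described by its similarities to the same pillars.
    cand_vecs = [
        [img_txt[i][t] for i in img_pillars] + [txt_txt[t][j] for j in txt_pillars]
        for t in range(len(q_txt))
    ]
    return img_pillars, txt_pillars, query_vec, cand_vecs
```

Because the encoding consumes only similarity scores, it does not depend on how the underlying backbone computes them, which is the property the abstract refers to as generalization.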