Image-text retrieval, whose core task is to measure the similarity between images and text, is a widely studied topic in computer vision, driven by the exponential growth of multimedia data. However, most existing retrieval methods rely heavily on cross-attention mechanisms for fine-grained cross-modal alignment. Such mechanisms take excessive irrelevant regions into account and treat prominent and non-significant words equally, which limits retrieval accuracy. This paper investigates an alignment approach that reduces the involvement of non-significant fragments in images and text while enhancing the alignment of prominent fragments. To this end, we introduce the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which improves retrieval accuracy by diminishing the participation of irrelevant regions during alignment and relatively increasing the alignment similarity of prominent words. Additionally, we incorporate prior textual information into image regions to reduce misalignment. In practice, we first design a novel intra-modal fragment relationship reasoning method and then apply our proposed alignment mechanism to compute the similarity between images and text. Extensive quantitative comparisons on the MS-COCO and Flickr30K datasets show that our approach outperforms state-of-the-art methods by about 5% to 10% on the rSum metric.
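The abstract reports results in rSum without defining it. In the image-text retrieval literature, rSum is conventionally the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image), giving a maximum of 600. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes a square similarity matrix with one caption per image and the matching pair on the diagonal, which simplifies the MS-COCO and Flickr30K protocols where each image has five captions. The function names `recall_at_k` and `rsum` are illustrative.

```python
import numpy as np


def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K in percent for queries along the rows of `sim`.

    Assumption: sim[i, j] scores query i against candidate j, and the
    correct candidate for query i is candidate i (one-to-one pairing).
    """
    n = sim.shape[0]
    idx = np.arange(n)
    # Score of the ground-truth candidate for each query, as a column vector.
    gt_scores = sim[idx, idx][:, None]
    # Zero-based rank: how many candidates score strictly higher than the match.
    ranks = (sim > gt_scores).sum(axis=1)
    return [100.0 * float((ranks < k).mean()) for k in ks]


def rsum(sim):
    """Sum of R@1, R@5, R@10 for both retrieval directions (maximum 600)."""
    i2t = recall_at_k(sim)    # rows are image queries, columns are texts
    t2i = recall_at_k(sim.T)  # rows are text queries, columns are images
    return sum(i2t) + sum(t2i)
```

For example, a similarity matrix whose largest entry in every row and column lies on the diagonal retrieves every match at rank 1 in both directions, so `rsum` returns the maximum value of 600.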