Image-text retrieval, whose core task is to measure the similarity between images and text, is a widely studied topic in computer vision, driven by the exponential growth of multimedia data. However, most existing retrieval methods rely heavily on cross-attention mechanisms for fine-grained cross-modal alignment, which take into account too many irrelevant regions and treat prominent and non-significant words equally, thereby limiting retrieval accuracy. This paper investigates an alignment approach that reduces the involvement of non-significant fragments in images and text while strengthening the alignment of prominent fragments. To this end, we introduce the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which improves retrieval accuracy by diminishing the participation of irrelevant regions during alignment and relatively increasing the alignment similarity of prominent words. Additionally, we incorporate prior textual information into image regions to reduce misalignment. In practice, we first design a novel intra-modal fragment relationship reasoning method and then employ our proposed alignment mechanism to compute the similarity between images and text. Extensive quantitative comparative experiments on the MS-COCO and Flickr30K datasets demonstrate that our approach outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.
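For readers unfamiliar with the rSum metric, it is conventionally the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image), so its maximum is 600. The following is a minimal sketch of that computation, not the evaluation code used in the paper. It assumes a square similarity matrix in which query i matches candidate i, whereas MS-COCO and Flickr30K pair each image with five captions. The function names `recall_at_k` and `rsum` are hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose matching item ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: one-to-one pairing, i.e. query i matches candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scoring strictly higher than the true match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def rsum(sim, ks=(1, 5, 10)):
    """Sum of Recall@K over both retrieval directions."""
    # Transposing swaps the roles of query and candidate.
    sim_t = [list(col) for col in zip(*sim)]
    i2t = sum(recall_at_k(sim, k) for k in ks)
    t2i = sum(recall_at_k(sim_t, k) for k in ks)
    return i2t + t2i


if __name__ == "__main__":
    # Toy image-by-text similarity matrix; the values are illustrative only.
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.1, 0.7],
    ]
    print(round(rsum(sim), 2))
```

In the toy matrix, the second image scores higher against the first text than against its own, so image-to-text Recall@1 drops below 100 while every other term stays at 100.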