In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency, and high-generalization image-sentence retrieval. 3SHNet highlights the prominent objects and their spatial locations within the visual modality, which allows it to integrate visual semantic-spatial interactions while keeping the two modalities independent. This integration combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, and the modality independence guarantees efficiency and generalization. Additionally, 3SHNet uses the structured contextual visual scene information from segmentation to provide local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments on the MS-COCO and Flickr30K benchmarks demonstrate the superior performance, inference efficiency, and generalization of the proposed 3SHNet compared with state-of-the-art methods. Specifically, on the larger MS-COCO 5K test set, we achieve rSum improvements of 16.3%, 24.8%, and 18.3%, respectively, over state-of-the-art methods that use different image representations, while maintaining optimal retrieval efficiency. Moreover, our performance on cross-dataset generalization improves by 18.6%. Data and code are available at https://github.com/XuriGe1995/3SHNet.
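The abstract reports results as rSum without defining it. In image-sentence retrieval, rSum is conventionally the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-sentence and sentence-to-image), so its maximum is 600. The following is a minimal sketch of that convention, not the evaluation code from the 3SHNet repository; the function names `recall_at_k` and `rsum`, the similarity-matrix layout, and the toy data are assumptions made for illustration.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with at least one ground-truth match in the top k.

    sim: list of rows, sim[q][j] = similarity of query q to item j (higher is better).
    gt:  list of sets, gt[q] = indices of the items that match query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank item indices by descending similarity for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if gt[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


def rsum(sim_i2t, gt_i2t, gt_t2i, ks=(1, 5, 10)):
    """Sum of Recall@K over both directions (assumed convention, max 600)."""
    # Transpose the image-by-sentence matrix to get sentence-by-image scores.
    sim_t2i = [list(col) for col in zip(*sim_i2t)]
    i2t = sum(recall_at_k(sim_i2t, gt_i2t, k) for k in ks)
    t2i = sum(recall_at_k(sim_t2i, gt_t2i, k) for k in ks)
    return i2t + t2i


# Toy data (hypothetical): 3 images, 3 sentences, image i matches sentence i.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],  # image 1 ranks the wrong sentence first
    [0.1, 0.2, 0.7],
]
toy_gt = [{0}, {1}, {2}]
print(round(rsum(toy_sim, toy_gt, toy_gt), 1))  # 566.7
```

In the toy example, one of the three image queries misses at rank 1, so image-to-sentence Recall@1 is 66.7 while every other term is 100. On MS-COCO each image has several matching captions, which is why the ground truth is modeled here as a set of indices per query rather than a single index.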