Image-text retrieval is one of the major tasks in cross-modal retrieval. Several approaches to this task map images and texts into a common space to establish correspondences between the two modalities. However, because images are semantically rich, redundant secondary information in an image can cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), that helps the model focus on an image's main content. The approach is inspired by how people typically annotate an image: by describing its main content. We therefore leverage the annotated texts corresponding to an image to help the model capture the image's main content, reducing the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at https://github.com/ZhangXu0963/VSL.
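To make the common-space idea concrete, the following is a minimal sketch of how matching in a shared embedding space is often scored, assuming cosine similarity as the matching function. It is not the VSL formulation, which the abstract does not specify. The function names and the toy embedding values are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_captions(image_embedding, caption_embeddings):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical values, not produced by any real encoder).
image = [0.9, 0.1, 0.0]
captions = [
    [0.0, 1.0, 0.0],  # aligned with a minor component of the image vector
    [1.0, 0.0, 0.0],  # aligned with the dominant component of the image vector
    [0.0, 0.0, 1.0],  # orthogonal to the image vector
]
ranking = rank_captions(image, captions)
```

In this toy setup the caption aligned with the dominant component of the image vector ranks first. In a real system, secondary image content can pull the image embedding toward other captions, which is the false-match problem the abstract describes.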