Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions and to embed them into a shared latent space for cross-modal information retrieval. Most existing VSE networks are trained with a hard negatives loss function, which learns an objective margin between the similarities of relevant and irrelevant image-description embedding pairs. However, this margin is set as a fixed hyperparameter, which ignores the semantic differences among irrelevant image-description pairs. To address the challenge of measuring the optimal similarities between image-description pairs before the VSE networks are trained, this paper presents a novel approach with two main parts: (1) a method for finding the underlying semantics of image descriptions; and (2) a novel semantically enhanced hard negatives loss function, in which the learning objective is dynamically determined from the optimal similarity scores between irrelevant image-description pairs. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks and applying them to three benchmark datasets for cross-modal information retrieval tasks. The results show that the proposed methods achieved the best performance and can be adopted by existing and future VSE networks.
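The contrast between a fixed margin and a semantically determined one can be illustrated with a minimal sketch. The hinge-based hard negatives loss below follows the commonly used form, in which only the hardest irrelevant item in each retrieval direction contributes. The helper names (`hard_negative_loss`, `fixed_margins`, `semantic_margins`) and, in particular, the rule that scales the margin by one minus the similarity between descriptions are assumptions made for illustration only; the abstract does not give the paper's actual formulation.

```python
def hard_negative_loss(sim, margin):
    """Sum of hardest-negative hinge terms in both retrieval directions.

    sim[i][j]    : similarity between image i and description j
                   (the diagonal holds the relevant pairs).
    margin[i][j] : margin required between the irrelevant pair (i, j)
                   and the relevant pair it is compared against.
    """
    n = len(sim)
    total = 0.0
    for k in range(n):
        pos = sim[k][k]
        # Hardest irrelevant description for image k.
        i2t = max(
            (max(0.0, margin[k][j] + sim[k][j] - pos) for j in range(n) if j != k),
            default=0.0,
        )
        # Hardest irrelevant image for description k.
        t2i = max(
            (max(0.0, margin[i][k] + sim[i][k] - pos) for i in range(n) if i != k),
            default=0.0,
        )
        total += i2t + t2i
    return total


def fixed_margins(n, alpha=0.2):
    # Conventional setting: one hyperparameter shared by every irrelevant pair.
    return [[alpha] * n for _ in range(n)]


def semantic_margins(desc_sim, alpha=0.2):
    # Hypothetical rule (not taken from the paper): the more similar two
    # descriptions are, the smaller the margin demanded between them.
    return [[alpha * (1.0 - s) for s in row] for row in desc_sim]


# Toy batch of two image-description pairs.
sim = [[0.9, 0.8],
       [0.3, 0.7]]
desc_sim = [[1.0, 0.5],
            [0.5, 1.0]]

loss_fixed = hard_negative_loss(sim, fixed_margins(2))
loss_semantic = hard_negative_loss(sim, semantic_margins(desc_sim))
```

In this toy batch the semantic margins give a smaller loss than the fixed margin, because the two descriptions are assumed to be partly similar and are therefore penalized less for scoring close to each other.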