Image-text retrieval is the task of finding the textual descriptions that match a given image, and vice versa. One challenge of this task is its vulnerability to corruptions of the input image and text. Such corruptions are often unobserved during training and substantially degrade the decision quality of the retrieval model. In this paper, we propose a novel image-text retrieval technique, referred to as robust visual semantic embedding (RVSE), which consists of novel image-based and text-based augmentation techniques called semantic preserving augmentation for image (SPAugI) and text (SPAugT). Since SPAugI and SPAugT change the original data in a way that preserves its semantic information, we force the feature extractors to generate semantic-aware embedding vectors regardless of the corruption, improving the model robustness significantly. From extensive experiments on benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.
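To make the idea of corruption-robust embeddings concrete, the following is a minimal sketch, not the RVSE implementation. The abstract does not specify how SPAugI and SPAugT are constructed, so the `perturb` function (small Gaussian noise on a feature vector) is a hypothetical stand-in for a semantic-preserving augmentation, and the `consistency_loss` term (one minus cosine similarity between clean and augmented embeddings) is an assumed training signal used only for illustration. The toy vectors are made up.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def perturb(v, scale, rng):
    """Hypothetical stand-in for a semantic-preserving augmentation:
    add small Gaussian noise to each component (an assumption, not SPAugI/SPAugT)."""
    return [x + rng.gauss(0.0, scale) for x in v]


def consistency_loss(clean, augmented):
    """Assumed robustness penalty: close to zero when the two embeddings
    point in the same direction, larger as they diverge."""
    return 1.0 - cosine(clean, augmented)


def retrieve(query, candidates):
    """Return the index of the candidate with the highest cosine similarity."""
    scores = [cosine(query, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


rng = random.Random(0)

# Toy embeddings: one image query and three candidate captions.
image_emb = [1.0, 0.0, 0.5]
text_embs = [[0.9, 0.1, 0.4], [-1.0, 0.2, 0.0], [0.0, 1.0, 0.0]]

# Embedding of the "augmented" image under the stand-in perturbation.
aug_emb = perturb(image_emb, scale=0.05, rng=rng)

loss = consistency_loss(image_emb, aug_emb)
best_clean = retrieve(image_emb, text_embs)
best_aug = retrieve(aug_emb, text_embs)
```

In this sketch, a robust embedding is one for which `loss` stays small and `best_aug` agrees with `best_clean`, that is, the perturbed input still retrieves the same caption as the clean one.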