Composed image retrieval is an image retrieval task in which the user provides a reference image as a starting point and a text that specifies how to shift from that starting point to the desired target image. However, most existing methods focus on learning the composition of the text and the reference image and oversimplify the text as a description, neglecting its inherent structure and the user's shifting intention. As a result, these methods typically take shortcuts that disregard the visual cues of the reference image. To address this issue, we reconsider the text as an instruction and propose a Semantic Shift Network (SSN) that explicitly decomposes the semantic shift into two steps: from the reference image to a visual prototype, and from the visual prototype to the target image. Specifically, SSN decomposes the instruction into two components, degradation and upgradation: the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representation used to retrieve the desired target image. Experimental results show that SSN improves performance by 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, establishing a new state of the art. Code will be made publicly available.
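To make the two-step shift concrete, the following is a minimal toy sketch, not the SSN architecture itself. It assumes hand-written 3-dimensional feature vectors, a hypothetical elementwise gate for degradation, and a hypothetical additive vector for upgradation; the function names `degrade`, `upgrade`, and `retrieve` are illustrative and do not come from the paper. It also shows the shortcut the abstract warns about: a query built from the text alone can retrieve an image that ignores the reference.

```python
import math


def degrade(reference, gate):
    """Step 1 (assumed form): suppress the attributes the instruction removes.

    A gate value of 1.0 fully removes a feature; 0.0 keeps it unchanged.
    """
    return [r * (1.0 - g) for r, g in zip(reference, gate)]


def upgrade(prototype, addition):
    """Step 2 (assumed form): add the attributes the instruction introduces."""
    return [p + a for p, a in zip(prototype, addition)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, gallery):
    """Return the name of the gallery item most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


# Toy feature space (hypothetical): [is_dress, is_red, is_blue]
reference = [1.0, 1.0, 0.0]   # reference image: a red dress
gate = [0.0, 1.0, 0.0]        # degradation part of "make it blue": drop red
addition = [0.0, 0.0, 1.0]    # upgradation part of "make it blue": add blue

prototype = degrade(reference, gate)   # a colorless dress
query = upgrade(prototype, addition)   # a blue dress

gallery = {
    "red_dress": [1.0, 1.0, 0.0],
    "blue_dress": [1.0, 0.0, 1.0],
    "blue_shirt": [0.0, 0.0, 1.0],
}

best = retrieve(query, gallery)
# Text-only shortcut: querying with the added attribute alone ignores the
# reference image and no longer prefers a dress.
shortcut = retrieve(addition, gallery)
print(best, shortcut)
```

In this toy setting the two-step query keeps the "dress" cue from the reference image, whereas the text-only query does not. A learned model would replace the hand-written gate and addition vectors with components predicted from the instruction.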