The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image together with its modification text. Existing efforts suffer from two key limitations: 1) they ignore the multi-faceted query-target matching factors; and 2) they ignore the potential unlabeled reference-target image pairs in existing benchmark datasets. Addressing these two limitations is non-trivial due to the following challenges: 1) how to effectively model the multi-faceted matching factors in a latent manner without direct supervision signals; and 2) how to fully exploit the potential unlabeled reference-target image pairs to improve the generalization ability of the CIR model. To tackle these challenges, we first propose a muLtI-faceted Matching Network (LIMN), which consists of three key modules: a multi-grained image/text encoder, latent factor-oriented feature aggregation, and query-target matching modeling. We then design an iterative dual self-training paradigm that further enhances LIMN by fully exploiting the potential unlabeled reference-target image pairs in a semi-supervised manner; we denote the resulting model as LIMN+. Extensive experiments on three real-world datasets, FashionIQ, Shoes, and Birds-to-Words, show that our proposed method significantly surpasses state-of-the-art baselines.
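To make the multi-faceted matching idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released implementation): a composed query is formed from reference-image and modification-text features, both the query and the target image are projected into several latent matching factors, and per-factor similarities are aggregated into a query-target score. All names here (LatentFactorMatcher, num_factors, factor_weights) and the specific fusion/aggregation choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentFactorMatcher(nn.Module):
    """Illustrative sketch of latent factor-oriented query-target matching."""

    def __init__(self, dim=512, num_factors=4):
        super().__init__()
        # Simple fusion of reference-image and text features (assumed design).
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # One projection head per latent factor, for both the query and the target.
        self.query_heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_factors)])
        self.target_heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_factors)])
        # Learned weights that aggregate the per-factor similarities.
        self.factor_weights = nn.Parameter(torch.ones(num_factors))

    def forward(self, ref_feat, text_feat, tgt_feat):
        # ref_feat, text_feat, tgt_feat: (batch, dim) features from any pretrained
        # image/text encoders (the paper uses multi-grained encoders).
        query = self.fuse(torch.cat([ref_feat, text_feat], dim=-1))
        sims = []
        for qh, th in zip(self.query_heads, self.target_heads):
            q = F.normalize(qh(query), dim=-1)      # (batch, dim)
            t = F.normalize(th(tgt_feat), dim=-1)   # (batch, dim)
            sims.append(q @ t.t())                  # (batch, batch) similarity for this factor
        sims = torch.stack(sims, dim=0)             # (num_factors, batch, batch)
        weights = F.softmax(self.factor_weights, dim=0).view(-1, 1, 1)
        return (weights * sims).sum(dim=0)          # aggregated query-target scores

def contrastive_loss(scores, temperature=0.07):
    # In-batch contrastive objective: the diagonal entries correspond to the
    # labeled query-target pairs, all other targets in the batch act as negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores / temperature, labels)
```

The iterative dual self-training paradigm would sit on top of such a matcher, using a trained model to assign pseudo supervision to the unlabeled reference-target image pairs before the next training round; its exact form follows the paper rather than this sketch.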