The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image together with its corresponding modification text. Existing efforts suffer from two key limitations: 1) they ignore the multi-faceted query-target matching factors; 2) they ignore the potential unlabeled reference-target image pairs in existing benchmark datasets. Addressing these two limitations is non-trivial due to the following challenges: 1) how to effectively model the multi-faceted matching factors in a latent way without direct supervision signals; 2) how to fully utilize the potential unlabeled reference-target image pairs to improve the generalization ability of the CIR model. To address these challenges, we first propose a muLtI-faceted Matching Network (LIMN), which consists of three key modules: a multi-grained image/text encoder, latent factor-oriented feature aggregation, and query-target matching modeling. We then design an iterative dual self-training paradigm that further enhances the performance of LIMN by fully utilizing the potential unlabeled reference-target image pairs in a semi-supervised manner. We denote LIMN enhanced by this paradigm as LIMN+. Extensive experiments on three real-world datasets, FashionIQ, Shoes, and Birds-to-Words, show that our proposed method significantly surpasses the state-of-the-art baselines.