Composed Image Retrieval (CIR) aims to retrieve a target image given a reference image and conditioning text, enabling controllable search. Because constructing CIR triplet datasets is expensive, the zero-shot (ZS) CIR setting has been actively studied to eliminate the need for human-collected triplet datasets. Mainstream ZS-CIR methods employ an efficient projection module that maps a CLIP image embedding into the CLIP text token embedding space while keeping the CLIP encoders frozen. These methods then feed the projected image embedding to the pre-trained text encoder to generate image-text composed features. However, the CLIP image and text encoders suffer from a task discrepancy between the pre-training task (text $\leftrightarrow$ image) and the target CIR task (image + text $\leftrightarrow$ image). Conceptually, reducing this discrepancy requires expensive triplet samples; we instead use cheap text triplets and update the text encoder. To that end, we introduce Reducing Task Discrepancy of text encoders for Composed Image Retrieval (RTD), a plug-and-play training scheme that enhances the text encoder through a novel target-anchored text contrastive learning objective. We also propose two additional techniques that improve this learning scheme: a refined batch sampling strategy based on hard negatives and a sophisticated concatenation scheme. Integrating RTD into state-of-the-art projection-based ZS-CIR methods significantly improves performance across various datasets and backbones, demonstrating its efficiency and generalizability.
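To make the idea of target-anchored text contrastive learning more concrete, the following is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes each text triplet consists of a reference caption, a conditioning text, and a target caption; that the target caption embedding comes from a frozen text encoder and serves as the anchor; and that the embedding of the concatenated reference caption and conditioning text comes from the text encoder being updated. The function names, the InfoNCE form of the loss, and the temperature value are all illustrative assumptions, and the encoders themselves are replaced by plain lists of floats.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def target_anchored_loss(composed, targets, temperature=0.07):
    """InfoNCE-style loss over a batch (hypothetical sketch).

    composed[i]: embedding of "reference caption + conditioning text",
                 assumed to come from the trainable text encoder.
    targets[i]:  embedding of the target caption, assumed to come from
                 the frozen text encoder (the anchor).
    The other targets in the batch act as negatives for composed[i].
    """
    queries = [l2_normalize(v) for v in composed]
    anchors = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, anchor)) / temperature
            for anchor in anchors
        ]
        # Numerically stable log-sum-exp over all anchors in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matching target as the positive class.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch of two triplets with 2-dimensional stand-in embeddings.
aligned = target_anchored_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = target_anchored_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the loss is small when each composed embedding points at its own target and large when the targets are swapped. Under the stated assumptions, that gap is the signal that would drive updates to the text encoder. The hard-negative batch sampling and the concatenation scheme mentioned in the abstract are not modeled here.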