Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text description, without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm and employs a mapping network that maps the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede network generalization due to modality discrepancy and the distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework comprising two novel designs, namely a Textual Supplement (TS) module and a Semantic-Set (S-Set). The TS module exploits compositional textual semantics during training, enriching the pseudo-word token with linguistic semantics and thus mitigating the modality discrepancy effectively. The S-Set exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating overfitting caused by the redundancy of large-scale image-text data. Extensive experiments over four ZS-CIR benchmarks show that DeG outperforms state-of-the-art (SOTA) methods with much less training data, and saves substantial training and inference time for practical usage.
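To make the mapping-network paradigm described above concrete, the sketch below illustrates how a frozen image embedding can be projected into the text token space as a single pseudo-word token. This is a minimal, hypothetical illustration of the general paradigm, not the paper's DeG implementation; the module name `MappingNetwork`, the dimensions, and the MLP structure are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Illustrative sketch: projects a frozen image embedding into the
    token embedding space, yielding one pseudo-word token (S*).
    Dimensions and architecture are assumptions, not the paper's design."""

    def __init__(self, image_dim: int = 768, token_dim: int = 512,
                 hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(image_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (batch, image_dim) from a frozen VLM image encoder
        # returns:   (batch, token_dim) pseudo-word token embedding
        return self.mlp(image_emb)

# Usage sketch: the pseudo-word token stands in for a placeholder in a
# prompt such as "a photo of S*"; the composed query is then the text
# encoding of that prompt combined with the modification text.
mapper = MappingNetwork()
image_emb = torch.randn(4, 768)   # stand-in for frozen image features
pseudo_token = mapper(image_emb)  # one pseudo-word token per image
```

Only the lightweight mapping network is trained in this paradigm; the VLM's image and text encoders stay frozen, which is what makes the zero-shot transfer to CIR benchmarks possible.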