Composed Image Retrieval (CIR) is a challenging vision-language task that uses bi-modal (image + text) queries to retrieve target images. Despite the impressive performance of supervised CIR, its dependence on costly, manually labeled triplets limits scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) has been introduced, along with projection-based approaches. However, such methods face two major problems: task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image + text $\rightarrow$ image), and modality discrepancy. The latter affects approaches trained with text-only projection, since inference still requires extracting features from the reference image. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. We then introduce Modality-Task Dual Alignment (MoTaDual) as the second stage, in which large language models (LLMs) generate triplet data for fine-tuning, and prompt learning is applied in a multi-modal context to effectively alleviate both the modality and task discrepancies. Experimental results show that MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks while maintaining low training time and computational cost. The code will be released soon.
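To make the projection-based ZS-CIR setup concrete, the sketch below illustrates the textual-inversion idea at a high level: a learned head maps a reference-image embedding into the text-embedding space, and the resulting pseudo-token is fused with the modification text to form the composed query. All dimensions, the linear head `phi`, and the random embeddings are illustrative placeholders, not the paper's actual model or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding dimension (placeholder, not the paper's setting)

def normalize(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical textual-inversion head: a linear map from the image-embedding
# space to the word-embedding space (a stand-in for the pre-trained network).
W = rng.standard_normal((D, D)) / np.sqrt(D)

def phi(image_emb):
    """Project a reference-image embedding into a pseudo-token embedding."""
    return image_emb @ W

# Toy gallery of 5 candidate target-image embeddings (random stand-ins).
gallery = normalize(rng.standard_normal((5, D)))

# Bi-modal query: reference image + modification text.
ref_img = normalize(rng.standard_normal(D))
text_emb = normalize(rng.standard_normal(D))  # stands in for the encoded caption

# Compose: fuse the inverted pseudo-token with the text embedding.
query = normalize(phi(ref_img) + text_emb)

# Retrieve by cosine similarity against the gallery.
scores = gallery @ query
best = int(np.argmax(scores))
print(best)
```

At inference the gallery image ranked highest by cosine similarity is returned; the task and modality discrepancies the paper targets arise because `phi` is trained only on image-text pairs (or text alone), not on composed (image + text $\rightarrow$ image) triplets.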