Composed Image Retrieval (CIR) aims to retrieve a target image from a query composed of a reference image and a relative caption that describes the difference between the two images. Existing methods rely on supervised learning, and the high effort and cost of labeling CIR datasets hamper their widespread use. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), which aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in the CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), the first CIR dataset with multiple ground truths per query. Experiments show that SEARLE outperforms the baselines on the two main CIR datasets, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, code, and model are publicly available at https://github.com/miccunifi/SEARLE.
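To make the query construction concrete, the following is a minimal toy sketch of the idea described above: a reference image's features are turned into a pseudo-word token, that token is combined with the relative caption, and the resulting text query is matched against gallery images. Everything in it is an assumption for illustration only: the 4-dimensional features, the `TOKEN_EMBEDDINGS` table, the `$` placeholder, the identity `textual_inversion` mapping, and the summing `encode_text` are hypothetical stand-ins, not CLIP and not the SEARLE implementation. How the actual mapping is obtained is not described in this abstract.

```python
import math

# Toy feature axes: [dress, shirt, stripes, filler]. Purely illustrative.
# Hypothetical stand-in for a token embedding table (not CLIP's).
TOKEN_EMBEDDINGS = {
    "with": [0.0, 0.0, 0.0, 0.1],
    "stripes": [0.0, 0.0, 1.0, 0.0],
}

PSEUDO_WORD = "$"  # hypothetical placeholder that the pseudo-word token replaces


def textual_inversion(image_features):
    """Assumed stand-in for the image-to-pseudo-word mapping (identity here)."""
    return list(image_features)


def encode_text(tokens, pseudo_token):
    """Toy text encoder: sums token embeddings, swapping in the pseudo-word."""
    total = [0.0] * len(pseudo_token)
    for tok in tokens:
        emb = pseudo_token if tok == PSEUDO_WORD else TOKEN_EMBEDDINGS[tok]
        total = [t + e for t, e in zip(total, emb)]
    return total


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(reference_features, relative_caption, gallery):
    """Rank gallery image names by similarity to the composed text query."""
    pseudo = textual_inversion(reference_features)
    tokens = [PSEUDO_WORD] + relative_caption.split()
    query = encode_text(tokens, pseudo)
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)


# Toy gallery of image features in the same (assumed shared) space.
GALLERY = {
    "plain_dress": [1.0, 0.0, 0.0, 0.0],
    "striped_dress": [1.0, 0.0, 1.0, 0.0],
    "striped_shirt": [0.0, 1.0, 1.0, 0.0],
    "plain_shirt": [0.0, 1.0, 0.0, 0.0],
}

# Reference image: a plain dress. Relative caption: "with stripes".
ranking = retrieve([1.0, 0.0, 0.0, 0.0], "with stripes", GALLERY)
print(ranking)
```

In this toy setup the striped dress ranks first because the composed query carries both the reference image's content (through the pseudo-word) and the change requested by the caption. The sketch assumes token embeddings and image features live in one space, which a real CLIP model does not guarantee; there the composed text passes through the text encoder before comparison.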