Composed Image Retrieval (CIR) is the task of retrieving images that are similar to a query image and consistent with a provided textual modification. Current techniques train CIR models with supervised learning on labeled triplets of (reference image, text, target image). Such triplets are far less common than simple image-text pairs, which limits the widespread use and scalability of CIR. Zero-shot CIR, on the other hand, can be trained relatively easily on image-caption pairs without considering the image-to-image relation, but it tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search auxiliary data for a reference image and its related target images, and train our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., the visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
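The abstract does not say how reference and target images are paired or how VDG is called, so the following is only a minimal sketch of the pseudo-triplet idea under stated assumptions. It assumes that each auxiliary image already has an embedding vector, that "related" means a cosine similarity inside a hypothetical band `[low, high]`, and that VDG can be treated as a black-box callable. The names `mine_pairs`, `build_pseudo_triplets`, and `describe_delta` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

import numpy as np


def mine_pairs(
    embeddings: np.ndarray, low: float = 0.5, high: float = 0.95
) -> List[Tuple[int, int]]:
    """Pair each image with its most similar neighbour inside [low, high].

    The similarity band is an illustrative assumption: `high` skips
    near-duplicates, `low` skips unrelated images.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # L2-normalise rows
    sim = unit @ unit.T                               # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                    # never pair an image with itself

    pairs = []
    for ref in range(sim.shape[0]):
        # Mask out candidates that are too similar to the reference.
        masked = np.where(sim[ref] <= high, sim[ref], -np.inf)
        tgt = int(np.argmax(masked))
        if masked[tgt] >= low:
            pairs.append((ref, tgt))
    return pairs


def build_pseudo_triplets(
    pairs: List[Tuple[int, int]],
    describe_delta: Callable[[int, int], str],
) -> List[Tuple[int, str, int]]:
    """Turn (reference, target) pairs into (reference, text, target) triplets.

    `describe_delta` stands in for a visual delta generator; here it is any
    callable that returns text for a pair of image indices.
    """
    return [(ref, describe_delta(ref, tgt), tgt) for ref, tgt in pairs]


# Toy usage with hand-made 2-D embeddings and a placeholder generator.
toy_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
toy_pairs = mine_pairs(toy_embeddings)
toy_triplets = build_pseudo_triplets(
    toy_pairs, lambda ref, tgt: f"difference between image {ref} and image {tgt}"
)
```

The resulting triplets could then be mixed with labeled triplets when training a CIR model. How the two sources are weighted is not described in the abstract and is left open here.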