Composed Image Retrieval (CIR) is the task of retrieving images that match a reference image augmented with accompanying text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data consisting of a reference image, a reformulation text, and a target image. However, curating such triplets typically requires human annotation, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even when abundant unlabeled data is available. With recent advances in foundation models, we advocate a shift in the CIR training paradigm in which human annotations are efficiently replaced by large language models (LLMs). Specifically, we demonstrate that large captioning and language models can efficiently generate training data for CIR relying only on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines the image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on the CIRR and FashionIQ datasets. Furthermore, we demonstrate that increasing the amount of generated data brings our zero-shot model closer to the performance of supervised baselines.