Composed Image Retrieval (CIR) is the task of retrieving images that match a reference image augmented with accompanying text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data consisting of a reference image, a reformulation text, and a target image. However, curating such triplets typically requires human annotation, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even when abundant unlabeled data is available. With recent advances in foundation models, we advocate a shift in the CIR training paradigm in which human annotations are efficiently replaced by large language models (LLMs). Specifically, we demonstrate that large captioning and language models can efficiently generate training data for CIR relying only on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines the image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on the CIRR and FashionIQ datasets. Furthermore, we demonstrate that increasing the amount of generated data brings our zero-shot model closer to the performance of supervised baselines.