Zero-shot Composed Text-Image Retrieval

In this paper, we consider the problem of composed image retrieval (CIR), it aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR model, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism, to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure, and the effectiveness of core components in TransAgg; (iv) when evaluating on the publicly available benckmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets, then directly conduct inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/

翻译：本文研究组合图像检索（CIR）问题，旨在训练一个能够融合多模态信息（如文本与图像）的模型，以精确检索与查询相匹配的图像，扩展用户的表达能力。我们的贡献如下：（i）提出一种可扩展的数据构建流程，通过利用大规模图像-文本对数据集（如LAION-5B的子集）自动构建CIR模型训练数据集；（ii）引入基于Transformer的自适应聚合模型TransAgg，该模型采用简单而高效的融合机制，自适应地整合来自不同模态的信息；（iii）进行广泛的消融实验，以验证所提数据构建流程的实用性和TransAgg核心组件的有效性；（iv）在零样本场景下的公开基准数据集（即：在自动构建的数据集上训练，直接对目标下游数据集如CIRR和FashionIQ进行推理）上进行评估时，所提方法性能与现有最先进（SOTA）模型持平或显著超越。项目页面：https://code-kunkun.github.io/ZS-CIR/

相关内容

MoDELS

关注 46

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日