This paper proposes CompoDiff, a novel diffusion-based model that solves Composed Image Retrieval (CIR) with latent diffusion, and presents a newly created dataset of 18 million triplets, each consisting of a reference image, a condition, and the corresponding target image, to train the model. CompoDiff not only achieves a new zero-shot state of the art on CIR benchmarks such as FashionIQ but also enables more versatile CIR by accepting various conditions, such as negative text and image mask conditions, that existing CIR methods do not support. In addition, CompoDiff features lie in the intact CLIP embedding space, so they can be used directly by any existing model that operates in the CLIP space. The code and dataset used for training, along with the pre-trained weights, are available at https://github.com/navervision/CompoDiff
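Because the composed features are stated to lie in the CLIP embedding space, retrieval can in principle reduce to a nearest-neighbor search against ordinary CLIP image embeddings. The following is a minimal sketch of that final ranking step only, not the authors' implementation: the function names `l2_normalize` and `retrieve_top_k` are hypothetical, the toy 3-dimensional vectors stand in for real CLIP features, and the composed query feature is assumed to have already been produced by the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve_top_k(query_feature, gallery_features, k=3):
    """Rank gallery embeddings by cosine similarity to the query.

    query_feature: a composed feature (assumed to be in the same space
        as the gallery embeddings).
    gallery_features: list of image embeddings, one per candidate image.
    Returns a list of (gallery_index, similarity) pairs, best first.
    """
    q = l2_normalize(query_feature)
    scored = []
    for index, feature in enumerate(gallery_features):
        g = l2_normalize(feature)
        score = sum(a * b for a, b in zip(q, g))
        scored.append((index, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for CLIP image embeddings (hypothetical values).
gallery = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
# Toy stand-in for a composed query feature (hypothetical values).
composed_query = [0.9, 0.8, 0.0]

ranking = retrieve_top_k(composed_query, gallery, k=2)
print(ranking)
```

In this sketch the gallery never needs to be re-encoded for the retrieval model: any index already built over CLIP image embeddings could be queried with the composed feature, which is the practical consequence of keeping features in the CLIP space.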