This paper proposes CompoDiff, a novel diffusion-based model for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. It also introduces a new synthetic dataset, named SynthTriplets18M, containing 18.8 million triplets of reference images, conditions, and corresponding target images for training CIR models. Together, CompoDiff and SynthTriplets18M address the shortcomings of previous CIR approaches, such as poor generalizability caused by small dataset scale and a limited range of condition types. CompoDiff not only achieves a new state of the art on four ZS-CIR benchmarks (FashionIQ, CIRR, CIRCO, and GeneCIS), but also enables more versatile and controllable CIR by accepting diverse conditions, such as negative text and image mask conditions. CompoDiff further allows controlling the relative strength of text and image query conditions and trading off inference speed against performance, capabilities unavailable in existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff.