This paper proposes CompoDiff, a novel diffusion-based model that solves zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. It also introduces SynthTriplets18M, a new synthetic dataset of 18.8 million triplets, each consisting of a reference image, a condition, and the corresponding target image, for training CIR models. Together, CompoDiff and SynthTriplets18M address shortcomings of previous CIR approaches, such as poor generalizability caused by small dataset scale and limited condition types. CompoDiff not only achieves a new state of the art on four ZS-CIR benchmarks (FashionIQ, CIRR, CIRCO, and GeneCIS) but also enables more versatile and controllable CIR by accepting various conditions, such as negative text and image masks. CompoDiff further allows control over the relative strength of the text and image query conditions and over the trade-off between inference speed and performance, neither of which is available in existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff