This paper presents a new approach to image similarity search in fashion, a domain that is inherently ambiguous because images can be considered similar in multiple ways. We introduce the concept of Referred Visual Search (RVS), in which users provide additional information to define the desired similarity. We present a new dataset, LAION-RVS-Fashion, designed explicitly for this task and consisting of 272K fashion products with 842K images extracted from LAION. We then propose a method for learning conditional embeddings using weakly-supervised training, achieving a 6% increase in Recall at one (R@1) against a gallery with 2M distractors, compared to classical approaches based on explicit attention and filtering. The proposed method is robust, maintaining similar R@1 when dealing with 2.5 times as many distractors as the baseline methods. We believe this is a step forward in the emerging field of Referred Visual Search, both in terms of accessible data and approach. Code, data, and models are available at https://www.github.com/Simon-Lepage/CondViT-LRVSF.
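To make the reported metric concrete, the following is a minimal sketch of how Recall at k can be computed when the gallery contains distractors. It is not taken from the released code: the function names `cosine` and `recall_at_k`, the toy two-dimensional embeddings, and the convention of marking distractors with a product id of `None` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose product appears among the top-k gallery items.

    queries: list of (embedding, product_id)
    gallery: list of (embedding, product_id), with product_id None for
             distractors (an assumed convention for this sketch).
    """
    hits = 0
    for q_emb, q_id in queries:
        # Rank the whole gallery, distractors included, by similarity.
        ranked = sorted(gallery, key=lambda item: cosine(q_emb, item[0]),
                        reverse=True)
        top_ids = [pid for _, pid in ranked[:k]]
        hits += q_id in top_ids
    return hits / len(queries)

# Toy data: two queries, two matching products, two distractors.
queries = [([1.0, 0.0], "a"), ([0.0, 1.0], "b")]
gallery = [
    ([0.9, 0.1], "a"),
    ([0.2, 0.8], "b"),
    ([0.7, 0.7], None),   # distractor
    ([1.0, 0.05], None),  # distractor that sits closer to query "a"
]

r_at_1 = recall_at_k(queries, gallery, k=1)
r_at_2 = recall_at_k(queries, gallery, k=2)
```

In this toy gallery one distractor outranks the correct product for the first query, so R@1 is lower than R@2. This is the effect that adding distractors has on the metric, and it is why robustness to a larger distractor set is reported separately.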