This paper introduces a new challenge for image similarity search in fashion, where complex images make the notion of similarity inherently ambiguous. We present Referred Visual Search (RVS), a task that lets users specify the desired similarity more precisely, following recent interest in the industry. We release LRVS-Fashion, a new large public dataset designed explicitly for this task, consisting of 272k fashion products with 842k images extracted from fashion catalogs. In contrast to traditional visual search methods in the industry, we demonstrate that superior performance can be achieved by bypassing explicit object detection and adopting weakly-supervised conditional contrastive learning on image tuples. Our method is lightweight and robust, reaching a recall at one (R@1) superior to strong detection-based baselines against 2M distractors. The dataset is available at https://huggingface.co/datasets/Slep/LAION-RVS-Fashion .
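To make the evaluation metric named in the abstract concrete, the following is a minimal sketch of how recall at one against distractors could be computed. It assumes cosine similarity over precomputed embeddings and a gallery formed by concatenating the true targets with the distractors. The function name `recall_at_1` and the array layout are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np


def recall_at_1(queries, targets, distractors):
    """Fraction of queries whose nearest gallery item is their own target.

    Hypothetical helper for illustration only.
    queries:     (n, d) query embeddings
    targets:     (n, d) embeddings; targets[i] is the correct match for queries[i]
    distractors: (m, d) embeddings of unrelated gallery items
    """
    def l2_normalize(x):
        # Unit-normalize rows so that dot products equal cosine similarities.
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    q = l2_normalize(np.asarray(queries, dtype=float))
    gallery = l2_normalize(
        np.concatenate(
            [np.asarray(targets, dtype=float), np.asarray(distractors, dtype=float)],
            axis=0,
        )
    )
    sims = q @ gallery.T            # shape (n, n + m)
    top1 = sims.argmax(axis=1)      # index of the best-scoring gallery item per query
    # Query i is correct when its top-1 gallery index is i (its own target).
    return float(np.mean(top1 == np.arange(len(q))))


# Toy usage: three targets and one distractor that sits between targets 0 and 1.
targets = np.eye(3)
distractors = np.array([[1.0, 1.0, 0.0]])
print(recall_at_1(np.eye(3), targets, distractors))
```

At the scale mentioned in the abstract (2M distractors), a dense similarity matrix like this would likely be replaced by batched or approximate nearest-neighbor search; the sketch only illustrates the definition of the metric.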