Human perception of visual similarity is inherently adaptive and subjective, depending on the user's interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates textual conditioning from visual feature extraction, allowing highly efficient, multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to prior work.
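To make the idea of a text-conditional similarity over fixed visual embeddings concrete, the following is a minimal sketch and not CLAY's actual algorithm, which the abstract does not specify. It assumes that a condition can be represented by a handful of text embeddings and that image embeddings are compared only within the subspace those text embeddings span. The 4-dimensional toy vectors and all function names (`orthonormal_basis`, `project`, `conditional_similarity`) are hypothetical; in practice the vectors would come from a pretrained VLM's image and text encoders.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(text_embeddings, eps=1e-9):
    """Gram-Schmidt over the text embeddings that describe a condition."""
    basis = []
    for v in text_embeddings:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis

def project(image_embedding, basis):
    """Coordinates of a fixed image embedding inside the condition subspace."""
    return [dot(image_embedding, b) for b in basis]

def conditional_similarity(img_a, img_b, basis):
    """Cosine similarity of two image embeddings, restricted to the subspace."""
    pa, pb = project(img_a, basis), project(img_b, basis)
    na, nb = math.sqrt(dot(pa, pa)), math.sqrt(dot(pb, pb))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(pa, pb) / (na * nb)

# Toy data (assumption): dimensions 0-1 encode color, dimensions 2-3 encode shape.
color_texts = [[1, 0, 0, 0], [0, 1, 0, 0]]   # stand-ins for "red", "blue"
shape_texts = [[0, 0, 1, 0], [0, 0, 0, 1]]   # stand-ins for "circle", "square"

# Image embeddings are computed once and reused for every condition.
red_circle = [1, 0, 1, 0]
red_square = [1, 0, 0, 1]
blue_circle = [0, 1, 1, 0]

color_basis = orthonormal_basis(color_texts)
shape_basis = orthonormal_basis(shape_texts)
# Multiple conditions: build one basis from the union of their text embeddings.
both_basis = orthonormal_basis(color_texts + shape_texts)

sim_color = conditional_similarity(red_circle, red_square, color_basis)
sim_shape = conditional_similarity(red_circle, red_square, shape_basis)
sim_both = conditional_similarity(red_circle, red_square, both_basis)
```

In this toy setup the red circle and red square are maximally similar under the color condition and dissimilar under the shape condition, while the image embeddings themselves are never recomputed. Only the small text-derived basis changes per query, which is the kind of separation between conditioning and visual feature extraction the abstract describes.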