Many image retrieval studies use metric learning to train an image encoder. However, metric learning cannot account for differences in users' preferences, and it requires training data for the image encoder. To overcome these limitations, we revisit relevance feedback, a classic technique for interactive retrieval systems, and propose an interactive CLIP-based image retrieval system with relevance feedback. Our system first performs an initial retrieval, then collects each user's preferences through binary feedback, and finally returns images that match those preferences. Even when preferences vary across users, the system learns each user's preference from the feedback and adapts to it. Moreover, the system leverages CLIP's zero-shot transferability and achieves high accuracy without training. We show empirically that our system is competitive with state-of-the-art metric learning in category-based image retrieval, even though it does not train an image encoder for each dataset. Furthermore, we introduce two additional experimental settings in which users have differing preferences: one-label-based image retrieval and conditioned image retrieval. In both settings, our system adapts effectively to each user's preferences and achieves higher accuracy than image retrieval without feedback. Overall, our work highlights the potential benefits of integrating CLIP with classic relevance feedback techniques to enhance image retrieval.
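The retrieve, collect feedback, re-rank loop described above can be illustrated with a minimal sketch. The abstract does not specify how feedback is incorporated, so the sketch assumes a Rocchio-style query update, a classic relevance feedback rule that is swapped in here purely for illustration and is not necessarily the proposed method. The helper names (`l2_normalize`, `retrieve`, `rocchio_update`), the weights `alpha`, `beta`, `gamma`, and the tiny hand-made 3-D vectors standing in for CLIP embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query, gallery, k):
    """Return indices of the k gallery vectors most similar to the query."""
    scores = gallery @ query
    return np.argsort(-scores)[:k]


def rocchio_update(query, positives, negatives, alpha=1.0, beta=1.0, gamma=0.5):
    """Rocchio-style update: an assumed stand-in, not the paper's method.

    Moves the query toward the mean of liked items and away from the
    mean of disliked items, then re-normalizes.
    """
    new_query = alpha * query
    if len(positives) > 0:
        new_query = new_query + beta * positives.mean(axis=0)
    if len(negatives) > 0:
        new_query = new_query - gamma * negatives.mean(axis=0)
    return l2_normalize(new_query)


# Toy stand-ins for pre-computed image embeddings (hypothetical values).
gallery = l2_normalize([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
])
query = l2_normalize([1.0, 0.2, 0.0])

# Hypothetical user who prefers gallery items 2 and 3.
preferred = {2, 3}

# Round 1: initial retrieval, then binary feedback on the shown items.
shown = retrieve(query, gallery, k=3)
labels = np.array([int(i) in preferred for i in shown])

# Round 2: update the query from the feedback and retrieve again.
refined = rocchio_update(query, gallery[shown[labels]], gallery[shown[~labels]])
reranked = retrieve(refined, gallery, k=3)

print("initial:", shown.tolist())
print("after feedback:", reranked.tolist())
```

In this toy setup the initial ranking is dominated by items the simulated user dislikes, and a single round of binary feedback is enough to move both preferred items to the top. In a real system the vectors would come from a frozen CLIP image and text encoder, and the update rule would be whatever the system actually uses.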