The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, and vice versa. This is achieved via a joint embedding of image and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to refine the search query by iteratively defining positive and negative examples. Our framework trains a classification model on this additional user feedback and, in essence, outputs all positively classified instances of the entire data catalog. Building upon recent techniques, this inference phase is not implemented by scanning the entire data catalog, but by employing efficient index structures pre-built for the data. Our results show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy while maintaining swift response times.
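The retrieve-then-refine loop described above can be sketched in a few lines. The snippet below is a minimal illustration, not the CLIP-Branches implementation: it stands in random vectors for CLIP embeddings, simulates the user's positive/negative feedback, trains a small logistic-regression classifier on that feedback, and then classifies the whole catalog. All names and parameters are hypothetical; in particular, the final exhaustive scan is exactly what the paper's pre-built index structures avoid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for CLIP embeddings: in the real system these come
# from CLIP's image/text encoders (a shared ~512-d space; 64-d here).
catalog = rng.normal(size=(1000, 64)).astype(np.float32)
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)
query = catalog[0] + 0.1 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

# Step 1: initial retrieval via cosine similarity. A production system
# would use an approximate nearest neighbor index instead of a full scan.
sims = catalog @ query
top_k = np.argsort(-sims)[:20]

# Step 2: interactive fine-tuning -- the user marks retrieved items as
# positive or negative; here that feedback is simulated.
pos, neg = top_k[:10], top_k[10:]
X = catalog[np.concatenate([pos, neg])]
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

# Step 3: train a small logistic-regression classifier on the feedback
# (plain gradient descent; any lightweight classifier would do).
w = np.zeros(64, dtype=np.float32)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * float(np.mean(p - y))

# Step 4: return every positively classified item from the catalog.
# CLIP-Branches answers this query via pre-built index structures
# rather than the exhaustive scan shown here.
refined = np.flatnonzero(catalog @ w + b > 0)
```

The design point the sketch makes concrete is that the refinement step reduces to ordinary supervised classification over the embedding space, so the quality of the final result depends on the feedback set, while the speed depends on how the "all positives" query is answered.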