Advances in embedding models for text, image, audio, and video drive progress in domains such as retrieval-augmented generation and recommendation systems. Many of these applications need an efficient way to retrieve items that are close to a given query in the embedding space while also satisfying a filter condition on the items' attributes, a problem known as filtered approximate nearest neighbor search (FANNS). Through an in-depth analysis of the FANNS literature, we identify a key gap in the research landscape: the lack of publicly available datasets that combine embedding vectors from state-of-the-art transformer-based text embedding models with abundant real-world attributes covering a broad spectrum of attribute types and value distributions. To fill this gap, we introduce the arxiv-for-fanns dataset, which contains transformer-based embedding vectors for the abstracts of over 2.7 million arXiv papers, enriched with 11 real-world attributes such as authors and categories. We benchmark 11 FANNS methods on the new dataset and evaluate their performance across filter types, numbers of retrieved neighbors, dataset scales, and query selectivities. We distill our findings into eight key observations that guide users in selecting the most suitable FANNS method for their specific use case.
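To make the problem statement concrete, the following is a minimal sketch of exact filtered nearest neighbor search by pre-filtering and linear scan. It is not one of the benchmarked FANNS methods, only a baseline of the kind approximate methods are usually compared against. The attribute names (`category`, `year`), the toy vectors, and the helper names are illustrative assumptions, not taken from the dataset. Selectivity is taken here to mean the fraction of items that pass the filter; the abstract does not define the term, and some works use the opposite convention.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical item layout: (embedding vector, attribute dictionary).
Item = Tuple[Sequence[float], Dict[str, object]]
Predicate = Callable[[Dict[str, object]], bool]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_knn(items: List[Item], query: Sequence[float],
                 predicate: Predicate, k: int) -> List[int]:
    """Exact filtered k-NN: keep items passing the filter, then rank by distance."""
    candidates = [
        (squared_l2(vec, query), idx)
        for idx, (vec, attrs) in enumerate(items)
        if predicate(attrs)
    ]
    candidates.sort()  # ascending distance, ties broken by index
    return [idx for _, idx in candidates[:k]]


def selectivity(items: List[Item], predicate: Predicate) -> float:
    """Fraction of items that satisfy the filter (assumed convention)."""
    return sum(1 for _, attrs in items if predicate(attrs)) / len(items)


# Toy data: 2-dimensional "embeddings" with made-up attributes.
items: List[Item] = [
    ([0.0, 0.0], {"category": "cs.DB", "year": 2021}),
    ([1.0, 0.0], {"category": "cs.LG", "year": 2023}),
    ([0.0, 2.0], {"category": "cs.DB", "year": 2024}),
    ([3.0, 3.0], {"category": "cs.DB", "year": 2019}),
]
query = [0.9, 0.1]


def is_db(attrs: Dict[str, object]) -> bool:
    """Example filter: keep only items in the (hypothetical) cs.DB category."""
    return attrs["category"] == "cs.DB"


neighbors = filtered_knn(items, query, is_db, k=2)
fraction_passing = selectivity(items, is_db)
```

In this toy example the unfiltered nearest neighbor of the query is item 1, but it fails the filter, so the filtered result starts with item 0 instead. A linear scan like this costs time proportional to the dataset size per query, which is why approximate, index-based methods are of interest at the scale of millions of vectors.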