Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation and recommendation systems. Many of these applications require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition on the items' attributes, a problem known as filtered approximate nearest neighbor search (FANNS). Through an in-depth analysis of the FANNS literature, we identify a key gap in the research landscape: the lack of publicly available datasets that combine embedding vectors from state-of-the-art transformer-based text embedding models with abundant real-world attributes covering a broad spectrum of attribute types and value distributions. To fill this gap, we introduce the arxiv-for-fanns dataset of transformer-based embedding vectors for the abstracts of over 2.7 million arXiv papers, enriched with 11 real-world attributes such as authors and categories. We benchmark eleven FANNS methods on our new dataset to evaluate their performance across different filter types, numbers of retrieved neighbors, dataset scales, and query selectivities. We distill our findings into eight key observations that guide users in selecting the most suitable FANNS method for their specific use case.
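To make the FANNS problem concrete, the following is a minimal sketch of *exact* filtered nearest neighbor search by brute force; real FANNS methods instead build approximate indexes to avoid scanning every item. All names, the toy 2-D "embeddings", and the category attribute are illustrative assumptions, not part of the benchmark.

```python
# Minimal sketch of the filtered nearest neighbor problem (exact brute force).
# Real FANNS methods answer the same query approximately via index structures.
import math

def filtered_knn(items, query_vec, predicate, k):
    """Return (distance, attrs) for the k items closest to query_vec
    whose attributes satisfy the filter predicate.

    items: list of (vector, attributes) pairs; attributes is a dict.
    """
    candidates = [
        (math.dist(vec, query_vec), attrs)  # Euclidean distance
        for vec, attrs in items
        if predicate(attrs)                 # attribute filter condition
    ]
    candidates.sort(key=lambda pair: pair[0])
    return candidates[:k]

# Hypothetical toy corpus: 2-D vectors standing in for text embeddings,
# each with a single categorical attribute.
corpus = [
    ((0.0, 0.0), {"category": "cs.LG"}),
    ((1.0, 0.0), {"category": "cs.DB"}),
    ((0.1, 0.1), {"category": "cs.DB"}),
    ((2.0, 2.0), {"category": "cs.LG"}),
]

# Nearest item to the origin among those matching category == "cs.DB".
result = filtered_knn(corpus, (0.0, 0.0),
                      lambda a: a["category"] == "cs.DB", k=1)
```

Note that the unfiltered nearest neighbor (the origin itself, a cs.LG item) is excluded by the predicate; query selectivity, i.e. the fraction of items passing the filter, is one of the dimensions varied in the benchmark.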