Filtered Vector Search (FVS) is critical for supporting semantic search and GenAI applications in modern database systems. However, existing research mostly evaluates FVS algorithms in specialized libraries, under optimistic assumptions that do not match enterprise-grade database systems. We challenge this premise by demonstrating that, in a production-grade database system, these common assumptions do not hold, so performance characteristics and algorithmic trade-offs differ fundamentally from those observed in isolated library settings. This paper presents the first in-depth analysis of filter-agnostic FVS algorithms within a production PostgreSQL-compatible system. We systematically evaluate post-filtering and inline-filtering strategies across a wide range of selectivities and correlations. Our central finding is that the optimal algorithm is not dictated by the cost of distance computations alone: system-level overheads incurred by both distance computations and filter operations, such as page accesses and data retrieval, also play a significant role. We show that graph-based approaches (such as NaviX/ACORN) can incur prohibitive numbers of filter checks and system-level overheads compared with clustering-based indexes such as ScaNN, often canceling out their theoretical benefits in real-world database environments. Ultimately, our findings give the database community practical insights and guidelines: the optimal choice of a filter-agnostic FVS algorithm is not absolute, but a system-aware decision that depends on the interplay between workload characteristics and the underlying costs of data access in a real-world database architecture.
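To make the two strategies concrete, the following is a minimal illustrative sketch, not the system or the algorithms evaluated in the paper. It assumes a flat brute-force scan in place of a real index (ScaNN or a graph index), and the names `post_filter`, `inline_filter`, `overfetch`, and the `stats` counters are hypothetical. It only shows how the two strategies shift work between distance computations and filter checks, the two cost sources discussed above.

```python
import heapq
import random


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def post_filter(query, vectors, passes, k, overfetch):
    """Rank all rows by distance, then filter the k * overfetch nearest ones.

    Assumption: a flat scan stands in for the vector index.
    May return fewer than k rows when the filter is very selective.
    """
    stats = {"distance": len(vectors), "filter": 0}
    ranked = heapq.nsmallest(
        k * overfetch, range(len(vectors)), key=lambda i: l2(query, vectors[i])
    )
    hits = []
    for i in ranked:
        stats["filter"] += 1  # one filter check per examined candidate
        if passes[i]:
            hits.append(i)
            if len(hits) == k:
                break
    return hits, stats


def inline_filter(query, vectors, passes, k):
    """Check the filter on every visited row; compute distances only for rows that pass."""
    stats = {"distance": 0, "filter": 0}
    survivors = []
    for i in range(len(vectors)):
        stats["filter"] += 1  # filter check happens during the scan itself
        if passes[i]:
            stats["distance"] += 1
            survivors.append((l2(query, vectors[i]), i))
    hits = [i for _, i in heapq.nsmallest(k, survivors)]
    return hits, stats


# Toy workload: 1000 random 8-dimensional vectors, roughly 5% filter selectivity.
random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
passes = [random.random() < 0.05 for _ in range(1000)]
query = [random.random() for _ in range(8)]

post_hits, post_stats = post_filter(query, vectors, passes, k=10, overfetch=4)
inline_hits, inline_stats = inline_filter(query, vectors, passes, k=10)
print("post-filtering:", post_stats, "rows returned:", len(post_hits))
print("inline-filtering:", inline_stats, "rows returned:", len(inline_hits))
```

In this toy setting, post-filtering performs few filter checks but can fall short of `k` results at low selectivity, while inline-filtering performs one filter check per visited row. In a real database each of those checks may also involve a page access, which is the kind of system-level overhead the abstract refers to; the sketch does not model that cost.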