The rapid growth of machine learning capabilities and the adoption of data processing methods based on vector embeddings have sparked great interest in systems for vector data management. While the predominant approach to vector data management is to use specialized index structures for fast search over the entire set of vector embeddings, once vectors are combined with other (meta)data, search queries can also become selective on relational attributes, as is typical for analytical queries. Because using vector indexes differs from traditional relational data access, we revisit and analyze alternative access paths for efficient mixed vector-relational search. We first evaluate the accurate but exhaustive scan-based search and propose hardware optimizations, an alternative tensor-based formulation, and batching to offset its cost. We outline the complex access-path design space, primarily driven by relational selectivity, and the decisions to consider when choosing between an exhaustive scan-based search and an approximate index-based approach. Since the vector index primarily avoids expensive computation across the entire dataset, contrary to conventional relational wisdom, it is better to scan at lower selectivity and probe at higher selectivity, with the crossover point between the two approaches dictated by data dimensionality and the number of concurrent search queries.
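To make the scan-based access path concrete, the following is a minimal sketch of an exact, filtered top-k search expressed as a single batched matrix product. It is an illustration under stated assumptions, not the authors' implementation: the function name `filtered_scan_topk`, its parameters, and the choice of inner product as the similarity measure are all hypothetical, and NumPy stands in for whatever tensor runtime a real system would use.

```python
import numpy as np

def filtered_scan_topk(vectors, attr, queries, predicate, k):
    """Exact top-k by inner product over rows that pass a relational predicate.

    All names here are hypothetical; inner-product similarity is an assumption.
    vectors: (n, d) embeddings, attr: (n,) relational attribute,
    queries: (m, d) batch of concurrent search queries.
    Returns an (m, k') array of original row ids, best match first.
    """
    # Relational filter first: only qualifying rows are scanned.
    ids = np.flatnonzero(predicate(attr))
    if ids.size == 0:
        return np.empty((queries.shape[0], 0), dtype=np.int64)

    candidates = vectors[ids]            # (n_sel, d)
    # Tensor formulation: one matmul scores the whole query batch at once.
    scores = queries @ candidates.T      # (m, n_sel)

    k = min(k, ids.size)
    # Select the k best per query without fully sorting every row...
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # ...then order just those k by descending score.
    top_scores = np.take_along_axis(scores, top, axis=1)
    order = np.argsort(-top_scores, axis=1)
    top = np.take_along_axis(top, order, axis=1)
    return ids[top]                      # map back to original row ids
```

In this sketch the work is proportional to the number of qualifying rows times the dimensionality times the batch size, which is why the scan becomes attractive when the relational predicate leaves few rows to score and when many queries can share one matrix product.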