Given a vector dataset $\mathcal{X}$ and a query vector $\vec{x}_q$, graph-based Approximate Nearest Neighbor Search (ANNS) builds a graph index $G$ and searches over $G$ to approximately return the vectors with minimum distance to $\vec{x}_q$. The main drawback of graph-based ANNS is that the graph index may be too large to fit into memory, especially for a large-scale $\mathcal{X}$. To address this, a Product Quantization (PQ)-based hybrid method called DiskANN was proposed, which stores a low-dimensional PQ index in memory and keeps the graph index on SSD, reducing memory overhead while maintaining high search accuracy. However, it suffers from two I/O issues that significantly affect overall efficiency: (1) a long routing path from the entry vertex to the query's neighborhood, which results in a large number of I/O requests, and (2) redundant I/O requests during the routing process. We propose DiskANN++, an optimized variant that overcomes these issues. For the first issue, we present a query-sensitive entry vertex selection strategy that replaces DiskANN's static graph-central entry vertex with a dynamically determined entry vertex close to the query. For the second issue, we apply an isomorphic mapping to DiskANN's graph index to optimize the SSD layout, and we propose Pagesearch, an asynchronously optimized search procedure built on that layout, as an alternative to DiskANN's beamsearch. Comprehensive experiments on eight real-world datasets demonstrate the superior efficiency of DiskANN++: under the same accuracy constraint, it achieves a 1.5× to 2.2× improvement in QPS over DiskANN.
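To make the contrast between a static and a query-sensitive entry vertex concrete, the following is a minimal sketch, not the DiskANN++ implementation. It assumes that a small set of candidate entry vertices is kept in memory and that the candidate nearest to the query is chosen as the starting point; the abstract does not specify how candidates are obtained, so the names `select_entry_vertex`, `static_entry_vertex`, and the `candidates` mapping are hypothetical.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_entry_vertex(query: Vector, candidates: Dict[int, Vector]) -> int:
    """Query-sensitive choice: return the id of the candidate nearest to the query.

    Assumption: `candidates` maps vertex ids to in-memory vectors (for example,
    a few representatives spread across the dataset).
    """
    if not candidates:
        raise ValueError("at least one candidate entry vertex is required")
    return min(candidates, key=lambda vid: squared_l2(query, candidates[vid]))


def static_entry_vertex(vectors: Dict[int, Vector]) -> int:
    """Static choice: the vertex nearest to the dataset centroid.

    This stands in for a fixed graph-central entry vertex; it ignores the query.
    """
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(next(iter(vectors.values())))
    n = len(vectors)
    centroid = [sum(v[i] for v in vectors.values()) / n for i in range(dim)]
    return select_entry_vertex(centroid, vectors)


if __name__ == "__main__":
    candidates = {0: (0.0, 0.0), 1: (10.0, 10.0), 2: (5.0, 5.0)}
    query = (9.0, 9.0)
    print("static entry:", static_entry_vertex(candidates))
    print("query-sensitive entry:", select_entry_vertex(query, candidates))
```

In this toy setting the static choice always starts from the central vertex, whereas the query-sensitive choice starts from a vertex already near the query, which is the intuition behind a shorter routing path and fewer I/O requests.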