DiskANN++: Efficient Page-based Search over Isomorphic Mapped Graph Index using Query-sensitivity Entry Vertex

Given a vector dataset $\mathcal{X}$ and a query vector $\vec{x}_q$, graph-based Approximate Nearest Neighbor Search (ANNS) builds a graph index $G$ and searches over $G$ to approximately return the vectors with minimum distance to $\vec{x}_q$. The main drawback of graph-based ANNS is that the graph index can be too large to fit into memory, especially for a large-scale $\mathcal{X}$. To address this, DiskANN, a hybrid method based on Product Quantization (PQ), stores a low-dimensional PQ index in memory and keeps the graph index on SSD, reducing memory overhead while maintaining high search accuracy. However, it suffers from two I/O issues that significantly affect overall efficiency: (1) a long routing path from the entry vertex to the query's neighborhood, which results in a large number of I/O requests, and (2) redundant I/O requests during the routing process. We propose DiskANN++, an optimized variant, to overcome these issues. For the first issue, we present a query-sensitive entry vertex selection strategy that replaces DiskANN's static graph-central entry vertex with a dynamically determined entry vertex close to the query. For the second issue, we present an isomorphic mapping on DiskANN's graph index to optimize the SSD layout, and propose an asynchronously optimized Pagesearch based on this layout as an alternative to DiskANN's beamsearch. Comprehensive experimental studies on eight real-world datasets demonstrate the superior efficiency of DiskANN++: under the same accuracy constraint, it achieves a 1.5x to 2.2x improvement in queries per second (QPS) over DiskANN.
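
As a rough illustration of the query-sensitive entry vertex idea, the sketch below picks, from a small in-memory set of candidate vertices, the one closest to the query, and falls back to a static entry vertex when no candidates are available. The abstract does not say how candidates are obtained or which distance is used, so the candidate set, the squared Euclidean distance, and the names `select_entry_vertex`, `candidates`, and `static_entry` are all assumptions made for illustration, not the authors' implementation.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_entry_vertex(query, candidates, static_entry):
    """Return the id of the candidate vertex closest to the query.

    candidates: dict mapping vertex id -> vector, assumed to be kept in
        memory (hypothetical; e.g. a sample of graph vertices).
    static_entry: id of a fixed fallback entry vertex, standing in for a
        static graph-central entry vertex.
    """
    if not candidates:
        return static_entry
    # Linear scan over the candidate set; assumed small enough that this
    # is cheap compared with the SSD reads it is meant to save.
    return min(candidates, key=lambda vid: squared_l2(query, candidates[vid]))


if __name__ == "__main__":
    candidates = {0: [0.0, 0.0], 3: [5.0, 5.0], 7: [10.0, 10.0]}
    entry = select_entry_vertex([9.0, 8.0], candidates, static_entry=0)
    print(entry)
```

Starting the graph traversal from a vertex that is already near the query is intended to shorten the routing path, and with it the number of I/O requests issued before the search reaches the query's neighborhood.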
