Graph-based approaches to approximate nearest neighbor search (ANNS) enable fast, high-recall retrieval on billion-scale vector datasets. Among them, the Sparse Neighborhood Graph (SNG) is widely used due to its strong search performance. However, because SNG is poorly understood theoretically, the truncation parameter that controls graph sparsification must be tuned at considerable cost. In this work, we present OPT-SNG, a principled framework for analyzing and optimizing SNG construction. We introduce a martingale-based model of the pruning process that characterizes the stochastic evolution of candidate sets during graph construction. Using this framework, we prove that SNG has a maximum out-degree of \(O(n^{2/3+\varepsilon})\), where \(\varepsilon>0\) is an arbitrarily small constant, and an expected search path length of \(O(\log n)\). Building on these insights, we derive a closed-form rule for selecting the optimal truncation parameter \(R\), thereby eliminating the need for costly parameter sweeps. Extensive experiments on real-world datasets demonstrate that OPT-SNG achieves an average \(5.9\times\) speedup in index construction time, with peak improvements reaching \(15.4\times\), while consistently maintaining or improving search performance.
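For readers unfamiliar with the role of the truncation parameter \(R\), the following is a minimal sketch of SNG-style pruning as it is commonly described in the graph-based ANNS literature. It is not the OPT-SNG procedure, and it does not implement the closed-form rule for \(R\), which the abstract does not state. The function name `sng_prune`, the Euclidean metric, and the tie-breaking by index are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sng_prune(p: int, candidates: List[int], points: List[Point], R: int) -> List[int]:
    """Pick at most R out-neighbors for points[p] from a candidate list.

    Hypothetical helper: repeatedly keep the candidate nearest to p, then
    discard every remaining candidate that is closer to the kept neighbor
    than to p. R truncates the result (the maximum out-degree).
    """
    def d(a: int, b: int) -> float:
        return math.dist(points[a], points[b])  # assumed Euclidean metric

    # Nearest candidates first; ties broken by index so the result is deterministic.
    pool = sorted({c for c in candidates if c != p}, key=lambda c: (d(p, c), c))
    neighbors: List[int] = []
    while pool and len(neighbors) < R:
        star = pool.pop(0)          # nearest surviving candidate
        neighbors.append(star)
        # Sparsification step: drop candidates that star already "covers".
        pool = [c for c in pool if d(star, c) >= d(p, c)]
    return neighbors


# Toy data (assumed, for illustration only).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
print(sng_prune(0, [1, 2, 3, 4], points, R=4))  # prints [1, 3]
print(sng_prune(0, [1, 2, 3, 4], points, R=1))  # prints [1]
```

In this sketch a small \(R\) cuts the loop short and caps the out-degree, while a large \(R\) lets pruning run until the candidate pool is empty. Choosing \(R\) is the tuning step that the abstract says a closed-form rule replaces.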