Graph-based Approximate Nearest Neighbor (ANN) search often degrades in high-dimensional spaces because of the Euclidean-Geodesic mismatch, in which greedy routing diverges from the underlying data manifold. To address this challenge, we propose Manifold-Consistent Graph Indexing (MCGI), a geometry-aware, disk-resident indexing method that uses Local Intrinsic Dimensionality (LID) to adapt its search strategy to the intrinsic geometry of the data. Unlike standard algorithms that treat dimensions uniformly, MCGI modulates its beam search budget based on in situ geometric analysis, eliminating the dependency on static hyperparameters. Theoretical analysis confirms that MCGI provides robust approximation guarantees by preserving manifold-consistent topological connectivity. We evaluate MCGI against three industry-standard baselines on five datasets ranging from million to billion scale. On the high-dimensional GIST1M dataset, MCGI achieves 5.8x higher throughput at 95% recall than the state-of-the-art DiskANN. On the billion-scale SIFT1B and T2I-1B datasets, MCGI reduces high-recall query latency by 3x, while maintaining performance parity on standard lower-dimensional benchmarks.
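To make the idea of an LID-modulated beam budget concrete, the following is a minimal sketch and not the MCGI implementation. It assumes the widely used maximum-likelihood LID estimator computed from k-nearest-neighbor distances, and a purely hypothetical linear mapping from the LID estimate to a beam width. The function names `lid_mle` and `beam_width`, and the parameters `base`, `ref_lid`, `lo`, and `hi`, are illustrative assumptions; the abstract does not specify the estimator or the mapping that MCGI actually uses.

```python
import math


def lid_mle(knn_dists):
    """Maximum-likelihood LID estimate from distances to the k nearest neighbors.

    Computes -1 / mean(log(r_i / r_max)) over the positive distances.
    This is a common estimator; that MCGI uses it is an assumption.
    """
    d = [x for x in knn_dists if x > 0]
    if len(d) < 2:
        return 0.0  # not enough information to estimate
    r_max = max(d)
    mean_log = sum(math.log(x / r_max) for x in d) / len(d)
    if mean_log == 0.0:
        return math.inf  # all neighbors equidistant: estimate is unbounded
    return -1.0 / mean_log


def beam_width(lid, base=32, ref_lid=10.0, lo=16, hi=256):
    """Hypothetical mapping: scale the beam width linearly with LID, then clamp.

    base, ref_lid, lo, and hi are illustrative values, not taken from the paper.
    """
    scaled = base * lid / ref_lid
    clamped = min(float(hi), max(float(lo), scaled))
    return int(round(clamped))


if __name__ == "__main__":
    k = 100
    # Synthetic neighbor distances whose growth rate mimics 5-D and 20-D data.
    low_dim = [(i / k) ** (1 / 5) for i in range(1, k + 1)]
    high_dim = [(i / k) ** (1 / 20) for i in range(1, k + 1)]
    for name, dists in (("low", low_dim), ("high", high_dim)):
        est = lid_mle(dists)
        print(name, round(est, 2), beam_width(est))
```

Under these assumptions, regions with a higher estimated LID receive a larger beam and regions with a lower estimate receive a smaller one, which is the qualitative behavior the abstract describes.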