MCGI: Manifold-Consistent Graph Indexing for Billion-Scale Disk-Resident Vector Search

Graph-based Approximate Nearest Neighbor (ANN) search often suffers from performance degradation in high-dimensional spaces due to the "Euclidean-Geodesic mismatch," where greedy routing diverges from the underlying data manifold. To address this, we propose Manifold-Consistent Graph Indexing (MCGI), a geometry-aware and disk-resident indexing method that leverages Local Intrinsic Dimensionality (LID) to dynamically adapt search strategies to the data's intrinsic geometry. Unlike standard algorithms that treat dimensions uniformly, MCGI modulates its beam search budget based on in situ geometric analysis, eliminating dependency on static hyperparameters. Theoretical analysis confirms that MCGI enables improved approximation guarantees by preserving manifold-consistent topological connectivity. Empirically, MCGI achieves 5.8× higher throughput at 95% recall on high-dimensional GIST1M compared to state-of-the-art DiskANN. On the billion-scale SIFT1B dataset, MCGI further validates its scalability by reducing high-recall query latency by 3×, while maintaining performance parity on standard lower-dimensional datasets.

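To make the idea of an LID-driven search budget concrete, the sketch below estimates LID at a point from its k-nearest-neighbor distances using the standard maximum-likelihood (Hill-type) estimator, then maps the estimate to a beam width. The abstract does not state which estimator or mapping MCGI actually uses, so everything here is an assumption for illustration: the function names `lid_mle` and `beam_width`, the linear interpolation, and the thresholds `lid_lo`, `lid_hi`, `width_min`, and `width_max` are all hypothetical.

```python
import math


def lid_mle(distances):
    """Maximum-likelihood LID estimate from k-nearest-neighbor distances.

    Computes -k / sum_i ln(r_i / r_k), where r_k is the largest distance.
    This is a generic estimator, not necessarily the one used by MCGI.
    """
    r = sorted(d for d in distances if d > 0)
    if len(r) < 2:
        raise ValueError("need at least two positive distances")
    r_max = r[-1]
    log_sum = sum(math.log(d / r_max) for d in r)
    if log_sum == 0.0:
        # All neighbors are equidistant, so the estimate is unbounded.
        return float("inf")
    return -len(r) / log_sum


def beam_width(lid, lid_lo=5.0, lid_hi=30.0, width_min=16, width_max=128):
    """Hypothetical mapping: widen the beam linearly as LID grows.

    The thresholds and width range are illustrative placeholders.
    """
    t = (lid - lid_lo) / (lid_hi - lid_lo)
    t = min(1.0, max(0.0, t))  # clamp to [0, 1]
    return round(width_min + t * (width_max - width_min))


if __name__ == "__main__":
    # Synthetic neighbor distances whose radial profile mimics a 2-D manifold.
    k = 1000
    dists = [(i / k) ** 0.5 for i in range(1, k + 1)]
    lid = lid_mle(dists)
    print(f"estimated LID: {lid:.2f}, beam width: {beam_width(lid)}")
```

Under these assumptions, a query in a locally low-dimensional region keeps a narrow beam and issues fewer disk reads, while a query in a locally high-dimensional region gets a wider beam to compensate for less reliable greedy routing.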