On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely believed to be limited by prohibitive I/O costs. Interestingly, we observe that as vector dimensionality rises (e.g., to hundreds or thousands of dimensions), on-disk graph-based index systems become compute-bound rather than I/O-bound. This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, leaving substantial room for performance improvement. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale, high-dimensional vector similarity search. We first analyze the performance of existing on-disk graph-based index systems using an adapted roofline model. We then devise a novel on-disk data layout in AlayaLaser that exploits SIMD instructions on modern CPUs to alleviate the compute bottleneck revealed by this analysis. Next, we design a suite of optimization techniques (e.g., a degree-based node cache, cluster-based entry point selection, and an early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experiments on a wide range of large-scale, high-dimensional vector datasets to verify the effectiveness of AlayaLaser. The results show that AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems.
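To make the compute-bound versus I/O-bound distinction concrete, the following is a minimal sketch of the classic roofline bound, not the adapted model used in this work. The function names and all numeric values are hypothetical and chosen only for illustration; operational intensity is taken here as floating-point operations per byte read from disk.

```python
def attainable_flops(peak_flops: float, io_bandwidth: float, intensity: float) -> float:
    """Classic roofline bound.

    peak_flops   -- compute roof, in FLOP/s (hypothetical value supplied by caller)
    io_bandwidth -- bandwidth roof, in bytes/s (hypothetical value supplied by caller)
    intensity    -- operational intensity, in FLOP per byte transferred
    """
    # Performance is capped by whichever roof is lower at this intensity.
    return min(peak_flops, io_bandwidth * intensity)


def is_compute_bound(peak_flops: float, io_bandwidth: float, intensity: float) -> bool:
    """Return True when the workload sits at or beyond the ridge point."""
    # The ridge point is the intensity at which the two roofs intersect.
    ridge_point = peak_flops / io_bandwidth
    return intensity >= ridge_point


if __name__ == "__main__":
    # Illustrative numbers only: 100 FLOP/s compute roof, 10 bytes/s I/O roof.
    for intensity in (2.0, 10.0, 50.0):
        bound = "compute" if is_compute_bound(100.0, 10.0, intensity) else "I/O"
        print(intensity, attainable_flops(100.0, 10.0, intensity), bound)
```

Under this simplified view, a workload whose operational intensity exceeds the ridge point gains nothing from further I/O reduction, which is the situation the analysis above attributes to high-dimensional vectors.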