On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely believed to be limited by prohibitive I/O costs. Interestingly, we observe that as vector dimensionality rises (e.g., to hundreds or thousands of dimensions), on-disk graph-based index systems become compute-bound rather than I/O-bound. This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, which leaves substantial room for performance improvement. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale, high-dimensional vector similarity search. We first analyze the performance of existing on-disk graph-based index systems using an adapted roofline model. Guided by this analysis, we devise a novel on-disk data layout in AlayaLaser that alleviates the compute bottleneck by exploiting SIMD instructions on modern CPUs. We then design a suite of optimization techniques (e.g., a degree-based node cache, cluster-based entry point selection, and an early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experiments on a wide range of large-scale, high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems.
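To make the compute-bound versus I/O-bound distinction concrete, the following is a minimal sketch of the classic roofline bound, not the adapted model used in this work. The function names, the definition of intensity as operations per byte read from disk, and all numeric values are illustrative assumptions rather than measurements.

```python
def attainable_throughput(peak_compute, io_bandwidth, intensity):
    """Classic roofline bound.

    peak_compute: peak operations per second (assumed value)
    io_bandwidth: bytes per second delivered by the disk (assumed value)
    intensity:    operations performed per byte read
    """
    return min(peak_compute, io_bandwidth * intensity)


def bound_type(peak_compute, io_bandwidth, intensity):
    """Classify a workload by comparing its intensity to the ridge point."""
    ridge = peak_compute / io_bandwidth  # intensity where the two roofs meet
    return "compute-bound" if intensity >= ridge else "I/O-bound"


if __name__ == "__main__":
    # Hypothetical machine: 1e11 ops/s of compute, 2e9 bytes/s of disk bandwidth.
    PEAK, BW = 1e11, 2e9
    for intensity in (10, 100):
        print(intensity,
              bound_type(PEAK, BW, intensity),
              attainable_throughput(PEAK, BW, intensity))
```

Under these assumed numbers the ridge point is 50 operations per byte, so a workload below that intensity is limited by the disk and one above it is limited by the CPU, which is the regime where reducing I/O alone no longer helps.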