High-dimensional vector similarity search (HVSS) is attracting growing attention as a powerful tool for data science and AI applications. As vector datasets grow, in-memory indexes become prohibitively expensive because they require ever more main memory. One possible solution is a disk-based implementation, which stores and searches vector data on high-performance devices such as NVMe SSDs. However, HVSS on data segments remains challenging in vector databases, where a single machine hosts multiple segments to support system features such as scaling. In this setting, each segment has limited memory and disk space, so HVSS on a data segment must balance accuracy, efficiency, and space cost. Existing disk-based methods are sub-optimal because they do not consider all of these requirements together. In this paper, we present Starling, an I/O-efficient disk-resident graph index framework that optimizes the data layout and search strategy within a segment. It has two main components: (1) a data layout that combines an in-memory navigation graph with a reordered disk-based graph with enhanced locality, which shortens the search path and reduces wasted disk bandwidth; and (2) a block search strategy that minimizes expensive disk I/Os when executing a vector query. We conduct extensive experiments to verify Starling's effectiveness, efficiency, and scalability. On a data segment with 2GB of memory and 10GB of disk capacity, Starling can maintain up to 33 million 128-dimensional vectors and serve HVSS with average precision and top-10 recall both above 0.9 and latency under 1 millisecond. The results show that Starling achieves 43.9$\times$ higher throughput and 98% lower query latency than state-of-the-art methods at the same accuracy.
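To make the interplay between a locality-aware layout and block-level search concrete, the following is a minimal toy sketch, not Starling's actual algorithm. It assumes a hypothetical `block_search` function, a one-dimensional chain graph, and a fixed entry vertex standing in for the result of the in-memory navigation graph. Each simulated block read scores every vertex stored in that block, so a layout that places graph neighbors in the same block should need fewer reads for the same query.

```python
import math


def block_search(query, vectors, neighbors, blocks, entry, k=2, beam=4):
    """Toy best-first graph search that counts simulated block reads.

    Hypothetical illustration: `blocks` is a list of vertex-id lists, each
    standing for one disk block holding those vertices' vectors and edges.
    """
    block_of = {v: b for b, members in enumerate(blocks) for v in members}
    scored = {}        # vertex -> distance, for every vertex read so far
    expanded = set()   # vertices whose neighbor lists were followed
    loaded = set()     # block ids already read
    io_count = 0

    def load(v):
        """Read the block containing v (one I/O) and score all its vertices."""
        nonlocal io_count
        b = block_of[v]
        if b in loaded:
            return
        loaded.add(b)
        io_count += 1
        for u in blocks[b]:
            scored[u] = math.dist(query, vectors[u])

    load(entry)
    while True:
        # Keep only the `beam` closest vertices as expansion candidates.
        frontier = sorted(scored, key=scored.get)[:beam]
        pending = [v for v in frontier if v not in expanded]
        if not pending:
            break
        v = pending[0]
        expanded.add(v)
        for u in neighbors[v]:
            if u not in scored:
                load(u)
    return sorted(scored, key=scored.get)[:k], io_count


# Assumed toy data: 16 points on a line, each linked to its two neighbors.
n = 16
vectors = {i: (float(i),) for i in range(n)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
query = (11.2,)

# Layout 1 keeps graph neighbors together; layout 2 scatters them.
local_layout = [list(range(s, s + 4)) for s in range(0, n, 4)]
scattered_layout = [[i for i in range(n) if i % 4 == r] for r in range(4)]

local_ids, local_io = block_search(query, vectors, neighbors, local_layout, entry=8)
scattered_ids, scattered_io = block_search(query, vectors, neighbors, scattered_layout, entry=8)
print("local layout:    ", local_ids, "block reads:", local_io)
print("scattered layout:", scattered_ids, "block reads:", scattered_io)
```

Both layouts return the same nearest neighbors in this toy setting, but the locality-preserving layout reaches them with fewer block reads, which is the intuition behind reordering the disk-resident graph and searching at block granularity.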