Approximate nearest neighbor (ANN) search on SSD-backed indexes is increasingly I/O-bound: I/O accounts for 70--90\% of query latency. We present an I/O-first framework for disk-based ANN that organizes techniques along three dimensions: memory layout, disk layout, and search algorithm. We introduce a page-level complexity model that explains how page locality and path length jointly determine the number of page reads, and we validate the model empirically. Using consistent implementations across four public datasets, we quantify both single-factor effects and cross-dimensional synergies. We find that (i) memory-resident navigation and dynamic width provide the strongest standalone gains; (ii) page shuffle and page search are weak in isolation but complementary when combined; and (iii) a principled composition, OctopusANN, substantially reduces I/O and achieves 4.1--37.9\% higher throughput than the state-of-the-art system Starling and 87.5--149.5\% higher throughput than DiskANN at matched Recall@10=90\%. Finally, we distill actionable guidelines for selecting storage-centric or hybrid designs across diverse concurrency levels and accuracy constraints, advocating systematic composition rather than isolated tweaks when pushing the performance frontier of disk-based ANN.
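As a rough illustration of how page locality and path length can jointly determine page reads, the sketch below counts the distinct pages touched by one search path under two node-to-page layouts. It is not the paper's complexity model: the function names, the 16-node toy graph, the example path, and the assumption that a page stays cached for the duration of a query are all hypothetical choices made for this example.

```python
from typing import Dict, Iterable, List


def count_page_reads(path: Iterable[int], page_of: Dict[int, int]) -> int:
    """Count distinct pages touched by a search path.

    Assumption: a page fetched once stays cached for the rest of the query,
    so each distinct page costs exactly one read.
    """
    fetched = set()
    for node in path:
        fetched.add(page_of[node])
    return len(fetched)


def layout_by_id(num_nodes: int, nodes_per_page: int) -> Dict[int, int]:
    """Baseline layout: nodes are packed into pages in node-id order."""
    return {v: v // nodes_per_page for v in range(num_nodes)}


def layout_by_order(order: List[int], nodes_per_page: int) -> Dict[int, int]:
    """Locality-aware layout: nodes are packed into pages in the given order."""
    return {v: i // nodes_per_page for i, v in enumerate(order)}


# Hypothetical search path over a 16-node graph, 4 nodes per page.
path = [0, 5, 10, 15, 3, 8]

# Baseline: nodes on the path are scattered across pages.
reads_by_id = count_page_reads(path, layout_by_id(16, 4))

# Packed: nodes that tend to be visited together share pages.
packed_order = [0, 5, 10, 15, 3, 8, 1, 2, 4, 6, 7, 9, 11, 12, 13, 14]
reads_packed = count_page_reads(path, layout_by_order(packed_order, 4))

print(reads_by_id, reads_packed)
```

In this toy setting the path length is fixed at six nodes, so any difference in page reads between the two layouts comes from locality alone; a longer path under the same layout would touch at least as many pages.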