The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds

Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW in real-world scenarios, with vectors generated by deep learning models, remains under-explored. Existing Approximate Nearest Neighbours (ANN) benchmarks and research typically over-rely on simplistic datasets such as MNIST or SIFT1M and fail to reflect the complexity of current use-cases. Our investigation focuses on HNSW's efficacy across a spectrum of datasets, including synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic parameterisation that is held fixed throughout the paper. We find that the recall of approximate HNSW search, measured against exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality and significantly influenced by the data insertion sequence. Our methodology shows that insertion order, informed by measurable properties such as the pointwise Local Intrinsic Dimensionality (LID) or known categories, can shift recall by up to 12 percentage points. We also observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models. This work underscores the need for more nuanced benchmarks and design considerations when developing robust vector search systems that use approximate vector search algorithms. The study presents a number of scenarios with varying real-world applicability that aim to improve understanding and guide future development of ANN algorithms and embeddings.
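To make the two quantities named in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes Euclidean distance, uses the standard maximum-likelihood estimator for pointwise LID, and runs on a small random dataset invented purely for illustration. The function names (`exact_knn`, `recall_at_k`, `lid_mle`) and the values `k=20` and 200 points are hypothetical choices, and the descending sort is only one possible LID-informed ordering.

```python
import math
import random


def exact_knn(data, query, k):
    """Indices of the k points nearest to `query` (brute-force Euclidean)."""
    order = sorted(range(len(data)), key=lambda i: math.dist(data[i], query))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours present in the approximate result."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)


def lid_mle(data, index, k):
    """Maximum-likelihood estimate of pointwise LID at data[index].

    Uses the distances r_1 <= ... <= r_k to the k nearest neighbours:
        LID = -1 / mean(ln(r_i / r_k)).
    """
    point = data[index]
    dists = sorted(
        math.dist(point, other) for j, other in enumerate(data) if j != index
    )[:k]
    r_k = dists[-1]
    # Skip zero distances (exact duplicates) to avoid log(0).
    logs = [math.log(d / r_k) for d in dists if d > 0]
    mean_log = sum(logs) / len(logs)
    # If all neighbours are equidistant the estimate is unbounded.
    return -1.0 / mean_log if mean_log < 0 else float("inf")


# Illustrative data only: 200 random 8-dimensional Gaussian vectors.
random.seed(0)
data = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

# Pointwise LID for every vector, then one candidate insertion order
# (descending LID). An index would be built by inserting in this order.
lids = [lid_mle(data, i, k=20) for i in range(len(data))]
insertion_order = sorted(range(len(data)), key=lambda i: lids[i], reverse=True)
```

In an experiment of the kind described, `insertion_order` would determine the sequence in which vectors are added to an HNSW index, and `recall_at_k` would compare that index's results for each query against `exact_knn`.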
