Approximate nearest neighbour (ANN) search has become a central task in modern data-intensive applications, particularly when operating on large, heterogeneous, or high-dimensional datasets. However, many existing ANN methods struggle in such scenarios, either because they rely on metric assumptions or because their indexing strategies are poorly suited to distributed environments or to settings with constrained memory resources. This work introduces PDASC (Parametrizable Distributed Approximate Similarity Search with Clustering), a distributed ANN search algorithm whose index design simultaneously supports arbitrary dissimilarity functions and efficient deployment in distributed, storage-aware environments. PDASC builds a distributed hierarchical index based on clustering mechanisms that are agnostic to distance properties, thereby accommodating non-metric and domain-specific similarities while naturally partitioning indexing and search across multiple computing nodes, with a compact per-node memory footprint. By preserving locally informative neighbourhood structure, the proposed index mitigates practical manifestations of the curse of dimensionality in high-dimensional spaces. We analyse how the structural parameters of the index govern the trade-offs among recall, computational cost, and memory usage. Experimental evaluation across multiple benchmark datasets and distance functions shows that PDASC achieves competitive accuracy-efficiency trade-offs while consistently requiring less per-node memory than state-of-the-art ANN methods. By avoiding reliance on specialised hardware acceleration, PDASC enables scalable and energy-efficient similarity search in heterogeneous and distributed settings where memory efficiency and distance-function flexibility are first-class constraints.
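To make the idea of a clustering-based index that is agnostic to distance properties more concrete, the following is a minimal Python sketch. It is not the PDASC implementation: the abstract does not specify the clustering procedure, so k-medoids is used here as an assumed stand-in because it needs only pairwise dissimilarities. Round-robin sharding is an assumed stand-in for distribution across nodes, and the index is simplified to a single level per node rather than a full hierarchy. All names (`k_medoids`, `build_index`, `search`, `n_nodes`, `n_probe`) are hypothetical.

```python
import heapq
import random
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Dissim = Callable[[Point, Point], float]
Cluster = Tuple[Point, List[Point]]  # (medoid, members)


def medoid_of(members: List[Point], dissim: Dissim) -> Point:
    # The member with the smallest total dissimilarity to its own cluster.
    return min(members, key=lambda c: sum(dissim(c, p) for p in members))


def k_medoids(points: List[Point], k: int, dissim: Dissim,
              iters: int = 10, seed: int = 0) -> List[Cluster]:
    # Alternating k-medoids (assumed stand-in): it uses only pairwise
    # dissimilarities, so no metric axioms are required of `dissim`.
    rng = random.Random(seed)
    medoids = rng.sample(points, min(k, len(points)))
    groups: List[List[Point]] = []
    for _ in range(iters):
        # Assignment step: each point joins its least-dissimilar medoid.
        buckets: List[List[Point]] = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)),
                          key=lambda i: dissim(p, medoids[i]))
            buckets[nearest].append(p)
        groups = [b for b in buckets if b]  # drop empty clusters
        # Update step: recompute one medoid per non-empty cluster.
        updated = [medoid_of(g, dissim) for g in groups]
        if updated == medoids:
            break
        medoids = updated
    return list(zip(medoids, groups))


def build_index(data: List[Point], n_nodes: int, k: int,
                dissim: Dissim) -> List[List[Cluster]]:
    # Round-robin sharding simulates nodes (an assumption); each shard is
    # clustered independently, so shards could be built in parallel.
    shards = [data[i::n_nodes] for i in range(n_nodes)]
    return [k_medoids(shard, k, dissim) for shard in shards if shard]


def search(index: List[List[Cluster]], query: Point, top_k: int,
           n_probe: int, dissim: Dissim) -> List[Tuple[float, Point]]:
    candidates: List[Tuple[float, Point]] = []
    for clusters in index:  # one iteration per simulated node
        # Probe only the n_probe clusters whose medoids best match the query.
        probed = heapq.nsmallest(n_probe, clusters,
                                 key=lambda c: dissim(query, c[0]))
        for _, members in probed:
            candidates.extend((dissim(query, p), p) for p in members)
    # Merge the per-node candidates into a global top-k.
    return heapq.nsmallest(top_k, candidates, key=lambda t: t[0])
```

In this sketch, `k` and `n_probe` play the role of structural parameters: a larger `n_probe` scans more clusters and raises recall at higher computational cost, and probing every cluster on every node reduces the search to an exact scan. Any callable can be passed as `dissim`, for example squared Euclidean distance, which violates the triangle inequality.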