Approximate Nearest Neighbor Search (ANNS) is a cornerstone technique for information retrieval, recommendation systems, and machine learning applications. While x86-based architectures have historically dominated this domain, the increasing adoption of ARM-based servers in industry creates a pressing need for ANNS solutions optimized for ARM architectures. A naive port of existing x86 ANNS algorithms to ARM platforms incurs a substantial performance deficit because it fails to leverage the capabilities of the underlying hardware. To address this challenge, we introduce KScaNN, a novel ANNS algorithm co-designed for the Kunpeng 920 ARM architecture. KScaNN combines data-aware algorithmic refinements with carefully designed hardware-specific optimizations. Its core contributions include: 1) novel algorithmic techniques, including a hybrid intra-cluster search strategy and an improved PQ residual calculation method, which optimize the search process at the algorithmic level; 2) an ML-driven adaptive search module that tunes search parameters per query, eliminating the inefficiencies of static configurations; and 3) highly optimized SIMD kernels for ARM that maximize hardware utilization for the critical distance computation workloads. Experimental results demonstrate that KScaNN not only closes the performance gap but also sets a new standard, achieving up to a 1.63x speedup over the fastest x86-based solution. This work provides a blueprint for achieving leadership-class performance for vector search on modern ARM architectures and underscores