Approximate K-Nearest Neighbor (AKNN) search in high-dimensional spaces is a critical yet challenging problem. The efficiency of AKNN search largely depends on the computation of distances, a process that significantly affects the runtime. To improve computational efficiency, existing work often opts for estimating approximate distances rather than computing exact distances, at the cost of reduced AKNN search accuracy. The recent method of ADSampling has attempted to mitigate this problem by using random projection for distance approximations and adjusting these approximations based on error bounds to improve accuracy. However, ADSampling faces limitations in effectiveness and generality, mainly due to the suboptimality of its distance approximations and its heavy reliance on random projection matrices to obtain error bounds. In this study, we propose a new method that uses an optimal orthogonal projection instead of random projection, thereby providing improved distance approximations. Moreover, our method uses error quantiles instead of error bounds for approximation adjustment, and the derivation of error quantiles can be made independent of the projection matrix, thus extending the generality of our approach. Extensive experiments confirm the superior efficiency and effectiveness of the proposed method. In particular, compared to the state-of-the-art method of ADSampling, our method achieves a speedup of 1.6 to 2.1 times on real datasets with almost no loss of accuracy.
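The two ideas named in the abstract, an orthogonal projection for distance approximation and an error quantile for adjusting that approximation, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not taken from the paper: PCA stands in for the "optimal orthogonal projection", the quantile is calibrated empirically from sampled data pairs, and the function names, the dimension schedule `dims`, and the miss rate `eps` are hypothetical. Because the projection is orthogonal, the squared distance over the first `d` rotated coordinates never exceeds the full squared distance, so the ratio of the two lies in [0, 1]; dividing the partial distance by the (1 - eps)-quantile of that ratio gives an estimate that undershoots the true distance for roughly a 1 - eps fraction of sampled pairs.

```python
import numpy as np

def fit_orthogonal_projection(data):
    """Assumed stand-in for the projection: PCA basis, highest variance first."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / len(data)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending variance
    return eigvecs[:, order]                    # D x D orthogonal matrix

def calibrate_ratio_quantiles(rotated, dims, eps=0.05, n_pairs=2000, seed=0):
    """Empirical (1 - eps)-quantile of partial/full squared distance per d."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(rotated), n_pairs)
    j = rng.integers(0, len(rotated), n_pairs)
    diff_sq = (rotated[i] - rotated[j]) ** 2
    full = diff_sq.sum(axis=1)
    keep = full > 0                             # drop identical pairs
    prefix = np.cumsum(diff_sq[keep], axis=1)   # partial sums over leading dims
    full = full[keep]
    return {d: float(np.quantile(prefix[:, d - 1] / full, 1 - eps)) for d in dims}

def distance_or_prune(q_rot, x_rot, threshold_sq, dims, quantiles):
    """Return (squared distance, True) if computed exactly,
    or (adjusted estimate, False) if the candidate was pruned early."""
    full_dim = len(x_rot)
    partial, start = 0.0, 0
    for d in dims:
        partial += float(np.sum((q_rot[start:d] - x_rot[start:d]) ** 2))
        start = d
        if d < full_dim:
            estimate = partial / quantiles[d]   # quantile-adjusted approximation
            if estimate > threshold_sq:
                return estimate, False          # likely farther than the threshold
    # Finish any remaining coordinates; the result is the exact squared distance.
    partial += float(np.sum((q_rot[start:] - x_rot[start:]) ** 2))
    return partial, True
```

In an AKNN search loop, `threshold_sq` would be the squared distance to the current K-th best candidate, and both the data and the query would be rotated once with the same matrix before any comparisons. Lowering `eps` makes pruning more conservative, trading speed for accuracy.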