Approximate nearest neighbor search (ANNS) on high-dimensional vectors has become a fundamental component of many machine learning tasks. Prior research has shown that the distance comparison operation is the bottleneck of ANNS and determines both query and indexing performance. To overcome this bottleneck, several novel methods have been proposed recently. Their basic idea is to estimate the actual distance with fewer calculations, at the cost of some accuracy. Inspired by this, we propose that some classical techniques and deep learning models can be adapted to the same purpose. In this paper, we systematically categorize the techniques that have been or can be used to accelerate distance approximation. To help users understand the pros and cons of the different techniques, we design a fair and comprehensive benchmark, Fudist, which implements these techniques on the same base index and evaluates them on 16 real datasets with several evaluation metrics. Designed as an independent and portable library, Fudist is orthogonal to the specific index structure and can therefore be easily integrated into existing ANNS libraries to achieve significant performance improvements.
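To make the idea of estimating a distance with fewer calculations concrete, the following is a minimal sketch and not code from Fudist. It assumes squared Euclidean distance and one simple strategy: sum the first `m` of `d` dimensions and rescale by `d / m`. The function names `exact_sq_dist` and `approx_sq_dist` and the parameter `m` are hypothetical and chosen only for illustration.

```python
import random


def exact_sq_dist(a, b):
    """Full squared Euclidean distance over all d dimensions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def approx_sq_dist(a, b, m):
    """Estimate the squared distance from the first m dimensions only.

    The partial sum is rescaled by d / m. Assumption: this is a reasonable
    estimate only when the squared differences are spread roughly evenly
    across dimensions (for example after a random rotation of the data).
    """
    d = len(a)
    if not 1 <= m <= d:
        raise ValueError("m must satisfy 1 <= m <= d")
    partial = sum((a[i] - b[i]) ** 2 for i in range(m))
    return partial * d / m


# Illustrative data: two random Gaussian vectors (hypothetical, not a benchmark).
rng = random.Random(0)
d = 128
query = [rng.gauss(0.0, 1.0) for _ in range(d)]
point = [rng.gauss(0.0, 1.0) for _ in range(d)]

exact = exact_sq_dist(query, point)
estimate = approx_sq_dist(query, point, m=32)  # uses 25% of the dimensions
print(f"exact={exact:.2f} estimate={estimate:.2f}")
```

With `m = d` the estimate coincides with the exact distance. Smaller values of `m` reduce the work per comparison but let the estimate drift from the true value, which is the speed versus accuracy trade-off described above.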