Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools for determining whether clustering-based ANNS suits a given dataset -- a property we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. The first, the Clustering-Neighborhood Stability Measure (clustering-NSM), is an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, the Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. Together, the two allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not of distances, making them applicable to various distance functions, including inner product.
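
To make the search procedure in the first sentence concrete, the following is a minimal sketch of clustering-based ANNS, assuming Euclidean distance and a plain k-means partitioning. The function names (`kmeans`, `search`, `exact_search`, `recall`) and the parameters `nprobe` and `top_k` are illustrative choices, not taken from the paper, and the sketch does not implement clustering-NSM or point-NSM, which the abstract does not define.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignment per point)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points (an assumption of this sketch).
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # Recompute assignments so they match the final centroids.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


def search(query, points, centroids, assign, nprobe, top_k):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    centroid_d2 = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(centroid_d2)[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, probed))
    d2 = ((points[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d2)[:top_k]]


def exact_search(query, points, top_k):
    """Brute-force baseline over all points."""
    d2 = ((points - query) ** 2).sum(axis=1)
    return np.argsort(d2)[:top_k]


def recall(approx_ids, exact_ids):
    """Fraction of the true neighbors that the approximate search returned."""
    if len(exact_ids) == 0:
        return 0.0
    return len(set(approx_ids.tolist()) & set(exact_ids.tolist())) / len(exact_ids)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 16))
    query = rng.normal(size=16)
    centroids, assign = kmeans(data, k=10)
    truth = exact_search(query, data, top_k=10)
    for nprobe in (1, 3, 10):
        found = search(query, data, centroids, assign, nprobe=nprobe, top_k=10)
        print(nprobe, recall(found, truth))
```

Probing every partition reduces to the brute-force search, so the quantity of interest is how much recall is retained when `nprobe` is small. That accuracy is the quantity the abstract says clustering-NSM is predictive of.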