Partitioning trees are efficient data structures for $k$-nearest neighbor ($k$-nn) search. Machine learning libraries commonly use a special type of partitioning tree, the $k$d-tree, to perform $k$-nn search. Unfortunately, $k$d-trees can be ineffective in high dimensions because they need more tree levels to decrease the vector quantization (VQ) error. Random projection trees (rpTrees) address this scalability problem by splitting the data along random directions. A collection of rpTrees is called an rpForest. $k$-nn search in an rpForest is influenced by two factors: 1) the dispersion of points along the random direction and 2) the number of rpTrees in the rpForest. In this study, we investigate how these two factors affect $k$-nn search across varying values of $k$ and different datasets. We found that with a larger number of trees, the dispersion of points has a very limited effect on $k$-nn search. We therefore recommend using the original rpTree algorithm, which picks a random direction regardless of the dispersion of points.
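To make the two factors concrete, the following is a minimal Python sketch of an rpForest and a $k$-nn query over it. It is an illustration under stated assumptions, not the implementation used in the study: the median split rule, the `leaf_size` default, the dictionary-based tree layout, and all function names are hypothetical choices made for this example. The `n_candidates` parameter is likewise an assumption about how dispersion could be taken into account: with `n_candidates=1` a single random direction is used, as in the original rpTree, while larger values keep the candidate direction with the largest projection variance.

```python
import numpy as np

def pick_direction(data, rng, n_candidates=1):
    """Sample unit directions; keep the one with the largest projection variance.
    With n_candidates=1 this is a plain random direction (dispersion ignored)."""
    dirs = rng.normal(size=(n_candidates, data.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    spread = (data @ dirs.T).var(axis=0)  # dispersion along each candidate
    return dirs[np.argmax(spread)]

def build_rptree(points, indices, rng, leaf_size=10, n_candidates=1):
    """Recursively split the points at the median of their projections.
    The median split is a simplifying assumption for this sketch."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    direction = pick_direction(points[indices], rng, n_candidates)
    proj = points[indices] @ direction
    threshold = np.median(proj)
    left, right = indices[proj <= threshold], indices[proj > threshold]
    if len(left) == 0 or len(right) == 0:  # degenerate split, e.g. duplicate points
        return {"leaf": indices}
    return {
        "direction": direction,
        "threshold": threshold,
        "left": build_rptree(points, left, rng, leaf_size, n_candidates),
        "right": build_rptree(points, right, rng, leaf_size, n_candidates),
    }

def query_leaf(tree, q):
    """Route the query down one tree and return the indices stored in its leaf."""
    while "leaf" not in tree:
        side = "left" if q @ tree["direction"] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

def build_rpforest(points, n_trees, seed=0, leaf_size=10, n_candidates=1):
    """Build n_trees independent rpTrees over the same points."""
    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(points))
    return [build_rptree(points, all_idx, rng, leaf_size, n_candidates)
            for _ in range(n_trees)]

def knn_rpforest(points, forest, q, k):
    """Approximate k-nn: pool the leaf candidates from every tree, then rank
    the pooled candidates by exact Euclidean distance to the query."""
    candidates = np.unique(np.concatenate([query_leaf(t, q) for t in forest]))
    dists = np.linalg.norm(points[candidates] - q, axis=1)
    return candidates[np.argsort(dists)[:k]]
```

In this sketch, adding trees enlarges the pooled candidate set, which makes it more likely that the true neighbors are among the candidates whichever directions were drawn. That is one intuition for why the choice of direction would matter less as the forest grows.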