The k-means clustering algorithm is a popular method for partitioning data into k clusters, and many improvements have been proposed to accelerate the standard algorithm. Most current research uses upper and lower bounds on point-to-cluster distances, together with the triangle inequality, to reduce the number of distance computations, with only arrays as underlying data structures. These approaches cannot exploit the fact that nearby points are likely to be assigned to the same cluster. We propose a new k-means algorithm based on the cover tree index that has relatively low overhead and performs well over a wider parameter range than previous approaches based on the k-d tree. By combining it with upper and lower bounds, as in state-of-the-art approaches, we obtain a hybrid algorithm that offers the benefits of both tree aggregation and bounds-based filtering.
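To make the bounds-based filtering mentioned above concrete, the following is a minimal sketch of how the triangle inequality can skip distance computations in a single assignment step. It illustrates only the general array-based idea, not the cover tree algorithm proposed here; the function name `assign_with_pruning` and the toy data are assumptions made for this example.

```python
import math

def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping centers that the
    triangle inequality proves cannot be strictly closer (illustrative only)."""
    k = len(centers)
    # Pairwise distances between centers, computed once per assignment step.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels = []
    computed = 0  # number of point-to-center distances actually evaluated
    for x in points:
        best, best_d = 0, math.dist(x, centers[0])
        computed += 1
        for j in range(1, k):
            # d(x, c_j) >= d(c_best, c_j) - d(x, c_best), so if
            # d(c_best, c_j) >= 2 * d(x, c_best) then c_j is not closer.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = math.dist(x, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed

# Hypothetical toy data: three well-separated centers and four points.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(0.5, 0.2), (9.5, 0.1), (0.3, 9.8), (1.0, 1.0)]
labels, computed = assign_with_pruning(points, centers)
```

On this toy input the pruned loop evaluates fewer point-to-center distances than the 12 a brute-force pass would need, while producing the same assignment. Each point is still tested individually, which is the limitation that tree-based aggregation of nearby points is meant to address.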