In this paper, we propose an acceleration of the exact k-means++ algorithm that exploits geometric information, specifically the triangle inequality and additional norm-based filters, together with a two-step sampling procedure. Our experiments show that the accelerated versions visit fewer points and compute fewer distances than standard k-means++, and that the speedup grows with the number of clusters. The variant based on the triangle inequality is particularly effective on low-dimensional data, while the additional norm-based filter improves performance on high-dimensional instances in which the norms of the points vary widely. Further experiments show how our algorithms behave when executed concurrently across multiple jobs and examine how memory performance affects the practical speedup.
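To make the filtering idea concrete, here is a minimal Python sketch of k-means++ seeding with a triangle-inequality filter and a norm filter. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the names `dist`, `seed_kmeanspp`, and `use_filters` are hypothetical, the filters are applied per point, and the two-step sampling procedure is not included. It assumes the input contains at least `k` distinct points. Both filters only skip points whose nearest-center distance provably cannot decrease, so the seeding should match the unfiltered one for the same random stream, up to floating-point ties.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def seed_kmeanspp(points, k, rng, use_filters=True):
    """k-means++ seeding; returns (center_indices, point_to_center_distances_computed).

    Assumes `points` holds at least k distinct points (hypothetical helper).
    """
    n = len(points)
    norms = [math.sqrt(sum(x * x for x in p)) for p in points]
    first = rng.randrange(n)
    centers = [first]
    # d[i]: distance from point i to its closest chosen center.
    # owner[i]: position in `centers` of that closest center.
    d = [dist(p, points[first]) for p in points]
    owner = [0] * n
    computed = n
    while len(centers) < k:
        # D^2 sampling: the next center is drawn with probability proportional to d[i]^2.
        weights = [x * x for x in d]
        new = rng.choices(range(n), weights=weights)[0]
        slot = len(centers)
        # Distances from the new center to every existing center.
        cc = [dist(points[new], points[c]) for c in centers]
        centers.append(new)
        for i in range(n):
            if use_filters:
                # Triangle inequality: if d(c, new) >= 2 d(x, c), then
                # d(x, new) >= d(c, new) - d(x, c) >= d(x, c), so x keeps its center.
                if cc[owner[i]] >= 2.0 * d[i]:
                    continue
                # Norm filter: d(x, new) >= | ||x|| - ||new|| |, so if that bound
                # already reaches d(x, c), the new center cannot be closer.
                if abs(norms[i] - norms[new]) >= d[i]:
                    continue
            dn = dist(points[i], points[new])
            computed += 1
            if dn < d[i]:
                d[i] = dn
                owner[i] = slot
    return centers, computed


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [(data_rng.randrange(1000), data_rng.randrange(1000)) for _ in range(500)]
    plain = seed_kmeanspp(data, 20, random.Random(7), use_filters=False)
    fast = seed_kmeanspp(data, 20, random.Random(7), use_filters=True)
    print("same centers:", plain[0] == fast[0])
    print("distance computations:", plain[1], "vs", fast[1])
```

Without filters, the sketch computes exactly one distance per point for every chosen center. With filters enabled, the count can only stay the same or go down. The extra cost is one distance from the new center to each earlier center per round.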