The K-Means algorithm is a popular clustering method. However, it has two limitations: 1) it easily gets stuck in spurious local minima, and 2) the number of clusters k has to be given a priori. To address these two issues, a multi-prototypes convex merging based K-Means clustering algorithm (MCKM) is presented. First, based on the structure of the spurious local minima of the K-Means problem, a multi-prototypes sampling (MPS) procedure is designed to select an appropriate number of multi-prototypes for data with arbitrary shapes. A theoretical proof guarantees that the multi-prototypes selected by MPS achieve a constant-factor approximation to the optimal cost of the K-Means problem. Then, a merging technique called convex merging (CM) merges the multi-prototypes to reach a better local minimum without k being given a priori. Specifically, CM can obtain the optimal merging and estimate the correct k. By integrating these two techniques with the K-Means algorithm, the proposed MCKM is an efficient and explainable clustering algorithm that escapes undesirable local minima of the K-Means problem without requiring k in advance. Experiments on synthetic and real-world data sets verify the effectiveness of the proposed algorithm.
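The two-stage idea in the abstract (fit more prototypes than clusters, then merge them and read k off the result) can be illustrated with a minimal sketch. This is not the paper's MPS or CM: the seeding below is plain farthest-point initialization, and the merge is a simple distance-threshold union of prototypes standing in for convex merging. The function names and the parameters `m` and `merge_radius` are hypothetical, chosen only for this illustration.

```python
import math
import random


def farthest_point_init(points, m):
    """Pick m seeds: start at points[0], then repeatedly take the point
    farthest from the seeds chosen so far (a stand-in for MPS)."""
    seeds = [points[0]]
    while len(seeds) < m:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def lloyd(points, centers, iters=100):
    """Standard Lloyd iterations from the given initial centers."""
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for j, c in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            # An empty prototype keeps its old position.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else c)
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)


def merge_prototypes(centers, merge_radius):
    """Union prototypes closer than merge_radius (a stand-in for CM).
    Returns one group id per prototype, numbered from 0."""
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < merge_radius:
                parent[find(i)] = find(j)

    roots = sorted({find(i) for i in range(len(centers))})
    return [roots.index(find(i)) for i in range(len(centers))]


def overseed_and_merge(points, m, merge_radius):
    """Fit m prototypes with K-Means, merge them, and return
    (merged label per point, estimated number of clusters)."""
    centers, labels = lloyd(points, farthest_point_init(points, m))
    groups = merge_prototypes(centers, merge_radius)
    return [groups[l] for l in labels], len(set(groups))


# Toy data: three well-separated square blobs, 30 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
points = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
          for cx, cy in blob_centers for _ in range(30)]

merged_labels, k_est = overseed_and_merge(points, m=9, merge_radius=5.0)
print(k_est)  # 3: the nine prototypes collapse to one group per blob
```

On this toy data the estimate does not depend on the random draw, because the blobs are much farther apart than `merge_radius` and much narrower than it. A fixed threshold like this would not handle clusters of arbitrary shape or varying scale, which is the setting the abstract's MPS and CM are designed for.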