Determining the number of clusters is a fundamental problem in data clustering. Several algorithms have been proposed, including centroid-based algorithms that use the Euclidean distance and model-based algorithms that use a mixture of probability distributions. Among these, greedy algorithms that search for the number of clusters by repeatedly splitting or merging clusters are advantageous in computation time for problems with large sample sizes. However, studies that compare these methods in systematic evaluation experiments are still lacking. This study examines centroid- and model-based cluster search algorithms across a variety of cases that Gaussian mixture models (GMMs) can generate. The cases are generated by combining five factors: dimensionality, sample size, number of clusters, cluster overlap, and covariance type. The results show that some cluster-splitting criteria based on the Euclidean distance make unreasonable decisions when clusters overlap. They also show that, when the sample size is sufficient, model-based algorithms are less sensitive to covariance type and cluster overlap than centroid-based methods. Our implementations of the cluster search algorithms are available at https://github.com/lipryou/searchClustK.
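The way evaluation cases are built by combining the five factors can be illustrated with a minimal sketch. The factor levels, the function names `make_cases` and `sample_spherical_gmm`, and the restriction to spherical covariance in the sampler are assumptions made here for illustration only; they are not taken from the study or from the linked repository.

```python
import itertools
import random

# Hypothetical factor levels, chosen only for illustration.
# The levels actually used in the study are not stated in the abstract.
FACTORS = {
    "dim": [2, 10],
    "n_samples": [500, 5000],
    "n_clusters": [3, 10],
    "overlap": ["low", "high"],
    "cov_type": ["spherical", "diagonal", "full"],
}


def make_cases(factors):
    """Return one dict per combination of factor levels (full factorial design)."""
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in itertools.product(*(factors[name] for name in names))
    ]


def sample_spherical_gmm(n_samples, means, sigma, rng):
    """Draw points from an equal-weight GMM whose covariance is sigma^2 * I."""
    points, labels = [], []
    for _ in range(n_samples):
        k = rng.randrange(len(means))  # pick a component uniformly at random
        points.append([rng.gauss(m, sigma) for m in means[k]])
        labels.append(k)
    return points, labels


cases = make_cases(FACTORS)

# Example draw: two well-separated components in two dimensions.
rng = random.Random(0)
points, labels = sample_spherical_gmm(200, [[0.0, 0.0], [5.0, 5.0]], 1.0, rng)
```

In this sketch, "overlap" is only a label on each case. One plausible way to realize it would be to scale the distance between component means relative to `sigma`, but that mapping is an assumption and is not described in the abstract.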