Centroid-based clustering algorithms, such as hard K-means (HKM) and fuzzy K-means (FKM), suffer from a learning bias toward large clusters. Their centroids tend to crowd into large clusters, which degrades performance when the true underlying groups vary in size (i.e., on imbalanced data). To address this, we propose a new clustering objective function based on the Boltzmann operator. It introduces a novel centroid repulsion mechanism in which the data points surrounding a centroid repel other centroids. Larger clusters exert stronger repulsion, which effectively mitigates the bias toward large clusters. The proposed algorithm, called equilibrium K-means (EKM), is simple, alternating between two steps; resource-efficient, with the same time and space complexity as FKM; and scalable to large datasets via batch learning. We extensively evaluate EKM on synthetic and real-world datasets. The results show that EKM performs competitively on balanced data and significantly outperforms benchmark algorithms on imbalanced data. Deep clustering experiments demonstrate that EKM is a better alternative to HKM and FKM on imbalanced data because it yields more discriminative representations. Additionally, we reformulate HKM, FKM, and EKM in a general gradient-descent form and demonstrate how this form facilitates a unified study of K-means algorithms.
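To make the repulsion mechanism concrete, the following is a minimal sketch of one way a Boltzmann-operator objective can be turned into a two-step update. It is not taken from the abstract and rests on several assumptions: the distance is d_kn = 0.5 * ||x_n - c_k||^2, the objective is the sum over data points of the Boltzmann operator (a softmax(-alpha * d)-weighted average) of those distances, and the centroid update is the weighted mean obtained by holding the weights fixed. The names `boltzmann`, `ekm_weights`, `ekm_step`, and the parameter `alpha` are hypothetical.

```python
import math

def softmin_probs(dists, alpha):
    """softmax(-alpha * d), shifted by min(d) for numerical stability."""
    m = min(dists)
    exps = [math.exp(-alpha * (d - m)) for d in dists]
    z = sum(exps)
    return [e / z for e in exps]

def boltzmann(dists, alpha):
    """Boltzmann operator: a smooth minimum of dists for alpha > 0."""
    probs = softmin_probs(dists, alpha)
    return sum(d * p for d, p in zip(dists, probs))

def ekm_weights(dists, alpha):
    """Partial derivatives of boltzmann(dists) with respect to each distance.

    A weight is negative when its distance exceeds the Boltzmann average
    by more than 1/alpha; such a point pushes that centroid away.
    """
    probs = softmin_probs(dists, alpha)
    b = sum(d * p for d, p in zip(dists, probs))
    return [p * (1.0 - alpha * (d - b)) for d, p in zip(dists, probs)]

def half_sq_dist(x, c):
    """Assumed distance: half the squared Euclidean distance."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, c))

def ekm_step(points, centroids, alpha):
    """One iteration: (1) compute weights, (2) move centroids to weighted means."""
    k_count, dim = len(centroids), len(points[0])
    num = [[0.0] * dim for _ in range(k_count)]
    den = [0.0] * k_count
    for x in points:
        w = ekm_weights([half_sq_dist(x, c) for c in centroids], alpha)
        for k in range(k_count):
            den[k] += w[k]
            for j in range(dim):
                num[k][j] += w[k] * x[j]
    new_centroids = []
    for k in range(k_count):
        if abs(den[k]) < 1e-12:
            # Assumption: keep the old centroid when the weights cancel out.
            new_centroids.append(list(centroids[k]))
        else:
            new_centroids.append([v / den[k] for v in num[k]])
    return new_centroids
```

In this sketch the weights of each data point sum to one across centroids, as FKM memberships do, but individual weights may be negative, which is where the repulsion comes from. With `alpha = 0` every weight equals 1/K, and as `alpha` grows the weights approach a hard nearest-centroid assignment.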