Imbalanced data, characterized by an unequal distribution of data points across different clusters, poses a challenge for traditional hard and fuzzy clustering algorithms, such as hard K-means (HKM, or Lloyd's algorithm) and fuzzy K-means (FKM, or Bezdek's algorithm). This paper introduces equilibrium K-means (EKM), a novel and simple K-means-type algorithm that alternates between just two steps, yielding significantly improved clustering results for imbalanced data by reducing the tendency of centroids to crowd together in the center of large clusters. We also present a unifying perspective for HKM, FKM, and EKM, showing they are essentially gradient descent algorithms with an explicit relationship to Newton's method. EKM has the same time and space complexity as FKM but offers a clearer physical meaning for its membership definition. We illustrate the performance of EKM on two synthetic and ten real datasets, comparing it to various clustering algorithms, including HKM, FKM, maximum-entropy fuzzy clustering, two FKM variations designed for imbalanced data, and the Gaussian mixture model. The results demonstrate that EKM performs competitively on balanced data while significantly outperforming other techniques on imbalanced data. For high-dimensional data clustering, we demonstrate that a more discriminative representation can be obtained by mapping high-dimensional data via deep neural networks into a low-dimensional, EKM-friendly space. Deep clustering with EKM improves clustering accuracy by 35% on an imbalanced dataset derived from MNIST compared to deep clustering based on HKM.
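The two-step alternation shared by K-means-type algorithms can be sketched as follows. This is a minimal illustration, not the paper's implementation: it covers only the textbook HKM (Lloyd) and FKM (Bezdek) weightings, because the abstract does not state EKM's membership formula. The names `kmeans_type`, `weight_fn`, `hard_weights`, and `fuzzy_weights` are hypothetical, and the fuzzifier `m = 2`, the synthetic one-dimensional data, and the initial centroids are assumptions chosen for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def hard_weights(dists):
    """HKM (Lloyd): weight 1 for the nearest centroid, 0 for all others."""
    nearest = min(range(len(dists)), key=lambda k: dists[k])
    return [1.0 if k == nearest else 0.0 for k in range(len(dists))]


def fuzzy_weights(dists, m=2.0, eps=1e-12):
    """FKM (Bezdek): membership u_k proportional to d_k^(-1/(m-1)); weight is u_k^m."""
    inv = [(d + eps) ** (-1.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [(v / total) ** m for v in inv]


def kmeans_type(points, init_centroids, weight_fn, n_iter=100, tol=1e-12):
    """Alternate between (1) computing weights and (2) updating centroids.

    weight_fn is a hypothetical hook: it maps one point's list of squared
    distances to the centroids onto one non-negative weight per centroid.
    """
    centroids = [list(c) for c in init_centroids]
    dim = len(points[0])
    for _ in range(n_iter):
        # Step 1: weights of every point with respect to every centroid.
        weights = [weight_fn([sq_dist(p, c) for c in centroids]) for p in points]
        # Step 2: each centroid becomes the weighted mean of all points.
        shift = 0.0
        for k in range(len(centroids)):
            wsum = sum(w[k] for w in weights)
            if wsum == 0.0:
                continue  # empty cluster: leave its centroid where it is
            new = [
                sum(w[k] * p[j] for w, p in zip(weights, points)) / wsum
                for j in range(dim)
            ]
            shift = max(shift, sq_dist(new, centroids[k]))
            centroids[k] = new
        if shift < tol:
            break  # centroids have stopped moving
    return centroids


# Assumed imbalanced toy data: 500 points near 0 and 20 points near 6.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0),) for _ in range(500)]
data += [(rng.gauss(6.0, 1.0),) for _ in range(20)]
init = [(-1.0,), (7.0,)]
print("HKM centroids:", kmeans_type(data, init, hard_weights))
print("FKM centroids:", kmeans_type(data, init, fuzzy_weights))
```

Under this framing, a different K-means-type algorithm would be obtained by passing a different `weight_fn`, while the centroid update stays the same. Comparing the printed centroids with the true cluster centers (0 and 6 in this toy setup) is one way to inspect how far each weighting lets the second centroid drift toward the large cluster.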