The K-Modes algorithm, developed for clustering categorical data, is algorithmically simple, but its clustering quality and efficiency are unreliable because both depend heavily on the choice of initial cluster centers. In this paper, we investigate Bisecting K-Modes (BK-Modes), a successive bisecting process for finding clusters, and examine how well the cluster centers it produces serve as initial centers for K-Modes. BK-Modes splits a dataset into multiple clusters iteratively: in each iteration, one cluster is chosen and bisected into two. The cluster to bisect is selected using the sum of distances from its data points to its cluster center. The process stops when K clusters have been produced, and the centers of these K clusters are then used as the initial cluster centers for K-Modes. We evaluated BK-Modes experimentally, comparing it against K-Modes run with multiple sets of initial cluster centers and against the best existing methods found in our survey. The results indicate that BK-Modes performs well in both clustering quality and efficiency on large datasets.
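To make the bisecting process concrete, the following is a minimal sketch of how such an initialization could look, not the authors' implementation. It rests on several assumptions that the abstract does not state: the distance is the simple-matching (Hamming) dissimilarity, a cluster center is the per-attribute mode, the cluster with the largest sum of distances is the one chosen for bisection, and each bisection is a small 2-Modes loop seeded with two distinct records drawn at random. All function names (`bk_modes_init`, `bisect_cluster`, and so on) are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(rows):
    """Per-attribute most frequent category, used here as the cluster center."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def cluster_cost(rows):
    """Sum of distances from each record to the mode of its cluster."""
    center = mode_of(rows)
    return sum(hamming(r, center) for r in rows)


def bisect_cluster(rows, rng, max_iter=20):
    """Split one cluster in two with a small 2-Modes loop (assumed procedure).

    Requires at least two distinct records in `rows`.
    """
    distinct = list(dict.fromkeys(rows))   # unique records, order preserved
    centers = rng.sample(distinct, 2)      # two distinct seed records
    best = None
    for _ in range(max_iter):
        parts = ([], [])
        for r in rows:
            # Assign each record to the nearer center; ties go to the first.
            nearer = 0 if hamming(r, centers[0]) <= hamming(r, centers[1]) else 1
            parts[nearer].append(r)
        if not parts[0] or not parts[1]:
            break                          # degenerate split: keep the previous one
        best = parts
        new_centers = [mode_of(parts[0]), mode_of(parts[1])]
        if new_centers == centers:
            break                          # centers stopped changing
        centers = new_centers
    return best


def bk_modes_init(data, k, seed=0):
    """Return up to k initial centers produced by successive bisection."""
    rng = random.Random(seed)
    clusters = [[tuple(r) for r in data]]
    while len(clusters) < k:
        # Only clusters with at least two distinct records can be split.
        splittable = [i for i, c in enumerate(clusters) if len(set(c)) > 1]
        if not splittable:
            break
        # Assumption: bisect the cluster with the largest sum of distances.
        worst = max(splittable, key=lambda i: cluster_cost(clusters[i]))
        left, right = bisect_cluster(clusters.pop(worst), rng)
        clusters += [left, right]
    return [mode_of(c) for c in clusters]
```

The returned centers would then be handed to a standard K-Modes run as its starting modes. Fewer than `k` centers are returned only when the data contain fewer than `k` distinct records.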