The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. The global $k$-means is a deterministic algorithm proposed to tackle the random initialization problem of $k$-means, but it is well known to incur a high computational cost. It partitions the data into $K$ clusters by solving all $k$-means sub-problems incrementally for $k=1,\ldots,K$. For each $k$-cluster problem, the method executes the $k$-means algorithm $N$ times, where $N$ is the number of data points. In this paper, we propose the \emph{global $k$-means\texttt{++}} clustering algorithm, which is an effective way of acquiring clustering solutions of a quality akin to those of global $k$-means with a reduced computational load. This is achieved by exploiting the center selection probability used in the $k$-means\texttt{++} algorithm. The proposed method has been tested and compared on various benchmark datasets, yielding very satisfactory results in terms of clustering quality and execution speed.
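The incremental scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `lloyd` and `incremental_kmeans_pp`, the parameter `n_candidates`, and the choice of ten candidates per step are assumptions made only for illustration, since the abstract does not state how many candidates are drawn. The sketch solves the $k=1$ problem with the data mean and then, for each larger $k$, draws a few candidate positions for the new center with probability proportional to the squared distance to the nearest existing center (the $k$-means\texttt{++} selection rule), runs $k$-means from each candidate, and keeps the solution with the lowest error. Global $k$-means would instead try all $N$ data points as candidates.

```python
import numpy as np


def lloyd(X, centers, n_iter=100, tol=1e-8):
    """Run plain k-means (Lloyd) iterations from the given initial centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Squared distances between every point and every center, shape (N, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.min(axis=1).sum()


def incremental_kmeans_pp(X, K, n_candidates=10, seed=0):
    """Hypothetical sketch: incremental clustering with k-means++ candidates.

    n_candidates is an assumed parameter, not a value taken from the paper.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X.mean(axis=0, keepdims=True)  # the k = 1 solution is the mean
    errors = [((X - centers) ** 2).sum()]
    for k in range(2, K + 1):
        # Squared distance of each point to its nearest existing center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        if total == 0:
            raise ValueError("fewer distinct points than requested clusters")
        probs = d2 / total  # k-means++ center selection probability
        n_pick = min(n_candidates, np.count_nonzero(probs))
        candidates = rng.choice(len(X), size=n_pick, replace=False, p=probs)
        best_centers, best_err = None, np.inf
        for i in candidates:
            # Previous k-1 centers plus one candidate as the new k-th center.
            init = np.vstack([centers, X[i]])
            c, err = lloyd(X, init)
            if err < best_err:
                best_centers, best_err = c, err
        centers = best_centers
        errors.append(best_err)
    return centers, errors
```

Calling `incremental_kmeans_pp(X, K)` returns the final centers together with the clustering error for every $k=1,\ldots,K$, so the intermediate solutions come at no extra cost. Because each step starts from the previous solution plus one new center, the error in this sketch never increases as $k$ grows.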