We propose a new algorithm for k-means clustering in a distributed setting, where the data is spread across many machines and a coordinator communicates with these machines to compute the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases 1-4 rounds suffice. Compared with the popular k-means|| algorithm, our approach can exploit a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means||, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means||. Code for running the algorithm and experiments is available at https://github.com/selotape/distributed_k_means.
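To make the quantity being compared concrete, the following minimal sketch shows how the standard k-means cost (the sum of squared distances from each point to its nearest center) can be evaluated when the data is split across machines. It is an illustration only, not the proposed algorithm: the names `sq_dist`, `local_cost`, and `distributed_cost` are hypothetical, the machines are simulated as in-memory lists, and a single round is assumed in which the coordinator broadcasts the centers and each machine returns one number.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def local_cost(points: List[Point], centers: List[Point]) -> float:
    """Cost contribution of one machine's shard (hypothetical helper):
    each point pays the squared distance to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def distributed_cost(shards: List[List[Point]], centers: List[Point]) -> float:
    """Coordinator side (simulated): the centers are sent to every machine,
    each machine reports a single scalar, and the coordinator adds them up."""
    return sum(local_cost(shard, centers) for shard in shards)


# Two simulated machines, each holding two 2-D points.
shards = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
centers = [(0.0, 0.0), (10.0, 10.0)]
total = distributed_cost(shards, centers)
```

Because the cost is a sum over points, it decomposes exactly over machines, so evaluating a candidate set of centers needs only one scalar from each machine.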