We introduce the breathing k-means algorithm, which significantly improves upon the widely known greedy k-means++ algorithm, the default method for k-means clustering in the scikit-learn package. Our approach improves solutions obtained by greedy k-means++ through a novel 'breathing' technique that cyclically increases and decreases the number of centroids based on local error and utility measures. We conducted experiments using greedy k-means++ as a baseline, comparing it with breathing k-means and five other k-means algorithms. Among the methods investigated, only breathing k-means and better k-means++ consistently outperformed the baseline, with breathing k-means demonstrating a substantial lead. This lead held even when the best result of ten runs of each other algorithm was compared with a single run of breathing k-means, demonstrating its effectiveness and speed. Our findings indicate that breathing k-means outperforms the other k-means techniques, in particular greedy k-means++ with ten repetitions, which it dominates in both solution quality and speed. This positions breathing k-means as a full replacement for greedy k-means++.
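To make the 'breathing' idea concrete, the following pure-Python sketch shows one plausible reading of the cycle described above. It is not the authors' reference implementation, and the abstract does not specify these details. All function names, the step size `m`, and the stopping rule are assumptions made for illustration. The sketch also assumes plain random initialization rather than greedy k-means++, places each new centroid at a random member of a high-error cluster, and defines the utility of a centroid as the error increase that would result from removing it.

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points]

def lloyd(points, centroids, iters=20):
    """Plain Lloyd iterations; an empty cluster keeps its old centroid."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iters):
        labels = assign(points, centroids)
        new = []
        for j, c in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new.append(c)
        if new == centroids:
            break
        centroids = new
    return centroids

def total_error(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sqdist(p, c) for c in centroids) for p in points)

def utilities(points, centroids):
    """Assumed utility: error increase if a centroid were removed."""
    util = [0.0] * len(centroids)
    for p in points:
        d = sorted((sqdist(p, c), j) for j, c in enumerate(centroids))
        util[d[0][1]] += d[1][0] - d[0][0]  # second-nearest minus nearest
    return util

def breathing_kmeans(points, k, m=2, seed=0):
    """Hypothetical sketch: add m centroids, remove m centroids, repeat."""
    m = min(m, k)
    rng = random.Random(seed)
    # Assumption: random initial centroids instead of greedy k-means++.
    best = lloyd(points, rng.sample(points, k))
    best_err = total_error(points, best)
    while m > 0:
        # Breathe in: add centroids inside the m clusters with largest local error.
        labels = assign(points, best)
        err = [0.0] * k
        for p, l in zip(points, labels):
            err[l] += sqdist(p, best[l])
        worst = sorted(range(k), key=lambda j: -err[j])[:m]
        seeds = []
        for j in worst:
            members = [p for p, l in zip(points, labels) if l == j]
            seeds.append(rng.choice(members) if members else best[j])
        grown = lloyd(points, best + seeds)
        # Breathe out: drop the m centroids with the lowest utility.
        util = utilities(points, grown)
        drop = set(sorted(range(len(grown)), key=lambda j: util[j])[:m])
        shrunk = lloyd(points, [c for j, c in enumerate(grown) if j not in drop])
        new_err = total_error(points, shrunk)
        if new_err < best_err * (1 - 1e-9):
            best, best_err = shrunk, new_err  # keep the improved solution
        else:
            m -= 1  # no gain: breathe less deeply next time
    return best, best_err
```

Calling `breathing_kmeans(points, k)` on a list of coordinate tuples returns `k` centroids and their summed squared error. Because a cycle is accepted only when it lowers the error, the result under these assumptions is never worse than the initial Lloyd solution it starts from.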