We introduce the breathing k-means algorithm, which on average significantly improves on the solutions obtained by the widely known greedy k-means++ algorithm, the default method for k-means clustering in the scikit-learn package. The improvements come from a novel ``breathing'' technique that cyclically increases and decreases the number of centroids based on local error and utility measures. In our experiments we used greedy k-means++ as the baseline and compared it with breathing k-means and five other k-means algorithms. Among the methods investigated, only breathing k-means and better k-means++ consistently outperformed the baseline, with breathing k-means holding a substantial lead. This lead persisted even when the best result of ten runs of each other algorithm was compared with a single run of breathing k-means, which highlights both its effectiveness and its speed. In particular, breathing k-means dominates greedy k-means++ with ten repetitions in both solution quality and speed. This positions breathing k-means (with its built-in initialization by a single run of greedy k-means++) as a superior alternative to running greedy k-means++ on its own.
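To make the baseline concrete, the following is a minimal sketch of the "single run versus best of ten runs" comparison for greedy k-means++, using scikit-learn's `KMeans` with `init="k-means++"`. It covers only the baseline, not breathing k-means itself. The synthetic data set, the choice of k = 12, the seeds, and the helper name `greedy_kmeanspp_sse` are assumptions made for illustration and are not taken from the experiments reported here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: an illustrative assumption, not one of the paper's test problems.
X, _ = make_blobs(n_samples=600, centers=12, cluster_std=0.6, random_state=0)


def greedy_kmeanspp_sse(X, k, seed):
    """Run k-means once with k-means++ initialization and return its SSE.

    Hypothetical helper for this sketch. scikit-learn exposes the summed
    squared error of the fitted solution as `inertia_`.
    """
    model = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=seed)
    model.fit(X)
    return float(model.inertia_)


# Ten independent runs, each with its own seed.
sse_per_run = [greedy_kmeanspp_sse(X, k=12, seed=s) for s in range(10)]

single_run = sse_per_run[0]     # what one run of the baseline gives
best_of_ten = min(sse_per_run)  # what ten repetitions of the baseline give

print(f"single run SSE:  {single_run:.2f}")
print(f"best of ten SSE: {best_of_ten:.2f}")
```

Taking the minimum over repetitions can only match or lower the error of a single run, at roughly ten times the cost. That trade-off is the one against which a single run of breathing k-means is compared above.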