The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioner's algorithm of choice for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher-quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps, obtained through the $k$-means++ sampling distribution, to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, thereby allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, our approach also yields substantial practical improvements: we show significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
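To make the baseline concrete, the following is a minimal Python sketch of $k$-means++ seeding followed by single-swap local search steps in the spirit of Lattanzi and Sohler. It is an illustration under stated assumptions, not the multi-swap algorithm of this paper: all function names (`kmeanspp_seed`, `local_search_step`, and so on) are hypothetical, the input `X` is assumed to be a float NumPy array of shape `(n, d)`, and costs are recomputed from scratch for readability rather than efficiency.

```python
import numpy as np

def sq_dists(X, C):
    # Squared Euclidean distance from every point to every center, shape (n, k).
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def cost(X, C):
    # k-means objective: sum of squared distances to the nearest center.
    return sq_dists(X, C).min(axis=1).sum()

def d2_sample(X, C, rng):
    # Sample a point index with probability proportional to its squared
    # distance to the nearest current center (the k-means++ distribution).
    d = sq_dists(X, C).min(axis=1)
    total = d.sum()
    if total == 0.0:
        # Every point coincides with a center; fall back to uniform sampling.
        return rng.integers(len(X))
    return rng.choice(len(X), p=d / total)

def kmeanspp_seed(X, k, rng):
    # First center uniformly at random, the remaining k - 1 by D^2 sampling.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        centers.append(X[d2_sample(X, np.array(centers), rng)])
    return np.array(centers)

def local_search_step(X, C, rng):
    # Single-swap step: sample one candidate by D^2 sampling, try swapping it
    # with each current center, and keep the best swap only if it lowers the cost.
    p = X[d2_sample(X, C, rng)]
    best_C, best_cost = C, cost(X, C)
    for i in range(len(C)):
        cand = C.copy()
        cand[i] = p
        c = cost(X, cand)
        if c < best_cost:
            best_C, best_cost = cand, c
    return best_C

def kmeanspp_with_local_search(X, k, steps, seed=0):
    # Assumption: `steps` is chosen by the caller, e.g. on the order of k log log k.
    rng = np.random.default_rng(seed)
    C = kmeanspp_seed(X, k, rng)
    for _ in range(steps):
        C = local_search_step(X, C, rng)
    return C
```

Because a swap is accepted only when it strictly decreases the objective, the cost is non-increasing over the local search steps. A multi-swap variant would instead sample several candidates per step and consider exchanging several centers at once, which enlarges the neighborhood at a higher per-step cost.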