The $k$-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is a classical and time-tested algorithm for the $k$-means problem. Besides being very practical, the algorithm has good theoretical guarantees: its solution is $O(\log k)$-approximate in expectation. In recent work, Bhattacharya, Eube, Roglin, and Schmidt [ESA 2020] considered the following question: does the algorithm retain its guarantees if we allow slight adversarial noise in the sampling probability distributions it uses? This is motivated, for example, by the fact that computations with real numbers in $k$-means++ implementations are inexact. Surprisingly, the analysis in this setting becomes substantially more difficult, and the authors were able to prove only a weaker approximation guarantee of $O(\log^2 k)$. In this paper, we close the gap by providing a tight $O(\log k)$-approximation guarantee for the $k$-means++ algorithm with noise.
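To make the setting concrete, here is a minimal Python sketch of $k$-means++ seeding in which each sampling weight is perturbed before a center is drawn. It is an illustration under stated assumptions, not the construction analyzed in the paper: the noise is modeled as a multiplicative factor in $[1-\varepsilon, 1+\varepsilon]$ applied to each point's squared distance, and it is drawn at random here, whereas the abstract's question concerns noise chosen by an adversary. The names `noisy_kmeanspp_seeding`, `sq_dist`, `cost`, and the parameter `eps` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, centers):
    """k-means cost: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def noisy_kmeanspp_seeding(points, k, eps=0.1, rng=None):
    """Pick k centers by D^2 sampling with perturbed weights.

    Assumption: noise is a multiplicative factor in [1 - eps, 1 + eps] per
    point. It is random here; an adversary could pick any factor in that
    range instead.
    """
    if not 0 <= eps < 1:
        raise ValueError("eps must be in [0, 1)")
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = rng or random.Random()

    # First center: uniformly at random (left unperturbed in this sketch).
    centers = [points[rng.randrange(len(points))]]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # Exact k-means++ samples proportionally to d2; here every weight
        # is first distorted by its own noise factor.
        weights = [d * rng.uniform(1 - eps, 1 + eps) for d in d2]
        if sum(weights) == 0:
            # Every point coincides with a chosen center: sample uniformly.
            idx = rng.randrange(len(points))
        else:
            idx = rng.choices(range(len(points)), weights=weights, k=1)[0]
        new_center = points[idx]
        centers.append(new_center)
        d2 = [min(old, sq_dist(p, new_center)) for old, p in zip(d2, points)]
    return centers
```

Setting `eps=0` recovers ordinary $D^2$ sampling. Passing a seeded `random.Random` instance as `rng` makes a run reproducible, and `cost(points, centers)` gives the $k$-means objective of the resulting seeding.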