The $k$-means problem is a classic objective for modeling clustering in a metric space. Given a set of points in a metric space, the goal is to find $k$ representative points so as to minimize the sum of the squared distances from each point to its closest representative. In this work, we study the approximability of $k$-means in Euclidean spaces, parameterized by the number of clusters $k$. In seminal works, de la Vega, Karpinski, Kenyon, and Rabani [STOC'03] and Kumar, Sabharwal, and Sen [JACM'10] showed how to obtain a $(1+\varepsilon)$-approximation for high-dimensional Euclidean $k$-means in time $2^{(k/\varepsilon)^{O(1)}} \cdot dn^{O(1)}$. We introduce a new fine-grained hypothesis, the Exponential Time for Expanders Hypothesis (XXH), which roughly asserts that there are no non-trivial exponential-time approximation algorithms for the vertex cover problem on near-perfect vertex expanders. Assuming XXH, we close the above long line of work on approximating Euclidean $k$-means by showing that no algorithm running in time $2^{(k/\varepsilon)^{1-o(1)}} \cdot n^{O(1)}$ achieves a $(1+\varepsilon)$-approximation for $k$-means in Euclidean space. This lower bound is tight, as it matches the algorithm of Feldman, Monemizadeh, and Sohler [SoCG'07], whose runtime is $2^{\tilde{O}(k/\varepsilon)} + O(ndk)$. Furthermore, assuming XXH, we show that the seminal $O(n^{kd+1})$-time exact algorithm of Inaba, Katoh, and Imai [SoCG'94] for $k$-means is optimal for small values of $k$.
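To make the objective concrete, the following is a minimal sketch of the $k$-means cost in Euclidean space: the sum, over all input points, of the squared distance to the closest of the chosen representatives. The function names `squared_distance` and `kmeans_cost` and the sample data are illustrative assumptions, not taken from the paper. The sketch only evaluates the cost of a given set of centers; it does not implement any of the algorithms cited above.

```python
from typing import Sequence

# A point is a sequence of coordinates; all points are assumed to share one dimension.
Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical data: two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]  # one center per pair
one_center = [(5.0, 1.0)]                # a single center for all points

cost_two = kmeans_cost(points, two_centers)
cost_one = kmeans_cost(points, one_center)
```

In these terms, a $(1+\varepsilon)$-approximation is a set of $k$ centers whose cost is at most $(1+\varepsilon)$ times the minimum cost achievable by any $k$ centers.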