Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $\Omega$ is a small, weighted summary that preserves the cost of every candidate solution $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space, the cost of a solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A widely used method for coreset construction, both in theory and in practice, is Sensitivity Sampling, in which points are sampled with probability proportional to their importance. We show that Sensitivity Sampling yields optimal coresets of size $\tilde{O}\left(\frac{k}{\varepsilon^{2}}\cdot\min(\sqrt{k},\varepsilon^{-2})\right)$ for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with $\Omega(1)$ cost stability, Sensitivity Sampling gives coresets of size $\tilde{O}(k/\varepsilon^2)$, improving over the worst-case lower bound. Notably, Sensitivity Sampling does not need to know the cost stability in order to exploit it: it is appropriately sensitive to the clusterability of the data set while being oblivious to it. We also show that, for stable instances, any coreset consisting only of input points must have size $\Omega(k/\varepsilon^2)$. Our results for Sensitivity Sampling also extend to the $k$-median problem and to more general metric spaces.
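To make the sampling step concrete, here is a minimal sketch of Sensitivity Sampling for $k$-means in Python with NumPy. It is an illustration under stated assumptions, not the construction analyzed in the paper: the abstract does not give the exact sampling distribution, so the score $\mathrm{cost}(p,A)/\mathrm{cost}(C_p,A) + 1/|C_p|$ used below (with $A$ an approximate solution and $C_p$ the cluster of $p$ in $A$) is one commonly used upper bound on sensitivity. The names `kmeans_cost`, `sensitivity_sample`, `approx_centers`, and `m` are hypothetical.

```python
import numpy as np


def kmeans_cost(points, centers, weights=None):
    """(Weighted) k-means cost: sum_p w_p * min_s ||p - s||^2."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.min(axis=1)
    if weights is None:
        return float(nearest.sum())
    return float((weights * nearest).sum())


def sensitivity_sample(points, approx_centers, m, rng):
    """Draw m weighted points with probability proportional to a sensitivity score.

    Assumption: approx_centers is some constant-factor approximate solution A,
    and the score is cost(p, A) / cost(C_p, A) + 1 / |C_p|.
    """
    k = len(approx_centers)
    d2 = ((points[:, None, :] - approx_centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)                      # cluster C_p of each point
    point_cost = d2[np.arange(len(points)), assign]

    cluster_cost = np.bincount(assign, weights=point_cost, minlength=k)
    cluster_size = np.bincount(assign, minlength=k)

    # Guard against 0/0 for clusters whose points all coincide with the center.
    safe_cost = np.where(cluster_cost > 0, cluster_cost, 1.0)
    score = point_cost / safe_cost[assign] + 1.0 / cluster_size[assign]
    prob = score / score.sum()

    # Sample with replacement; inverse-probability weights keep the
    # weighted cost an unbiased estimate of the full cost.
    idx = rng.choice(len(points), size=m, replace=True, p=prob)
    weights = 1.0 / (m * prob[idx])
    return points[idx], weights
```

Calling `sensitivity_sample(points, approx_centers, m, np.random.default_rng(0))` returns the sampled points and their weights; `kmeans_cost(coreset, S, weights)` can then be compared against `kmeans_cost(points, S)` for any candidate solution `S`. How large `m` must be for a $(1\pm\varepsilon)$ guarantee is exactly what the bounds above address, and the sketch makes no attempt to choose it.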