Constructing small-sized coresets for various clustering problems in different metric spaces has attracted significant attention over the past decade. A central problem in the coreset literature is to understand the best possible coreset size for $(k,z)$-clustering in Euclidean space. While there has been significant progress on the problem, there is still a gap between the state-of-the-art upper and lower bounds. For instance, the best known upper bound for $k$-means ($z=2$) is $\min \{O(k^{3/2} \varepsilon^{-2}),O(k \varepsilon^{-4})\}$ [1,2], while the best known lower bound is $\Omega(k\varepsilon^{-2})$ [1]. In this paper, we make significant progress on both upper and lower bounds. For a large range of parameters (i.e., $\varepsilon, k$), we have a complete understanding of the optimal coreset size. In particular, we obtain the following results: (1) We present a new coreset lower bound $\Omega(k \varepsilon^{-z-2})$ for Euclidean $(k,z)$-clustering when $\varepsilon \geq \Omega(k^{-1/(z+2)})$. In view of the prior upper bound $\tilde{O}_z(k \varepsilon^{-z-2})$ [1], the bound is optimal. The new lower bound also implies improved lower bounds for $(k,z)$-clustering in doubling metrics. (2) For the upper bound, we provide efficient coreset construction algorithms for $(k,z)$-clustering with improved or optimal coreset sizes in several metric spaces. In particular, we provide an $\tilde{O}_z(k^{\frac{2z+2}{z+2}} \varepsilon^{-2})$-sized coreset, with a unified analysis, for $(k,z)$-clustering for all $z\geq 1$ in Euclidean space.
[1] Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC'22.
[2] Cohen-Addad, Larsen, Saulpic, Schwiegelshohn, Sheikh-Omar. NeurIPS'22.
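As a quick consistency check on the bounds quoted in the abstract, the following sketch compares the two upper bounds while ignoring constants and the polylogarithmic factors hidden by the $\tilde{O}_z$ notation (that simplification is an assumption of this sketch, not a statement from the paper). It shows that setting $z=2$ in the new upper bound recovers the $k^{3/2}\varepsilon^{-2}$ rate for $k$-means, and that the new upper bound and $k\varepsilon^{-z-2}$ cross at the same threshold $\varepsilon \approx k^{-1/(z+2)}$ that appears in the lower-bound condition.

```latex
% Exponent of k in the new upper bound at z = 2 (k-means):
\[
  \left.\frac{2z+2}{z+2}\right|_{z=2} = \frac{6}{4} = \frac{3}{2}.
\]
% When is the new upper bound no larger than k * eps^{-z-2}?
\begin{align*}
  k^{\frac{2z+2}{z+2}}\,\varepsilon^{-2} \le k\,\varepsilon^{-z-2}
  &\iff k^{\frac{2z+2}{z+2}-1} \le \varepsilon^{-z} \\
  &\iff k^{\frac{z}{z+2}} \le \varepsilon^{-z} \\
  &\iff \varepsilon \le k^{-\frac{1}{z+2}}.
\end{align*}
```

Under these simplifications, $k\varepsilon^{-z-2}$ is the smaller of the two upper bounds when $\varepsilon \geq k^{-1/(z+2)}$, which is the regime where the new lower bound applies, and $k^{\frac{2z+2}{z+2}}\varepsilon^{-2}$ is the smaller one for smaller $\varepsilon$.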