Uniform sampling is a highly efficient method for data summarization. However, its effectiveness in producing coresets for clustering problems is not yet well understood, primarily because it generally does not yield a strong coreset, which is the prevailing notion in the literature. We formulate \emph{stable coresets}, a notion intermediate between the standard notions of weak and strong coresets, which combines the broad applicability of strong coresets with the highly efficient uniform-sampling constructions available for weak coresets. Our main result is that a uniform sample of size $O(\varepsilon^{-2}\log d)$ yields, with high constant probability, a stable coreset for $1$-median in $\mathbb{R}^d$ under the $\ell_1$ metric. We then leverage the properties of stable coresets to derive new coreset constructions, all through uniform sampling, for $\ell_1$ and related metrics such as Kendall-tau and Jaccard. We also show applications to fair rank aggregation and to approximation algorithms for the $k$-median problem in these metric spaces. Our experiments validate the benefits of stable coresets in practice, in terms of both construction time and approximation quality.
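As a rough illustration of the uniform-sampling setting described above, the following Python sketch draws a uniform sample whose size scales as $\varepsilon^{-2}\log d$ and compares the $\ell_1$ $1$-median cost of the sample's optimal center against the optimum on the full data. It does not implement or verify the stable coreset definition, which is not stated here; it only checks the weaker property that a center computed on the sample is near-optimal for the full set. The function names, the constant `c`, the clamp on `d`, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import math
import random
import statistics


def l1_cost(points, center):
    """Sum of l1 distances from every point to `center`."""
    return sum(sum(abs(x - c) for x, c in zip(p, center)) for p in points)


def l1_median(points):
    """Coordinate-wise median, which minimizes the l1 1-median cost."""
    return [statistics.median(coord) for coord in zip(*points)]


def uniform_sample(points, eps, c=1.0, seed=0):
    """Draw ceil(c * eps^-2 * log d) points uniformly, with replacement.

    The constant `c` is a placeholder (assumption), not a value from the
    source; d is clamped to at least 2 so the size stays positive.
    """
    d = len(points[0])
    m = max(1, math.ceil(c * math.log(max(d, 2)) / eps ** 2))
    rng = random.Random(seed)
    return [rng.choice(points) for _ in range(m)]


# Synthetic data (assumption): 5000 Gaussian points in R^8.
rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5000)]

sample = uniform_sample(data, eps=0.1)

# Cost on the FULL data of the optimal center vs. the sample's center.
opt = l1_cost(data, l1_median(data))
approx = l1_cost(data, l1_median(sample))
ratio = approx / opt
print(f"sample size = {len(sample)}, cost ratio = {ratio:.4f}")
```

Note that the sample size here does not depend on the number of input points, only on `eps` and the dimension, which is what makes uniform sampling attractive in terms of construction time.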