Designing small \emph{coresets}, which approximately preserve the costs of solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We introduce a general class of assignment constraints, including capacity constraints on cluster centers and assignment structure constraints for data points (modeled by a convex body $\mathcal{B}$). We give coresets for clustering problems with such general assignment constraints that significantly generalize and improve known results. Notable implications include the first $\epsilon$-coreset for capacitated and fair $k$-Median with $m$ outliers in Euclidean spaces, of size $\tilde{O}(m + k^2 \epsilon^{-4})$, which generalizes and improves upon the prior bounds in [Braverman et al., FOCS '22; Huang et al., ICLR '23]: for capacitated $k$-Median, the coreset size bound obtained in [Braverman et al., FOCS '22] is $\tilde{O}(k^3 \epsilon^{-6})$, and for $k$-Median with $m$ outliers, the coreset size bound obtained in [Huang et al., ICLR '23] is $\tilde{O}(m + k^3 \epsilon^{-5})$. We also obtain the first $\epsilon$-coreset of size $\mathrm{poly}(k \epsilon^{-1})$ for fault-tolerant clustering for various types of metric spaces.