We design replicable algorithms for statistical clustering under the notion of replicability recently introduced by Impagliazzo et al. [2022]. Under this definition, a clustering algorithm is replicable if, with high probability, its output induces the exact same partition of the sample space after two executions on different inputs drawn from the same distribution, when its internal randomness is shared across the executions. We propose such algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by using approximation routines for their combinatorial counterparts in a black-box manner. In particular, we demonstrate a replicable $O(1)$-approximation algorithm for statistical Euclidean $k$-medians ($k$-means) with $\operatorname{poly}(d)$ sample complexity. We also describe an $O(1)$-approximation algorithm with an additional $O(1)$-additive error for statistical Euclidean $k$-centers, albeit with $\exp(d)$ sample complexity. In addition, we provide experiments on synthetic distributions in 2D, using the $k$-means++ implementation from sklearn as a black box, that validate our theoretical results.
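To make the replicability definition concrete, the following is a minimal sketch of how one might test empirically whether two executions induce the same partition. It is not the algorithm proposed in this work. It assumes a plain Lloyd-style $k$-means as a stand-in clustering routine, a hypothetical two-Gaussian data source `draw_sample`, and a finite set of probe points standing in for the sample space. All function names are illustrative, and plain $k$-means carries no replicability guarantee.

```python
import random
from math import dist


def draw_sample(n, rng):
    """Hypothetical data source: an equal mixture of two 2D Gaussians."""
    means = [(0.0, 0.0), (10.0, 10.0)]
    out = []
    for _ in range(n):
        mx, my = rng.choice(means)
        out.append((rng.gauss(mx, 0.5), rng.gauss(my, 0.5)))
    return out


def lloyd_kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations; `rng` carries the algorithm's internal randomness."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[j].append(p)
        # Recompute each center as the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers


def induced_partition(centers, probes):
    """Group probe indices by nearest center; labels are dropped, only the partition is kept."""
    groups = {}
    for idx, p in enumerate(probes):
        j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        groups.setdefault(j, set()).add(idx)
    return frozenset(frozenset(g) for g in groups.values())


def same_partition(sample_a, sample_b, k, seed, probes):
    """Run the algorithm on two samples with shared internal randomness and compare partitions."""
    centers_a = lloyd_kmeans(sample_a, k, random.Random(seed))
    centers_b = lloyd_kmeans(sample_b, k, random.Random(seed))
    return induced_partition(centers_a, probes) == induced_partition(centers_b, probes)
```

Calling `same_partition(draw_sample(500, random.Random(1)), draw_sample(500, random.Random(2)), 2, seed=0, probes=...)` over many independent sample pairs and averaging the results gives an empirical estimate of how often the two executions agree on the probe points. Comparing partitions rather than center coordinates reflects the definition above, which asks for identical partitions and not identical centers.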