We design replicable algorithms for statistical clustering under the notion of replicability recently introduced by Impagliazzo et al. [2022]. Under this definition, a clustering algorithm is replicable if, with high probability, its output induces exactly the same partition of the sample space across two executions on different inputs drawn from the same distribution, provided its internal randomness is shared between the executions. We propose such algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems, using approximation routines for their combinatorial counterparts in a black-box manner. In particular, we give a replicable $O(1)$-approximation algorithm for statistical Euclidean $k$-medians ($k$-means) with $\operatorname{poly}(d)$ sample complexity. We also describe an $O(1)$-approximation algorithm with an additional $O(1)$ additive error for statistical Euclidean $k$-centers, albeit with $\exp(d)$ sample complexity. Finally, we report experiments on synthetic distributions in 2D that use the $k$-means++ implementation from sklearn as a black box and validate our theoretical results.
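To make the replicability condition concrete, the sketch below checks whether two sets of centers induce the same partition when each point is assigned to its nearest center. It is only an illustration of the definition, not the algorithm of this work: the helper names (`nearest_center`, `induced_partition`, `same_partition`) are hypothetical, and a finite set of probe points stands in for the full sample space, so agreement on the probes is a necessary check rather than a proof of identical partitions.

```python
import math
from typing import FrozenSet, Sequence, Set

Point = Sequence[float]


def nearest_center(x: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to x (Euclidean); ties go to the lowest index."""
    return min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))


def induced_partition(points: Sequence[Point],
                      centers: Sequence[Point]) -> Set[FrozenSet[int]]:
    """Partition of the probe points induced by the centers.

    Returned as a set of frozensets of point indices, so the result does
    not depend on how the centers happen to be ordered or labeled.
    """
    cells = {}
    for idx, x in enumerate(points):
        cells.setdefault(nearest_center(x, centers), set()).add(idx)
    return {frozenset(cell) for cell in cells.values()}


def same_partition(points: Sequence[Point],
                   centers_a: Sequence[Point],
                   centers_b: Sequence[Point]) -> bool:
    """True if both center sets split the probe points identically."""
    return induced_partition(points, centers_a) == induced_partition(points, centers_b)


# Hypothetical probe points and the outputs of two hypothetical executions.
probes = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
run_a = [(0.0, 0.5), (5.5, 5.0)]
run_b = [(5.4, 5.1), (0.1, 0.4)]   # nearly the same centers, listed in another order
run_c = [(0.0, 0.0), (0.0, 1.0)]   # a genuinely different clustering

agree_ab = same_partition(probes, run_a, run_b)
agree_ac = same_partition(probes, run_a, run_c)
```

In an empirical study one would obtain `run_a` and `run_b` by running the same clustering routine with a shared random seed on two independent samples from the same distribution, and estimate the replicability probability as the fraction of repeated trials in which the partitions agree.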