We consider the problem of clustering data points drawn from sub-Gaussian mixtures. Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers. In contrast, clustering methods that appear robust to adversarial perturbations are not known to satisfy the optimal statistical guarantees. We propose a simple robust algorithm based on the coordinatewise median that attains the optimal mislabeling rate even in the presence of adversarial outliers. Our algorithm achieves the optimal error rate in a constant number of iterations when a weak initialization condition is satisfied. In the absence of outliers, in fixed dimensions, our theoretical guarantees are similar to those of the Lloyd algorithm. Extensive experiments on various simulated and public datasets support the theoretical guarantees of our method.
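To make the idea of a coordinatewise-median update concrete, the following is a minimal illustrative sketch, not the algorithm as specified in the paper. It assumes a Lloyd-style alternation in which each point is assigned to its nearest center in squared Euclidean distance and each center is then replaced by the coordinatewise median of its assigned points. The function names (`sq_dist`, `assign`, `median_lloyd`), the fixed iteration count, and the toy data are all hypothetical choices made for illustration.

```python
from statistics import median


def sq_dist(p, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def median_lloyd(points, init_centers, n_iter=10):
    """Lloyd-style iteration with a coordinatewise-median center update.

    Assumption: this mirrors the abstract only at a high level; the
    paper's actual update rule and stopping criterion may differ.
    """
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                # zip(*members) iterates over coordinates, so each
                # coordinate gets its own one-dimensional median.
                centers[j] = [median(coord) for coord in zip(*members)]
    return assign(points, centers), centers


# Toy data: two tight clusters plus one gross outlier (hypothetical values).
points = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (100.0, 100.0),
]
labels, centers = median_lloyd(points, init_centers=[(1.0, 1.0), (4.0, 4.0)])
```

In this toy run the outlier is assigned to the second cluster, yet that cluster's center stays close to (5, 5). A mean-based update on the same four points would instead move the first coordinate to roughly 28.8, which is the kind of sensitivity to outliers that the abstract attributes to the Lloyd algorithm.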