K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd-Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, we seek cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.
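For concreteness, the minimax problem described in the abstract can be written in the form below. This is a sketch of how such a formulation is usually stated, not necessarily the paper's exact statement. The notation is assumed here: $\hat{P}_n$ is the empirical distribution of the $n$ samples $x_1,\dots,x_n$, $c_1,\dots,c_K$ are the cluster centers, $W_2$ is the Wasserstein-2 distance, and $\rho \ge 0$ is the radius of the ambiguity ball.

```latex
\min_{c_1,\dots,c_K \in \mathbb{R}^d} \;
\sup_{Q \,:\, W_2(Q,\, \hat{P}_n) \le \rho} \;
\mathbb{E}_{X \sim Q}\Big[ \min_{1 \le k \le K} \lVert X - c_k \rVert_2^2 \Big]
```

Under this notation, setting $\rho = 0$ collapses the ball to $\hat{P}_n$ itself, and the objective reduces to the usual k-means cost $\frac{1}{n}\sum_{i=1}^{n} \min_{k} \lVert x_i - c_k \rVert_2^2$.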