A well-known difficulty for Minimum Sum-of-Squares Clustering (MSSC, the celebrated $k$-means problem) is handling outliers. In this paper, we propose a partial clustering variant, termed PMSSC, in which a fixed number of points are treated as outliers and removed. We solve PMSSC with Integer Programming formulations and study complexity results that extend those known for MSSC. PMSSC is NP-hard in Euclidean space when the dimension or the number of clusters is greater than $2$. Finally, we study the one-dimensional case: unweighted PMSSC is then solvable in polynomial time by a dynamic programming algorithm, which extends the interval clustering optimality property of MSSC. This result also holds for unweighted $k$-medoids with outliers. A weaker optimality property holds for weighted PMSSC, but whether the weighted problem is NP-hard in dimension one remains an open question.
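To make the one-dimensional result more concrete, the following is a minimal sketch of a dynamic program for unweighted PMSSC on the line. It is not the authors' algorithm. It assumes the interval property stated above: after sorting, some optimal solution uses clusters made of consecutive points, with the discarded outliers lying outside the clusters. The function name `pmssc_1d`, its parameters `k` (maximum number of clusters) and `m` (exact number of outliers to discard), and the $O(n^2 k m)$ recurrence are illustrative choices rather than details taken from the paper.

```python
from itertools import accumulate


def pmssc_1d(points, k, m):
    """Sketch: minimum sum of squared deviations to cluster means when the
    1D points are split into at most k interval clusters and exactly m
    points are discarded as outliers (unweighted case, assumed setting)."""
    xs = sorted(points)
    n = len(xs)
    if k < 1 or not 0 <= m <= n:
        raise ValueError("need k >= 1 and 0 <= m <= len(points)")

    # Prefix sums of x and x^2 give the cost of any interval in O(1).
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(x * x for x in xs))

    def cost(a, b):
        # Sum of squared deviations from the mean for xs[a:b], with a < b.
        total = s1[b] - s1[a]
        return max(0.0, (s2[b] - s2[a]) - total * total / (b - a))

    inf = float("inf")
    # dp[c][i][j]: best cost for the first i sorted points, using exactly
    # c non-empty interval clusters and discarding exactly j of the points.
    dp = [[[inf] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    for i in range(min(n, m) + 1):
        dp[0][i][i] = 0.0  # no cluster: every point seen so far is discarded

    for c in range(1, k + 1):
        for i in range(1, n + 1):
            for j in range(min(i, m) + 1):
                # Option 1: the i-th point is an outlier.
                best = dp[c][i - 1][j - 1] if j >= 1 else inf
                # Option 2: the i-th point closes a cluster xs[a:i].
                for a in range(i):
                    prev = dp[c - 1][a][j]
                    if prev < inf:
                        best = min(best, prev + cost(a, i))
                dp[c][i][j] = best

    # "At most k clusters": take the best over every cluster count.
    return min(dp[c][n][m] for c in range(k + 1))
```

For example, `pmssc_1d([0, 1, 10, 11, 100], k=2, m=1)` discards the point `100` and keeps the clusters `{0, 1}` and `{10, 11}`. The weighted case is deliberately not covered, since the abstract states that only a weaker optimality property holds there.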