Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the $\textit{robust $k$-Means}$ problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of $\textit{classic outlier detection heuristics}$ for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.
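The pipeline described in the abstract (score each point by its $K$-Nearest-Neighbor distance, drop the $z$ highest-scoring points, then run a standard $k$-Means solver on the rest) can be sketched as below. This is a minimal illustration under assumptions, not the authors' implementation: the function names `knn_distances`, `remove_knn_outliers`, and `lloyd_kmeans` are hypothetical, the abstract does not say how $K$ is chosen so it is left as a free parameter, the distances are computed by brute force in $O(n^2)$ memory, and plain Lloyd iterations stand in for whatever standard $k$-Means algorithm one would actually plug in.

```python
import numpy as np

def knn_distances(X, K):
    """Distance from each point to its K-th nearest neighbor (excluding itself)."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column K holds the K-th nearest neighbor distance.
    return np.sort(dist, axis=1)[:, K]

def remove_knn_outliers(X, z, K):
    """Indices of the points kept after dropping the z largest K-NN distances."""
    scores = knn_distances(X, K)
    order = np.argsort(scores)            # ascending: likely inliers first
    return np.sort(order[: len(X) - z])   # keep the n - z lowest scores

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; a stand-in for any standard k-Means solver."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):       # leave an empty cluster's center as is
                centers[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1), d2.min(axis=1).sum()

# Toy data (assumed for illustration): two tight clusters plus two far outliers.
rng = np.random.default_rng(0)
inliers = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                     rng.normal(10.0, 0.5, size=(20, 2))])
outliers = np.array([[1000.0, 1000.0], [-1000.0, 500.0]])
X = np.vstack([inliers, outliers])

keep = remove_knn_outliers(X, z=2, K=3)           # step 1: KNN-based removal
centers, labels, cost = lloyd_kmeans(X[keep], k=2)  # step 2: standard k-Means
```

On this toy input the two planted outliers are far from every other point, so they receive the largest scores and are the ones removed. On real data, the brute-force distance matrix would typically be replaced by a spatial index or an approximate nearest-neighbor search.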