We propose a robust clustering framework for high-dimensional data with heavy tails and a large fraction of irrelevant variables. The method replaces the mean updates of Lloyd's $K$-means with \emph{spatial medians} to enhance robustness. For the assignment step, it admits either a Euclidean rule for computational simplicity or a robust Mahalanobis-type metric constructed from the spatial sign covariance matrix to account for heterogeneous scales and feature dependence. To handle the $p \gg n$ regime, we further introduce a simple \emph{hard feature-exclusion} mechanism that removes weakly separating dimensions based on across-center dispersion, with the exclusion threshold selected automatically via a permutation-based Gap criterion. Simulation studies under correlated Gaussian and multivariate $t$ models demonstrate that the proposed approach provides competitive clustering accuracy and improved stability relative to $K$-means and sparse $K$-means baselines.
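As an illustration of the core alternation described above, the following is a minimal sketch of a Lloyd-style loop with Euclidean assignment and spatial-median updates. It is not the reference implementation. The function names `spatial_median` and `spatial_kmedians`, the use of Weiszfeld's iteration to approximate the spatial median, the mean as its starting point, and the user-supplied initial centers are all assumptions made for this sketch. The Mahalanobis-type assignment rule and the feature-exclusion step are omitted.

```python
import numpy as np


def spatial_median(X, n_iter=100, tol=1e-8):
    """Approximate the spatial (geometric) median of the rows of X
    using Weiszfeld's fixed-point iteration (an assumed solver)."""
    m = X.mean(axis=0)  # assumed starting point; rarely equals a data point
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        # Guard against division by zero; note that the plain Weiszfeld
        # iteration can stall if an iterate lands exactly on a data point.
        w = 1.0 / np.maximum(d, 1e-12)
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m


def spatial_kmedians(X, init_centers, n_iter=50):
    """Alternate Euclidean assignment and spatial-median center updates."""
    centers = np.array(init_centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center in Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: spatial median of each cluster instead of its mean.
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = spatial_median(members)
    return labels, centers
```

Calling `spatial_kmedians(X, init_centers=X[[0, 1]])` on an `(n, p)` array returns the cluster labels and the fitted centers. The only change relative to Lloyd's $K$-means is the update step, which is where robustness to heavy tails enters: a single extreme observation shifts the spatial median far less than it shifts the mean.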