This paper studies high-dimensional sparse clustering, an NP-hard combinatorial problem arising from the bilinear coupling between cluster assignment and feature selection. We analyze semidefinite programming (SDP) relaxations of $K$-means and establish minimax separation bounds, showing that these relaxations are theoretically robust to feature over-selection: exact recovery is preserved even in the presence of non-informative features. Leveraging this robustness, we propose a block-coordinate ascent framework that alternates between SDP-based clustering and non-conservative feature selection. To address the tendency of deterministic greedy methods to become trapped in local optima, we formulate the feature-selection step as a Thompson sampling bandit problem. This approach introduces adaptive memory by aggregating historical variable-selection outcomes into posterior distributions and selects features via posterior sampling, enabling stochastic exploration that promotes the inclusion of under-explored features and facilitates escape from local maxima. We establish conditions for consistent variable selection and exact clustering recovery, and extend the method to settings with unknown covariance through a scalable, inverse-free estimation procedure. Numerical experiments demonstrate that the proposed memory-driven approach consistently outperforms state-of-the-art sparse clustering methods.
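The Thompson-sampling feature-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each feature keeps a Beta posterior that aggregates its historical selection outcomes, and a feature subset is chosen by posterior sampling. The sparsity level `s`, the Bernoulli reward model, and the ground-truth informative set are hypothetical placeholders; in the proposed framework the reward would instead be derived from the SDP clustering objective.

```python
import numpy as np

def thompson_select(alpha, beta, s, rng):
    """Draw one Beta(alpha_j, beta_j) sample per feature; keep the top s."""
    draws = rng.beta(alpha, beta)
    return np.argsort(draws)[-s:]

def update_posterior(alpha, beta, selected, reward):
    """Aggregate the binary round outcome into the selected features' posteriors."""
    if reward:
        alpha[selected] += 1.0
    else:
        beta[selected] += 1.0

rng = np.random.default_rng(0)
p, s = 20, 5                            # total features and sparsity level (illustrative)
alpha = np.ones(p)                      # uniform Beta(1, 1) priors
beta = np.ones(p)
informative = set(range(s))             # hypothetical ground truth for this demo

for t in range(200):
    chosen = thompson_select(alpha, beta, s, rng)
    # Stand-in stochastic reward: success probability grows with the overlap
    # between the chosen set and the informative features. This mimics a
    # clustering-quality signal; the paper's reward comes from the SDP objective.
    overlap = len(informative.intersection(chosen))
    reward = rng.random() < overlap / s
    update_posterior(alpha, beta, chosen, reward)

# Features ranked by posterior mean; under-explored features retain wide
# posteriors and so keep being sampled, which is the escape mechanism.
print(sorted(np.argsort(alpha / (alpha + beta))[-s:]))
```

Posterior sampling here plays the role of the "adaptive memory": past successes inflate a feature's `alpha`, past failures its `beta`, while features that were rarely selected keep high-variance posteriors and continue to be explored.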