This paper investigates the computational and statistical limits of clustering matrix-valued observations. We propose a low-rank mixture model (LrMM), adapted from the classical Gaussian mixture model (GMM) to handle matrix-valued observations, which assumes that the population center matrices are low-rank. A computationally efficient clustering method is designed by integrating Lloyd's algorithm with low-rank approximation. Once well-initialized, the algorithm converges fast and achieves an exponential-type clustering error rate that is minimax optimal. Meanwhile, we show that a tensor-based spectral method delivers a good initial clustering. As with GMM, the minimax optimal clustering error rate is determined by the separation strength, i.e., the minimal distance between population center matrices. By exploiting low-rankness, the proposed algorithm requires a weaker condition on the separation strength. Unlike GMM, however, the computational difficulty of LrMM is characterized by the signal strength, i.e., the smallest non-zero singular values of population center matrices. We provide evidence that no polynomial-time algorithm is consistent if the signal strength is not strong enough, even when the separation strength is strong. Intriguing differences between estimation and clustering under LrMM are discussed. The merits of the low-rank Lloyd's algorithm are confirmed by comprehensive simulation experiments. Finally, our method outperforms others in the literature on real-world datasets.
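To make the idea of combining Lloyd's algorithm with low-rank approximation concrete, the following is a minimal sketch in Python with NumPy, not the paper's exact procedure. It assumes the number of clusters `K` and the rank `r` are known, that an initial labeling is supplied by the caller (the paper obtains it from a tensor-based spectral method, which is not reproduced here), and that distances are measured in Frobenius norm. The function names `truncate_rank` and `low_rank_lloyd`, and the handling of empty clusters, are illustrative choices rather than anything taken from the paper.

```python
import numpy as np


def truncate_rank(M, r):
    """Best rank-r approximation of M in Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def low_rank_lloyd(X, init_labels, K, r, n_iter=20):
    """Illustrative low-rank variant of Lloyd's algorithm.

    X           : array of shape (n, d1, d2), one matrix per observation
    init_labels : length-n integer array with values in {0, ..., K-1}
    K, r        : assumed-known number of clusters and center rank
    """
    labels = np.asarray(init_labels).copy()
    for _ in range(n_iter):
        centers = []
        for k in range(K):
            members = X[labels == k]
            if len(members) == 0:
                # Arbitrary fallback for an empty cluster (an assumption of
                # this sketch): use the overall sample mean.
                mean_k = X.mean(axis=0)
            else:
                mean_k = members.mean(axis=0)
            # Low-rank step: project the cluster mean onto rank-r matrices.
            centers.append(truncate_rank(mean_k, r))
        centers = np.stack(centers)

        # Lloyd step: assign each observation to the nearest center,
        # using squared Frobenius distance. dists has shape (n, K).
        diffs = X[:, None, :, :] - centers[None, :, :, :]
        dists = np.sum(diffs ** 2, axis=(2, 3))
        new_labels = dists.argmin(axis=1)

        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers
```

The only change relative to ordinary Lloyd's iterations is the call to `truncate_rank` on each cluster mean. Truncating the mean discards noise outside the leading `r` singular directions, which is the intuition behind the weaker separation requirement described above.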