Tensor clustering, which seeks to extract underlying cluster structures from noisy tensor observations, has gained increasing attention. One extensively studied model for tensor clustering is the tensor block model, which postulates the existence of clustering structures along each mode and has found broad applications in areas such as multi-tissue gene expression analysis and multilayer network analysis. However, currently available computationally feasible methods for tensor clustering are either limited to i.i.d. sub-Gaussian noise or suffer from suboptimal statistical performance, which limits their utility in applications that involve heteroskedastic data and/or a low signal-to-noise ratio (SNR). To overcome these challenges, we propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$), which first performs tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$ and then applies approximate $k$-means to recover the cluster memberships of the nodes. Encouragingly, our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit (ignoring logarithmic factors); here, the SNR refers to the ratio of the pairwise disparity between nodes to the noise level, and the computational limit is the lowest SNR at which exact clustering is achievable in polynomial time. Comprehensive simulations and real-data experiments suggest that our algorithm outperforms existing algorithms across various settings, delivering more reliable clustering performance.
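To make the two-stage structure concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes an order-3 tensor block model of the commonly used form $\mathcal{Y} = \mathcal{S} \times_1 M_1 \times_2 M_2 \times_3 M_3 + \mathcal{E}$ with membership matrices $M_i$ and independent Gaussian noise whose variance differs across entries. Because the abstract does not spell out $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, the subspace-estimation stage is replaced here by a plain truncated SVD of each mode unfolding, and the approximate $k$-means stage by Lloyd's algorithm with farthest-point initialization. All function names and parameter values (`simulate_block_tensor`, `spectral_cluster`, the noise scale, and so on) are hypothetical.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-`mode` matricization: rows index the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def simulate_block_tensor(dims, ks, noise_scale, rng):
    """Draw a block-structured tensor plus heteroskedastic Gaussian noise."""
    # Balanced, randomly permuted cluster labels along each mode.
    labels = [rng.permutation(np.arange(n) % k) for n, k in zip(dims, ks)]
    core = rng.normal(scale=3.0, size=ks)      # block means (assumed values)
    signal = core[np.ix_(*labels)]             # expand the core to full size
    # Entrywise standard deviations vary, so the noise is heteroskedastic.
    sigma = noise_scale * rng.uniform(0.2, 1.0, size=dims)
    return signal + sigma * rng.normal(size=dims), labels


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return assign


def spectral_cluster(tensor, ks):
    """Stage 1: truncated SVD per mode (stand-in). Stage 2: k-means on rows."""
    estimates = []
    for mode, k in enumerate(ks):
        u, s, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        embedding = u[:, :k] * s[:k]           # rows of the rank-k projection
        estimates.append(kmeans(embedding, k))
    return estimates


def same_partition(a, b):
    """True if two labelings agree up to a renaming of the clusters."""
    a, b = np.asarray(a), np.asarray(b)
    return np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :])


rng = np.random.default_rng(0)
tensor, truth = simulate_block_tensor((30, 30, 30), (3, 3, 3), 0.5, rng)
estimated = spectral_cluster(tensor, (3, 3, 3))
print([same_partition(e, t) for e, t in zip(estimated, truth)])
```

The simulated setting has a high SNR, so even the plain SVD stand-in is expected to separate the clusters; the regime the abstract targets (low SNR with strongly heteroskedastic noise) is where a plain SVD can break down and a HeteroPCA-type subspace estimator is meant to help.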