Matrix-valued data has become increasingly prevalent in many applications. Most existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings or when mean information is not available. To extract the information in the dependence structure for clustering, we propose a new latent variable model for features arranged in matrix form, with unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms that use the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions it requires depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish a minimax lower bound for clustering under our latent variable model in terms of a cluster separation metric. Given these results, we identify the optimal weight, in the sense that using this weight guarantees that our algorithm is minimax rate-optimal. We also discuss the practical implementation of our algorithm with the optimal weight. Simulation studies show that our algorithm performs better than existing methods in terms of the adjusted Rand index (ARI). Applied to a genomic dataset, the method yields meaningful interpretations.
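To make the idea of clustering rows by covariance differences concrete, the following is a minimal sketch and not the algorithm specified in the paper. Several pieces are assumptions made for illustration only: the weighted covariance is taken to be the sample average of X W X^T over centered observations, the dissimilarity between rows a and b is the largest absolute difference between their covariances with any third row, the weight is a scaled identity rather than the optimal weight, and the tree is built with average linkage. All function names (`weighted_row_covariance`, `covariance_difference`, `cluster_rows`) are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def weighted_row_covariance(samples, weight):
    """Sample average of X W X^T over centered observations.

    samples has shape (n, p, q); weight has shape (q, q).
    This form of the weighted covariance is an assumption.
    """
    centered = samples - samples.mean(axis=0)
    total = np.einsum("nij,jk,nlk->il", centered, weight, centered)
    return total / samples.shape[0]


def covariance_difference(sigma):
    """Assumed dissimilarity: d(a, b) = max over c not in {a, b} of
    |sigma[a, c] - sigma[b, c]|. Requires at least three rows."""
    p = sigma.shape[0]
    dist = np.zeros((p, p))
    for a in range(p):
        for b in range(a + 1, p):
            mask = np.ones(p, dtype=bool)
            mask[[a, b]] = False  # drop the two rows being compared
            gap = np.max(np.abs(sigma[a, mask] - sigma[b, mask]))
            dist[a, b] = dist[b, a] = gap
    return dist


def cluster_rows(samples, weight, n_clusters):
    """Hierarchical clustering of the rows from the dissimilarity above."""
    sigma = weighted_row_covariance(samples, weight)
    dist = covariance_difference(sigma)
    # Average linkage is an illustrative choice, not taken from the paper.
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data from a latent model X = A Z B^T + noise, where A and B are
# membership matrices for rows and columns (sizes chosen for illustration).
rng = np.random.default_rng(0)
A = np.repeat(np.eye(2), 3, axis=0)  # 6 rows in 2 row clusters
B = np.repeat(np.eye(2), 2, axis=0)  # 4 columns in 2 column clusters
Z = rng.normal(size=(2000, 2, 2))    # latent matrices, zero mean
X = A @ Z @ B.T + 0.1 * rng.normal(size=(2000, 6, 4))

labels = cluster_rows(X, weight=np.eye(4) / 4, n_clusters=2)
print(labels)
```

In this toy setup the observations have zero mean, so a mean-based method has nothing to separate the rows with, whereas rows in the same cluster share the same covariances with every other row. Clustering the columns would follow the same pattern after transposing each observation and supplying a weight of matching size.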