Covariance matrices provide a valuable source of information about complex interactions and dependencies within data. From a clustering perspective, however, this information has often been overlooked: commonly adopted distance-based approaches tend to rely primarily on mean levels to characterize and differentiate between groups. Recently, there have been promising efforts to cluster covariance matrices directly, thereby distinguishing groups solely on the basis of the relationships between variables. From a model-based perspective, a probabilistic formalization has been provided by a mixture model whose component densities follow a Wishart distribution. This approach nevertheless struggles when the number of variables is large, because the number of parameters to be estimated grows quadratically with the number of variables. To address this issue, we propose a sparse Wishart mixture model, which assumes that the component scale matrices possess a cluster-dependent degree of sparsity. Model estimation is performed by maximizing a penalized log-likelihood that imposes a covariance graphical lasso penalty on the component scale matrices. This penalty not only reduces the number of non-zero parameters, mitigating the challenges of high-dimensional settings, but also enhances the interpretability of the results by emphasizing the most relevant relationships among variables. The proposed methodology is tested on both simulated and real data, demonstrating its ability to unravel the complexities of neuroimaging data and to cluster subjects effectively based on the relational patterns among distinct brain regions.
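As an illustration of the kind of objective described above, the following minimal sketch evaluates a penalized log-likelihood for a Wishart mixture. It is not the authors' implementation. The function name `penalized_loglik`, the Wishart parameterization (degrees of freedom plus scale matrix, as in `scipy.stats.wishart`), the single penalty strength `lam` shared by all components, and the choice to penalize only off-diagonal entries of each scale matrix are all assumptions made here for illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import wishart


def penalized_loglik(mats, weights, dofs, scales, lam):
    """Penalized log-likelihood of a Wishart mixture (illustrative sketch).

    mats    : array (n, p, p) of observed symmetric positive-definite matrices
    weights : array (K,) of mixing proportions summing to 1
    dofs    : array (K,) of degrees of freedom, each at least p
    scales  : array (K, p, p) of component scale matrices
    lam     : non-negative penalty strength (assumed shared across components)
    """
    n, K = len(mats), len(weights)
    log_terms = np.empty((n, K))
    for k in range(K):
        for i in range(n):
            # log(pi_k) + log Wishart density of observation i under component k
            log_terms[i, k] = np.log(weights[k]) + wishart.logpdf(
                mats[i], df=dofs[k], scale=scales[k]
            )
    # Mixture log-likelihood: sum over observations of log-sum-exp over components
    loglik = logsumexp(log_terms, axis=1).sum()
    # Assumed lasso-type penalty: L1 norm of the off-diagonal scale entries
    penalty = sum(np.abs(S).sum() - np.abs(np.diag(S)).sum() for S in scales)
    return loglik - lam * penalty
```

Maximizing such an objective over the scale matrices would drive some off-diagonal entries to exactly zero, which is the source of the sparsity and interpretability discussed in the abstract. A cluster-dependent degree of sparsity could arise either from the data or from component-specific penalty strengths; the abstract does not say which, so the shared `lam` here is only a simplification.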