Clustering multivariate data is a pervasive task in many applied problems, particularly in the social and life sciences. Model-based approaches to clustering rely on mixture models, where each mixture component corresponds to the kernel of a distribution characterizing a latent sub-group. Current methods developed within this framework employ multivariate distributions built under the assumption that variables are independent given the cluster allocation. As a result, dependence structures that may characterize differences across groups are not directly accounted for during the clustering process. In this paper, we consider multivariate categorical data and introduce a model-based clustering method that employs graphical models to encode dependencies between variables. Specifically, we consider a Dirichlet Process mixture of categorical graphical models, which clusters individuals into groups that are homogeneous in terms of dependence (graphical) structure and associated parameters. We provide full Bayesian inference for the model and develop a Markov chain Monte Carlo scheme for posterior analysis. Our method is evaluated through simulations and applied to real case studies, including the analysis of genomic data and voting records. The results show the merits of graphical model-based clustering compared with approaches that do not explicitly account for dependencies in the multivariate distribution of the variables.