Graphical model-based clustering of categorical data

Clustering multivariate data is a pervasive task in many applied problems, particularly in social studies and the life sciences. Model-based approaches to clustering rely on mixture models, where each mixture component corresponds to the kernel of a distribution characterizing a latent sub-group. Current methods developed within this framework employ multivariate distributions built under the assumption of independence among variables given the cluster allocation. Accordingly, possible dependence structures characterizing differences across groups are not directly accounted for during the clustering process. In this paper we consider multivariate categorical data and introduce a model-based clustering method that employs graphical models as a tool to encode dependencies between variables. Specifically, we consider a Dirichlet Process mixture of categorical graphical models, which clusters individuals into groups that are homogeneous in terms of dependence (graphical) structure and allied parameters. We provide full Bayesian inference for the model and develop a Markov chain Monte Carlo scheme for posterior analysis. Our method is evaluated through simulations and applied to real case studies, including the analysis of genomic data and voting records. Results reveal the merits of graphical model-based clustering in comparison with approaches that do not explicitly account for dependencies in the multivariate distribution of variables.
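To illustrate only the Dirichlet Process ingredient mentioned above: a Dirichlet Process induces a prior over partitions of individuals that can be simulated with the Chinese restaurant process, so the number of clusters need not be fixed in advance. The sketch below is not the paper's MCMC scheme and does not include the categorical graphical-model kernels; the function name `sample_crp_partition`, the concentration parameter `alpha`, and the `seed` argument are illustrative assumptions.

```python
import random


def sample_crp_partition(n, alpha, seed=0):
    """Draw cluster labels for n individuals from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values tend to
    produce more clusters. Names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] = cluster index assigned to individual i
    sizes = []   # sizes[k] = number of individuals currently in cluster k
    for _ in range(n):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = sample_crp_partition(n=20, alpha=1.0, seed=42)
    print("labels:", labels)
    print("cluster sizes:", sizes)
```

In a full sampler of the kind the abstract describes, each cluster would additionally carry its own graph and parameters, and the allocation weights would be multiplied by the likelihood of an individual's data under each cluster; that part is omitted here.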



