A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method uses the Hamming distance to define a family of probability mass functions for modeling the data. The elements of this family are then used as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference is derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, which simplifies computation relative to customary reversible jump algorithms. When the number of components is fixed, the proposed model encompasses a parsimonious latent class model as a special case. Model performance is assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches.
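To make the idea of a Hamming-distance kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible form of such a probability mass function: a center vector `center`, a single shared scale `sigma` > 0, and mass proportional to exp(-d_H(x, center) / sigma), where d_H counts mismatched attributes. The abstract does not state the parameterization, so the function names, the shared scale, and the integer coding of categories are all illustrative assumptions.

```python
import itertools
import math


def hamming_distance(x, c):
    """Number of positions at which two equal-length category vectors differ."""
    return sum(1 for xj, cj in zip(x, c) if xj != cj)


def hamming_pmf(x, center, sigma, n_levels):
    """Assumed kernel: p(x) proportional to exp(-d_H(x, center) / sigma).

    n_levels[j] is the number of categories of attribute j. Because the
    unnormalized mass factorizes over attributes, the normalizing constant
    is the product over j of 1 + (n_levels[j] - 1) * exp(-1 / sigma).
    """
    weight = math.exp(-hamming_distance(x, center) / sigma)
    norm = math.prod(1.0 + (m - 1) * math.exp(-1.0 / sigma) for m in n_levels)
    return weight / norm


# Hypothetical example: three attributes with 3, 2, and 4 categories.
n_levels = [3, 2, 4]
center = (0, 1, 2)
sigma = 0.5

# Enumerate the whole sample space to check that the masses sum to one.
support = list(itertools.product(*(range(m) for m in n_levels)))
total = sum(hamming_pmf(x, center, sigma, n_levels) for x in support)
mode = max(support, key=lambda x: hamming_pmf(x, center, sigma, n_levels))
```

Under this assumed form the center is the mode of the distribution, and smaller values of `sigma` concentrate more mass on it, which is what makes such a kernel a natural building block for mixture components.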