Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favours similarity to the component labels. Unfortunately, although these approaches are routinely implemented, clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components. To address this problem, we develop Fusing of Localized Densities (FOLD), a novel clustering method that melds components together using the posterior of the kernels. FOLD has a fully Bayesian decision-theoretic justification, naturally leads to uncertainty quantification, can be easily implemented as an add-on to MCMC algorithms for mixtures, and favours a small number of distinct clusters. We provide theoretical support for FOLD, including clustering optimality under kernel misspecification. In simulated experiments and on real data, FOLD outperforms competitors by minimizing the number of clusters while inferring meaningful group structure.
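To make the idea of melding components through the posterior of the kernels concrete, the following is a minimal sketch and not the authors' implementation. It assumes univariate Gaussian kernels, uses the Hellinger distance as the measure of similarity between kernels, and replaces the decision-theoretic loss minimization with a simple threshold rule on the posterior mean distance. The function names, the threshold value, and the toy "posterior draws" are all hypothetical stand-ins for real MCMC output.

```python
import numpy as np

def hellinger_gauss_1d(mu1, s1, mu2, s2):
    """Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2)."""
    var_sum = s1**2 + s2**2
    # Bhattacharyya coefficient for two univariate Gaussians.
    bc = np.sqrt(2.0 * s1 * s2 / var_sum) * np.exp(-0.25 * (mu1 - mu2) ** 2 / var_sum)
    return np.sqrt(np.clip(1.0 - bc, 0.0, None))

def expected_pairwise_distance(mu_draws, sd_draws):
    """Average, over draws, the distance between each pair of observations' kernels.

    mu_draws, sd_draws: arrays of shape (T, n) holding, for each of T draws,
    the mean and standard deviation of the kernel assigned to each observation.
    """
    T, n = mu_draws.shape
    D = np.zeros((n, n))
    for t in range(T):
        m = mu_draws[t][:, None]
        s = sd_draws[t][:, None]
        D += hellinger_gauss_1d(m, s, m.T, s.T)
    return D / T

def merge_by_threshold(D, threshold):
    """Assumed stand-in for loss minimization: link observations with D < threshold."""
    n = D.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            for k in np.flatnonzero((D[j] < threshold) & (labels < 0)):
                labels[k] = current
                stack.append(k)
        current += 1
    return labels

# Toy stand-in for MCMC output: three components, two of which overlap heavily.
component_labels = np.array([0, 0, 1, 1, 2, 2])
mean_draws = np.array([[0.0, 0.5, 10.0],
                       [0.1, 0.4, 9.9],
                       [-0.1, 0.6, 10.1]])
mu_draws = mean_draws[:, component_labels]   # shape (3 draws, 6 observations)
sd_draws = np.ones_like(mu_draws)            # unit standard deviation throughout

D = expected_pairwise_distance(mu_draws, sd_draws)
clusters = merge_by_threshold(D, threshold=0.5)
print(clusters)
```

In this toy input the first two components have nearly identical kernels, so their observations are fused into one cluster, while the third component stays separate. In a real analysis the component labels would also vary across draws, and the threshold rule would be replaced by whatever loss the method actually minimizes.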