Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favours similarity to the component labels. Unfortunately, although these approaches are routinely implemented, clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components. To address this problem, we develop Fusing of Localized Densities (FOLD), a novel clustering method that melds components together using the posterior of the kernels. FOLD has a fully Bayesian decision-theoretic justification, naturally leads to uncertainty quantification, can be easily implemented as an add-on to MCMC algorithms for mixtures, and favours a small number of distinct clusters. We provide theoretical support for FOLD, including clustering optimality under kernel misspecification. In simulated experiments and on real data, FOLD outperforms competitors by minimizing the number of clusters while inferring meaningful group structure.
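The idea of melding components whose kernels are similar can be illustrated with a toy sketch. The code below is not the FOLD procedure itself: it works on a single set of fitted one-dimensional Gaussian components rather than on posterior samples, and it has no decision-theoretic loss. The use of the Hellinger distance, the single-linkage merge rule, the function names `hellinger_gauss` and `merge_components`, and the threshold value are all assumptions made for illustration.

```python
import math


def hellinger_gauss(m1, s1, m2, s2):
    """Hellinger distance between N(m1, s1^2) and N(m2, s2^2) (closed form)."""
    var_sum = s1 * s1 + s2 * s2
    # Bhattacharyya coefficient of the two Gaussian densities.
    bc = math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(
        -((m1 - m2) ** 2) / (4.0 * var_sum)
    )
    return math.sqrt(max(0.0, 1.0 - bc))


def merge_components(components, threshold):
    """Give components one label when their kernels are closer than threshold.

    components: list of (mean, std) pairs, one per mixture component.
    threshold: hypothetical cut-off on the Hellinger distance, in [0, 1].
    Merging is transitive (single linkage), implemented with union-find.
    """
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            if hellinger_gauss(*components[i], *components[j]) < threshold:
                parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(components))]


# Two overlapping components (as might arise when one slightly non-Gaussian
# cluster is split in two) and one well-separated component.
components = [(0.0, 1.0), (0.8, 1.2), (6.0, 1.0)]
labels = merge_components(components, threshold=0.5)
print(labels)  # [0, 0, 1]
```

Here the first two components are about 0.27 apart in Hellinger distance and are fused into one cluster, while the third stays separate. A fixed threshold is the crudest possible merge rule; it stands in for the loss-based decision described in the abstract only to make the mechanism concrete.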