Bayesian nonparametric mixture models are widely used to cluster observations. However, one major drawback of the approach is that the estimated partition often exhibits unbalanced cluster frequencies, with only a few dominating clusters and a large number of sparsely populated ones. This feature leads to results that are often uninterpretable unless we are willing to ignore a substantial number of observations and clusters. Interpreting the posterior distribution as a penalized likelihood, we show how the imbalance can be explained as a direct consequence of the cost functions involved in estimating the partition. In light of our findings, we propose a novel Bayesian estimator of the clustering configuration. The proposed estimator is equivalent to a post-processing procedure that reduces the number of sparsely populated clusters and enhances interpretability. The procedure takes the form of entropy regularization of the Bayesian estimate. While being computationally convenient compared with alternative strategies, it is also theoretically justified as a correction to the Bayesian loss function used for point estimation and, as such, can be applied to any posterior distribution of clusters, regardless of the specific model used.
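
To make the idea of entropy regularization as a post-processing step concrete, the following is a minimal sketch and not the estimator defined in the paper. It assumes that posterior draws of the partition are available as label vectors, that the base loss is Binder's loss with unit costs, and that the penalty is a weight `lam` times the Shannon entropy of the cluster frequencies. The search is restricted to the sampled partitions for simplicity. All function names and the parameter `lam` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def partition_entropy(labels):
    """Shannon entropy of the cluster-frequency distribution of a partition."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def co_clustering_probs(samples):
    """Estimate, for each pair (i, j), the posterior probability of sharing a cluster."""
    n = len(samples[0])
    return {
        (i, j): sum(s[i] == s[j] for s in samples) / len(samples)
        for i, j in combinations(range(n), 2)
    }


def expected_binder_loss(labels, probs):
    """Posterior expected Binder loss (unit costs) of a candidate partition."""
    return sum(
        (1.0 - p) if labels[i] == labels[j] else p
        for (i, j), p in probs.items()
    )


def entropy_regularized_estimate(samples, lam):
    """Pick the sampled partition minimizing expected loss + lam * entropy.

    Assumption: the penalty form and the restriction of the search to the
    sampled partitions are illustrative choices, not taken from the paper.
    """
    probs = co_clustering_probs(samples)
    return min(
        samples,
        key=lambda c: expected_binder_loss(c, probs) + lam * partition_entropy(c),
    )


# Toy posterior draws over six observations (hypothetical data).
samples = [
    [0, 0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1, 1],
]
unpenalized = entropy_regularized_estimate(samples, lam=0.0)
penalized = entropy_regularized_estimate(samples, lam=3.0)
```

Because merging two clusters never increases the entropy of a partition, a positive `lam` in this sketch favors candidates in which small clusters are absorbed into larger ones. With `lam=0.0` the rule reduces to the usual minimization of the expected Binder loss over the sampled partitions.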