The Bayesian approach to clustering is often appreciated for its ability to provide uncertainty in the partition structure. However, summarizing the posterior distribution over the clustering structure can be challenging, due to the discrete, unordered nature and massive dimensionality of the space. While recent advancements provide a single clustering estimate to represent the posterior, this ignores uncertainty and may even be unrepresentative in instances where the posterior is multimodal. To enhance our understanding of uncertainty, we propose a WASserstein Approximation for Bayesian clusterIng (WASABI), which summarizes the posterior samples with not one, but multiple clustering estimates, each corresponding to a different part of the partition space that receives substantial posterior mass. Specifically, we find such clustering estimates by approximating the posterior distribution in a Wasserstein distance sense, equipped with a suitable metric on the partition space. An interesting byproduct is that a locally optimal solution can be found using a k-medoids-like algorithm on the partition space, dividing the posterior samples into groups that are each represented by one of the clustering estimates. Using synthetic and real datasets, we show that WASABI helps to improve the understanding of uncertainty, particularly when clusters are not well separated or when the employed model is misspecified.
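The k-medoids-style summarization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes posterior samples are stored as integer label vectors, uses the variation of information as the partition metric (the method allows any suitable metric on the partition space), and all function names are our own.

```python
import numpy as np

def variation_of_information(a, b):
    """Variation of information between two clusterings given as label vectors.

    VI(a, b) = H(a | b) + H(b | a); it is zero iff the partitions coincide.
    """
    a, b = np.asarray(a), np.asarray(b)
    vi = 0.0
    for i in np.unique(a):
        for j in np.unique(b):
            p_i = np.mean(a == i)            # cluster proportion in partition a
            p_j = np.mean(b == j)            # cluster proportion in partition b
            p_ij = np.mean((a == i) & (b == j))  # joint proportion
            if p_ij > 0:
                vi -= p_ij * (np.log(p_ij / p_i) + np.log(p_ij / p_j))
    return vi

def k_medoids_partitions(samples, k, dist=variation_of_information,
                         max_iter=50, seed=0):
    """Summarize posterior partition samples by k medoid partitions.

    Greedy alternation: assign each sample to its nearest medoid under
    `dist`, then update each medoid to the member minimizing the total
    within-group distance. Returns the medoid partitions and assignments.
    """
    rng = np.random.default_rng(seed)
    m = len(samples)
    # Precompute the pairwise distance matrix over posterior samples.
    D = np.array([[dist(samples[i], samples[j]) for j in range(m)]
                  for i in range(m)])
    medoids = rng.choice(m, size=k, replace=False)
    for _ in range(max_iter):
        assign = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.where(assign == c)[0]
            if len(members) > 0:
                # Medoid = member with smallest summed distance to its group.
                new[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return [samples[i] for i in medoids], assign
```

With uniform weights on the posterior samples, the objective minimized here is the sample-based analogue of the Wasserstein approximation: each group of samples is transported to its medoid partition, and the sum of within-group distances is the total transport cost.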