Clustering is an important task in many areas of knowledge, including medicine and epidemiology, genomics, environmental science, economics, and visual sciences. Methods for inferring the number of clusters have often been proven inconsistent, and introducing a dependence structure among the clusters adds further difficulties to the estimation process. In a Bayesian setting, clustering is performed by treating the unknown partition as a random object and defining a prior distribution on it. This prior may be induced by a model on the observations or defined directly on the partition. Several recent results, however, have shown how difficult it is to estimate the number of clusters, and therefore the partition, consistently. Summarising the posterior distribution on the partition is itself an open problem, given how large the partition space is. This work reviews the Bayesian approaches to clustering available in the literature and presents the advantages and disadvantages of each in order to suggest future lines of research.