Clustering is an important task in many areas of knowledge, including medicine and epidemiology, genomics, environmental science, economics, and visual sciences. Methods for inferring the number of clusters have often been shown to be inconsistent, and introducing a dependence structure among the clusters adds further difficulties to the estimation process. In a Bayesian setting, clustering is performed by treating the unknown partition as a random object and defining a prior distribution on it. This prior distribution may be induced by models on the observations or defined directly on the partition. Several recent results, however, have shown how difficult it is to estimate the number of clusters, and therefore the partition, consistently. Summarising the posterior distribution on the partition is itself an open problem, given the large dimension of the partition space. This work reviews the Bayesian approaches to clustering available in the literature and presents the advantages and disadvantages of each in order to suggest future lines of research.