Clustering is a well-known and widely studied problem. One of its variants, contiguity-constrained clustering, accepts a graph as a second input. The graph encodes prior information about cluster structure through contiguity constraints, i.e., each cluster must form a connected subgraph of this graph. This paper discusses the value of such a setting and proposes a new way to formalise it in a Bayesian framework, using results on spanning trees to compute the posterior probabilities of candidate partitions exactly. An algorithmic solution is then investigated to find a maximum a posteriori (MAP) partition and to extract a Bayesian dendrogram from it. The value of this dendrogram, which is reminiscent of the classical output of a simple hierarchical clustering algorithm, is analysed. Finally, the proposed approach is demonstrated on real applications. A reference implementation is available in the R package gtclust that accompanies the paper (available at http://github.com/comeetie/gtclust).
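The contiguity constraint stated in the abstract (each cluster must form a connected subgraph of the input graph) can be illustrated with a minimal Python sketch. This is not taken from gtclust; the function name `is_contiguous`, the edge-list input, and the `labels` dictionary are assumptions made purely for illustration.

```python
from collections import deque


def is_contiguous(edges, labels):
    """Return True if every cluster induces a connected subgraph.

    edges  : iterable of (u, v) pairs describing an undirected graph
    labels : dict mapping each node to its cluster label (hypothetical format)
    """
    # Keep only edges whose endpoints share a cluster label.
    adj = {node: [] for node in labels}
    for u, v in edges:
        if labels[u] == labels[v]:
            adj[u].append(v)
            adj[v].append(u)

    seen = set()
    clusters_started = set()
    for start in labels:
        if start in seen:
            continue
        # An unseen node whose cluster was already explored means
        # that cluster is split into at least two components.
        if labels[start] in clusters_started:
            return False
        clusters_started.add(labels[start])

        # Breadth-first search within the cluster.
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return True
```

On a path graph with edges `[(0, 1), (1, 2), (2, 3)]`, the partition `{0: "a", 1: "a", 2: "b", 3: "b"}` satisfies the constraint, whereas `{0: "a", 1: "b", 2: "a", 3: "b"}` does not, because neither cluster is connected.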