In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance on real data compared with existing approaches such as UPGMA, Ward's method, and HDBSCAN.
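The merge rule described in the abstract (join the pair of clusters with the largest average dot product) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dot_product_agglomerative`, the naive cubic-time search, and the returned merge format are assumptions made here for illustration only.

```python
import numpy as np


def dot_product_agglomerative(X):
    """Naive agglomerative clustering that merges by maximum average dot product.

    Hypothetical helper for illustration. Returns a list of merges
    (node_a, node_b, affinity, size): leaves are labelled 0..n-1 and the
    cluster created at step t is labelled n + t.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # S[i, j] holds the average dot product between the clusters in slots i and j.
    S = X @ X.T
    sizes = np.ones(n)
    labels = list(range(n))          # node label of the cluster held in each slot
    active = np.ones(n, dtype=bool)  # slots that still hold a live cluster
    merges = []
    for step in range(n - 1):
        # Hide retired slots and self-pairs, then pick the most similar pair.
        masked = S.copy()
        masked[~active, :] = -np.inf
        masked[:, ~active] = -np.inf
        np.fill_diagonal(masked, -np.inf)
        i, j = np.unravel_index(np.argmax(masked), masked.shape)
        i, j = int(i), int(j)
        total = sizes[i] + sizes[j]
        merges.append((labels[i], labels[j], float(S[i, j]), int(total)))
        # The average dot product of the merged cluster with any other cluster
        # is the size-weighted mean of the two previous averages.
        S[i, :] = (sizes[i] * S[i, :] + sizes[j] * S[j, :]) / total
        S[:, i] = S[i, :]
        sizes[i] = total
        labels[i] = n + step
        active[j] = False            # slot j is absorbed into slot i
    return merges
```

Unlike distance-based linkages, the affinity here is maximised rather than minimised, so the merge heights in the resulting tree decrease from the leaves towards the root.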