Agglomerative hierarchical clustering is one of the most widely used approaches for exploring how observations in a dataset relate to each other. However, its greedy nature makes it highly sensitive to small perturbations in the data, often producing different clustering results and making it difficult to separate genuine structure from spurious patterns. In this paper, we show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures based on the clustering results. We propose a simple randomization scheme together with a method for constructing a valid p-value at each node of the hierarchical clustering dendrogram that quantifies evidence against performing the greedy merge. Our test controls the Type I error rate and works with any hierarchical linkage without case-specific derivations; simulations show it is substantially more powerful than existing selective inference approaches. To demonstrate the practical utility of our p-values, we develop an adaptive $\alpha$-spending procedure that estimates the number of clusters, with a probabilistic guarantee on overestimation. Experiments on simulated and real data show that this estimate yields powerful clustering and can be used, for example, to assess clustering stability across multiple runs of the randomized algorithm.