Hierarchical Agglomerative Clustering (HAC) is a widely used clustering method that repeatedly merges the closest pair of clusters, where inter-cluster distances are determined by a linkage function. Unlike many clustering methods, HAC does not optimize a single explicit global objective; clustering quality is therefore evaluated primarily empirically, and the choice of linkage function plays a crucial role in practice. However, popular classical linkages, such as single-linkage, average-linkage, and Ward's method, show high variability across real-world datasets and do not consistently produce high-quality clusterings. In this paper, we propose \emph{Chamfer-linkage}, a novel linkage function that measures the distance between clusters using the Chamfer distance, a popular notion of distance between point clouds in machine learning and computer vision. We argue that Chamfer-linkage satisfies desirable concept representation properties that other popular measures struggle to satisfy. Theoretically, we show that Chamfer-linkage HAC can be implemented in $O(n^2)$ time, matching the efficiency of classical linkage functions. Experimentally, we find that Chamfer-linkage consistently yields higher-quality clusterings than classical linkages such as average-linkage and Ward's method across a diverse collection of datasets. Our results establish Chamfer-linkage as a practical drop-in replacement for classical linkage functions, broadening the toolkit for hierarchical clustering in both theory and practice.
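To make the merge loop concrete, the following is a minimal sketch of HAC with a Chamfer-style linkage. It is not the $O(n^2)$ algorithm referred to in the abstract: it recomputes every inter-cluster distance at each step and is meant only for small inputs. The abstract does not state the exact form of the Chamfer distance used, so the sketch assumes one common symmetric variant (the mean nearest-neighbor distance in each direction, summed) with Euclidean point distances. The names `chamfer_distance` and `naive_chamfer_hac` are hypothetical and chosen for illustration.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (rows are points).

    Assumption: mean nearest-neighbor distance in each direction, summed.
    The paper's exact normalization may differ.
    """
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Nearest neighbor in b for each point of a, and vice versa.
    return d.min(axis=1).mean() + d.min(axis=0).mean()


def naive_chamfer_hac(points, num_clusters):
    """Greedily merge the closest pair of clusters until num_clusters remain.

    Returns a list of clusters, each a list of row indices into points.
    Naive illustration only: all pairwise cluster distances are
    recomputed at every merge step.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = chamfer_distance(points[clusters[i]], points[clusters[j]])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Merge cluster j into cluster i; j > i, so deleting j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
    print(naive_chamfer_hac(pts, 2))
```

Swapping `chamfer_distance` for another set-to-set distance (for example, the minimum pairwise distance for single-linkage) leaves the merge loop unchanged, which is the sense in which a linkage function acts as a drop-in component of HAC.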