Average-link is widely recognized as one of the most popular and effective methods for hierarchical agglomerative clustering. The available theoretical analyses show that it achieves a much better approximation than other popular heuristics, such as single-linkage and complete-linkage, with respect to variants of Dasgupta's cost function [STOC 2016]. However, these analyses do not separate average-link from a random hierarchy, and they are not appealing for metric spaces, since every hierarchical clustering is a 1/2-approximation with respect to the variant of Dasgupta's function that is used for dissimilarity measures [Moseley and Yang 2020]. In this paper, we present a comprehensive study of the performance of average-link in metric spaces with respect to several natural criteria that capture separability and cohesion and are more interpretable than Dasgupta's cost function and its variants. We also present experimental results on real datasets that, together with our theoretical analyses, suggest that average-link is a better choice than other related methods when both cohesion and separability are important goals.
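For readers unfamiliar with the method, the following is a minimal sketch of the standard textbook average-link rule: start from singleton clusters and repeatedly merge the two clusters whose mean pairwise distance is smallest. The abstract does not specify an implementation, so the function names `average_distance` and `average_link`, the Euclidean metric, and the toy points are all assumptions made for illustration. The sketch favors clarity over efficiency.

```python
from itertools import combinations
from math import dist  # Euclidean distance; the choice of metric is an assumption


def average_distance(a, b, points):
    """Mean pairwise distance between clusters a and b (tuples of point indices)."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def average_link(points):
    """Return the merges as (cluster_a, cluster_b, average_distance), in merge order."""
    clusters = [(i,) for i in range(len(points))]  # start from singletons
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], points),
        )
        merges.append((a, b, average_distance(a, b, points)))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy data: two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
merges = average_link(points)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

On this toy input the two tight pairs are merged first and the two resulting clusters are joined last, which is the kind of behavior the cohesion and separability criteria are meant to capture.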