Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.
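To make the idea of MST-based partitioning concrete, below is a minimal sketch of only the simplest member of this family (equivalent to single-linkage clustering with k groups): build the Euclidean MST of the point set and remove its k-1 heaviest edges, so that the surviving connected components define the clusters. This is not the Genie or the information-theoretic scheme studied in the paper; the helper name mst_cut_clustering and the reliance on scipy.sparse.csgraph are illustrative assumptions.

```python
# A minimal sketch (assumption: SciPy is available) of the classic MST-cut
# baseline: compute the Euclidean MST and delete the k-1 heaviest edges,
# so the remaining connected components form k clusters.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform


def mst_cut_clustering(X, k):
    """Partition the rows of X into k clusters by cutting the k-1 longest MST edges."""
    d = squareform(pdist(X))                      # dense pairwise distance matrix
    mst = minimum_spanning_tree(d).toarray()      # n-1 nonzero entries = MST edges
    rows, cols = np.nonzero(mst)
    heaviest = np.argsort(mst[rows, cols])[::-1][: k - 1]
    mst[rows[heaviest], cols[heaviest]] = 0.0     # cut the k-1 heaviest edges
    # Connected components of the remaining forest give the cluster labels.
    n_comp, labels = connected_components(mst, directed=False)
    assert n_comp == k
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three well-separated Gaussian blobs of 50 points each in 2D.
    X = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 1.0, 2.0)])
    print(np.bincount(mst_cut_clustering(X, 3)))  # expected: [50 50 50]
```

Cutting edges by weight alone is known to be fragile in the presence of noise points that chain otherwise distinct groups together, which is one of the weaknesses that the more refined MST-based schemes reviewed in the paper are designed to mitigate.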