Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition tasks. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in data clustering tasks. By identifying upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we find that MST methods can be very competitive overall. Next, instead of proposing yet another algorithm that performs well on a limited set of examples, we review, study, extend, and generalise the existing state-of-the-art MST-based partitioning schemes, which leads to a few new and interesting approaches. It turns out that the Genie method and the information-theoretic approaches often outperform non-MST algorithms such as k-means, Gaussian mixtures, spectral clustering, BIRCH, and classical hierarchical agglomerative procedures.
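To make the notion of an MST-based partitioning scheme concrete, the following is a minimal sketch of the simplest such scheme: build the Euclidean MST and remove its k-1 longest edges, which yields the same partition as single linkage. It is an illustration only, not the Genie method nor any of the approaches studied in the paper; the function names `mst_edges` and `mst_clusters` and the toy data are assumptions introduced here.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; O(n^2) time."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # distance from each point to the current tree
    best_from = [-1] * n         # tree vertex realising that distance
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point closest to the tree among those not yet included.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        # Update the distances of the remaining points to the tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges

def mst_clusters(points, k):
    """Drop the k-1 longest MST edges and label the connected components."""
    edges = sorted(mst_edges(points))
    kept = edges[:len(edges) - (k - 1)]
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, a, b in kept:
        parent[find(a)] = find(b)
    # Relabel component roots as 0, 1, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Hypothetical toy data: two well-separated groups of three points each.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = mst_clusters(toy, 2)
```

Cutting the longest edges is sensitive to outliers and noise points, which is one reason more elaborate MST-based schemes exist.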