Internal measures for assessing the quality of a clustering usually take into account intra-group and/or inter-group criteria. Many papers in the literature propose algorithms with provable approximation guarantees for optimizing the former, whereas the optimization of inter-group criteria is much less understood. Here, we advance the state of the art by devising algorithms with provable guarantees for the maximization of two natural inter-group criteria, namely the minimum spacing and the minimum spanning tree spacing. The former is the minimum distance between points in different groups, while the latter captures separability through the cost of the minimum spanning tree that connects all groups. We obtain results both for the unrestricted case, in which no constraint is imposed on the clusters, and for the constrained case, in which each group is required to have a minimum number of points. This constraint is motivated by the fact that the popular Single Linkage method, which optimizes both criteria in the unrestricted case, produces clusterings with many tiny groups. To complement our work, we present an empirical study with 10 real datasets, providing evidence that our methods perform very well in practical settings.
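To make the two criteria concrete, the following minimal Python sketch evaluates them for a given clustering. It is only an illustration under stated assumptions, not the algorithms proposed here: points are coordinate tuples compared by Euclidean distance, the weight of the edge between two groups is taken to be the smallest distance between their points, and the function names (`group_spacing`, `min_spacing`, `mst_spacing`) are hypothetical.

```python
from itertools import combinations
from math import dist


def group_spacing(points, labels, a, b):
    """Smallest Euclidean distance between a point in group a and one in group b."""
    return min(
        dist(p, q)
        for p, lp in zip(points, labels) if lp == a
        for q, lq in zip(points, labels) if lq == b
    )


def min_spacing(points, labels):
    """Minimum distance between any two points that lie in different groups."""
    groups = sorted(set(labels))
    return min(group_spacing(points, labels, a, b) for a, b in combinations(groups, 2))


def mst_spacing(points, labels):
    """Cost of a minimum spanning tree over the groups (Prim's algorithm).

    Assumption: the edge weight between two groups is their pairwise spacing.
    """
    groups = sorted(set(labels))
    weight = {
        frozenset((a, b)): group_spacing(points, labels, a, b)
        for a, b in combinations(groups, 2)
    }
    in_tree = {groups[0]}
    total = 0.0
    while len(in_tree) < len(groups):
        # Pick the cheapest edge leaving the current tree.
        cost, nxt = min(
            (weight[frozenset((a, b))], b)
            for a in in_tree
            for b in groups if b not in in_tree
        )
        in_tree.add(nxt)
        total += cost
    return total


# Toy example: three groups placed along the x-axis.
points = [(0, 0), (0, 1), (3, 0), (3, 1), (10, 0)]
labels = [0, 0, 1, 1, 2]
print(min_spacing(points, labels), mst_spacing(points, labels))
```

In this toy example the third group is a single point, which is the kind of tiny group that a minimum-size constraint is meant to rule out.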