Model selection is a major challenge in non-parametric clustering. There is no universally accepted way to evaluate clustering results, for the simple reason that no ground truth is available. The difficulty of finding a universal evaluation criterion is a consequence of the ill-defined objective of clustering. From this perspective, clustering stability has emerged as a natural and model-agnostic principle: an algorithm should find stable structures in the data. If data sets are repeatedly sampled from the same underlying distribution, an algorithm should find similar partitions. However, stability alone is not well suited to determining the number of clusters. For instance, it cannot detect whether the number of clusters is too small. We propose a new principle: a good clustering should be stable, and within each cluster, there should exist no stable partition. This principle leads to a novel clustering validation criterion based on between-cluster and within-cluster stability, overcoming limitations of previous stability-based methods. We empirically demonstrate the effectiveness of our criterion in selecting the number of clusters and compare it with existing methods. Code is available at https://github.com/FlorentF9/skstab.
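To make the principle concrete, the following is a minimal sketch of why stability alone can miss a number of clusters that is too small. It is not the criterion proposed here, nor the skstab API: the 1D toy data, the quantile-initialised k-means, the subsampling scheme, the Rand index as similarity measure, and all function names are assumptions chosen for illustration. The data consist of four tight groups arranged in two distant pairs, so K=2 is perfectly stable although each of the two clusters still contains a stable partition.

```python
import random
from itertools import combinations


def assign(x, centers):
    """Index of the center nearest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def kmeans_1d(points, k, n_iter=50):
    """Lloyd's algorithm in 1D, initialised at evenly spaced quantiles (assumption)."""
    ordered = sorted(points)
    centers = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for x in points:
            groups[assign(x, centers)].append(x)
        # Keep the old center if a group ends up empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, n_rounds=20, frac=0.8, seed=0):
    """Mean Rand index between the full-data partition and partitions
    obtained from models fitted on random subsamples."""
    rng = random.Random(seed)
    ref_centers = kmeans_1d(points, k)
    ref = [assign(x, ref_centers) for x in points]
    scores = []
    for _ in range(n_rounds):
        sample = rng.sample(points, int(frac * len(points)))
        centers = kmeans_1d(sample, k)
        scores.append(rand_index(ref, [assign(x, centers) for x in points]))
    return sum(scores) / len(scores)


def within_cluster_stability(points, k, k_sub=2):
    """Stability of a k_sub-partition inside each of the k clusters."""
    centers = kmeans_1d(points, k)
    labels = [assign(x, centers) for x in points]
    clusters = [[x for x, lab in zip(points, labels) if lab == j] for j in range(k)]
    return [stability(c, k_sub) for c in clusters]


# Toy data (assumption): four tight groups forming two well-separated pairs.
data_rng = random.Random(42)
data = [data_rng.gauss(mu, 0.05) for mu in (0.0, 1.0, 10.0, 11.0) for _ in range(30)]

between = stability(data, k=2)
within = within_cluster_stability(data, k=2)
print("between-cluster stability, K=2:", between)
print("within-cluster stability, K=2:", within)
```

On this data both quantities equal 1.0: the K=2 partition is fully stable, yet each cluster still splits stably in two, which is the signal that K=2 is too small. The sketch only covers this too-few-clusters case and makes no attempt to reproduce the criterion itself.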