The evaluation of clustering results is difficult and depends heavily on the data set and on the perspective of the beholder. Many clustering quality measures exist that try to provide a general measure for validating clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, provide two fast versions for its direct optimization, and discuss its use for choosing the optimal number of clusters. We combine ideas from the original Silhouette with the well-known PAM algorithm and its latest improvement, FasterPAM. One of the versions guarantees results equal to those of the original variant and provides a run time speedup of $O(k^2)$. In experiments on real data with 30000 samples and $k=100$, we observed a $10464\times$ speedup compared to the original PAMMEDSIL algorithm. Additionally, we provide a variant to choose the optimal number of clusters directly.
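To make the medoid-based Silhouette concrete, the following is a minimal sketch of how such a score could be evaluated for a fixed set of medoids. It assumes the common definition in which each point contributes $1 - d_1/d_2$, where $d_1$ and $d_2$ are its distances to the nearest and second-nearest medoid, and the score is the average over all points. That definition, the function name `medoid_silhouette`, and the handling of the $d_2 = 0$ case are assumptions made for illustration; this is not the optimization algorithm discussed above, only the objective it would maximize.

```python
import numpy as np


def medoid_silhouette(dist, medoids):
    """Average medoid Silhouette of a medoid set (illustrative sketch).

    dist    : (n, n) array of pairwise distances
    medoids : indices of k >= 2 points used as medoids
    """
    medoids = np.asarray(medoids)
    if medoids.size < 2:
        raise ValueError("at least two medoids are required")
    # Distances from every point to every medoid, shape (n, k).
    # Fancy indexing returns a copy, so sorting in place is safe.
    d = np.asarray(dist, dtype=float)[:, medoids]
    d.sort(axis=1)
    d1, d2 = d[:, 0], d[:, 1]  # nearest and second-nearest medoid
    # Assumption: if d2 == 0 (point coincides with two medoids),
    # the ratio is set to 1 so that the point contributes 0.
    ratio = np.divide(d1, d2, out=np.ones_like(d1), where=d2 > 0)
    return float(np.mean(1.0 - ratio))


# Toy data: two well-separated groups on a line.
x = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(x[:, None] - x[None, :])

good = medoid_silhouette(dist, [0, 2])  # one medoid per group
bad = medoid_silhouette(dist, [0, 1])   # both medoids in the same group
print(good, bad)
```

Because only the distances to the $k$ medoids are needed, one evaluation costs $O(nk)$ rather than the $O(n^2)$ of the original Silhouette, which is what makes this variant attractive as a direct optimization target.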