The silhouette coefficient quantifies, for each observation, the balance between within-cluster cohesion and between-cluster separation, taking values in the range [-1, 1]. The average silhouette width (ASW) is a widely used internal measure of clustering quality, with higher values indicating more cohesive and well-separated clusters. However, the dataset-specific maximum of ASW is typically unknown, and the standard upper limit of 1 is rarely attainable. In this work, we derive for each data point a sharp upper bound on its silhouette width and aggregate these to obtain a canonical upper bound on the ASW. This bound, often substantially below 1, enhances the interpretability of empirical ASW values by indicating how close a given clustering result is to the best possible outcome for that dataset. We evaluate the usefulness of the upper bound on a variety of datasets and conclude that it can meaningfully enrich cluster quality evaluation; however, its practical relevance depends on the specific dataset. Finally, we extend the framework to establish an upper bound on the macro-averaged silhouette.
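For readers unfamiliar with the quantities named above, the following is a minimal sketch of the standard silhouette width, the ASW, and a macro-averaged silhouette, assuming Euclidean distances and at least two clusters. The function names are illustrative and not taken from the paper, and the macro average is assumed here to be the mean of per-cluster mean silhouettes. The sketch does not implement the upper bound derived in this work.

```python
import numpy as np


def silhouette_widths(X, labels):
    """Per-point silhouette widths s(i) = (b - a) / max(a, b).

    a: mean distance from point i to the other points of its own cluster.
    b: smallest mean distance from point i to the points of any other cluster.
    Assumes Euclidean distance and at least two distinct cluster labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix (fine for small datasets).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # D[i, i] = 0, so dividing by n_own - 1 excludes the point itself.
        a = D[i, own].sum() / (n_own - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s


def average_silhouette_width(s):
    """ASW: plain mean over all observations."""
    return float(np.mean(s))


def macro_averaged_silhouette(s, labels):
    """Mean of per-cluster mean silhouettes (assumed definition)."""
    labels = np.asarray(labels)
    return float(np.mean([s[labels == c].mean() for c in np.unique(labels)]))


# Toy data: two well-separated clusters on a line.
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
s = silhouette_widths(X, labels)
asw = average_silhouette_width(s)
macro = macro_averaged_silhouette(s, labels)
```

Even on this cleanly separated toy example the ASW stays below 1, because every point has a nonzero distance to its cluster mate; this is the kind of gap that a dataset-specific upper bound is meant to put in context.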