Relative Validity Indices (RVIs) such as the Silhouette Width Criterion, Calinski-Harabasz and Davies-Bouldin indices are the most popular tools for evaluating and optimising applications of clustering. Their ability to rank collections of candidate partitions has been used to guide the selection of the number of clusters, and to compare partitions from different clustering algorithms. Beyond these more conventional tasks, many examples can be found in the literature where RVIs have been used to compare and select other aspects of clustering approaches, such as data normalisation procedures, data representation methods, and distance measures. We are not aware of any studies that have attempted to establish the suitability of RVIs for such comparisons. Moreover, because these aspects change the pairwise similarities on which RVIs are computed, it is not even immediately obvious how RVIs should be implemented when comparing them. In this study, we conducted experiments with seven common RVIs on over 2.7 million clustering partitions for both synthetic and real-world datasets, encompassing feature-vector and time-series data. Our findings suggest that RVIs are not well-suited to these unconventional tasks, and that conclusions drawn from such applications may be misleading. We recommend that normalisation procedures, representation methods, and distance measures instead be selected using external validation on high-quality labelled datasets or carefully designed outcome-oriented objective criteria, both of which should be informed by relevant domain knowledge and clustering aims.
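To illustrate why comparing normalisation procedures with an RVI is not straightforward, the following minimal sketch scores one fixed partition with the mean silhouette width, first on raw features and then on min-max scaled features. It is an illustration only and not the experimental protocol of this study: the toy data, the Euclidean distance, and the helper names `silhouette_width` and `min_max_scale` are all assumptions introduced here.

```python
import math


def silhouette_width(points, labels):
    """Mean silhouette width of a partition under Euclidean distance."""
    n = len(points)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        # Mean distance from point i to every cluster, excluding i itself.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                dists = [math.dist(points[i], points[j]) for j in members]
                mean_dist[c] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:
            # Singleton cluster: its silhouette is conventionally 0.
            continue
        a = mean_dist[own]  # cohesion: mean distance within own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other cluster
        total += (b - a) / max(a, b)
    return total / n


def min_max_scale(points):
    """Rescale every feature to [0, 1]; constant features are left at 0."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / ((hi - lo) or 1.0) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]


# Hypothetical toy data: the clusters differ on the second feature,
# but the first feature has a far larger numeric range.
points = [(0, 0.0), (100, 0.1), (200, 0.0), (0, 1.0), (100, 1.1), (200, 1.0)]
labels = [0, 0, 0, 1, 1, 1]  # the partition is held fixed throughout

raw_score = silhouette_width(points, labels)
scaled_score = silhouette_width(min_max_scale(points), labels)
print(f"raw features:    {raw_score:.3f}")
print(f"min-max scaled:  {scaled_score:.3f}")
```

The partition is identical in both calls, yet the score changes sign, because the normalisation alters the pairwise distances from which the index is computed. A higher value after scaling therefore does not by itself show that the scaling produced a better clustering.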