Determining the optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. In some situations, however, more than one value is a reasonable choice for the final number of clusters, an aspect that most existing work in this area has overlooked. In this study, we introduce a correlation-based fuzzy cluster validity index, the Wiroonsri-Preedasawakul (WP) index. The index is defined based on the correlation between the actual distance between a pair of data points and the distance between the adjusted centroids with respect to that pair. We evaluate the index and compare its performance with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. The evaluation uses the fuzzy c-means algorithm on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets. Overall, the WP index outperforms most, if not all, of these indexes in accurately detecting the optimal number of clusters and in providing accurate secondary options. Moreover, the index remains effective even when the fuzziness parameter $m$ is set to a large value. The R package UniversalCVI used in this work is available at https://CRAN.R-project.org/package=UniversalCVI.
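The general workflow described above (run fuzzy c-means for several candidate numbers of clusters, score each result with a validity index, then rank the candidates to obtain a best choice and secondary options) can be illustrated with a minimal sketch. Because the abstract does not give the WP formula, the sketch uses the Xie-Beni index, one of the baselines named above, in its commonly stated form with exponent $m$. All function names, the farthest-point initialization, and the toy data are assumptions made for illustration; this is not the UniversalCVI API and not the authors' implementation.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def maximin_init(points, c):
    """Assumed initialization: start from the first point, then repeatedly
    add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < c:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, v) for v in centers))
        )
    return centers


def memberships(points, centers, m):
    """Standard fuzzy c-means membership update for fuzziness parameter m > 1."""
    u = []
    for p in points:
        # Clamp to avoid division by zero when a point sits on a center.
        d2 = [max(sq_dist(p, v), 1e-12) for v in centers]
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)
        u.append([x / total for x in inv])
    return u


def fuzzy_c_means(points, c, m=2.0, iters=100):
    """Plain fuzzy c-means with a fixed number of iterations."""
    n, dim = len(points), len(points[0])
    centers = maximin_init(points, c)
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for i in range(c):
            w = [u[j][i] ** m for j in range(n)]
            total = sum(w)
            new_centers.append(tuple(
                sum(w[j] * points[j][d] for j in range(n)) / total
                for d in range(dim)
            ))
        centers = new_centers
    return centers, memberships(points, centers, m)


def xie_beni(points, centers, u, m=2.0):
    """Xie-Beni index: fuzzy compactness over n times the minimum squared
    distance between centers. Smaller values indicate a better partition."""
    n = len(points)
    compact = sum(
        u[j][i] ** m * sq_dist(points[j], centers[i])
        for j in range(n) for i in range(len(centers))
    )
    sep = min(
        sq_dist(a, b)
        for k, a in enumerate(centers) for b in centers[k + 1:]
    )
    return math.inf if sep == 0 else compact / (n * sep)


# Toy data (illustrative only): three small, well-separated groups.
offsets = [(0, 0), (0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]
points = [
    (cx + dx, cy + dy)
    for cx, cy in [(0, 0), (10, 0), (5, 9)]
    for dx, dy in offsets
]

scores = {}
for k in range(2, 6):
    centers, u = fuzzy_c_means(points, k)
    scores[k] = xie_beni(points, centers, u)

# Candidates ordered from best to worst: the first entry is the primary
# choice, the following entries play the role of secondary options.
ranked = sorted(scores, key=scores.get)
print("ranking of candidate cluster numbers:", ranked)
```

On this toy data, 3 should rank first, and the remaining entries of `ranked` are the secondary options. Swapping `xie_beni` for another index only requires a function with the same arguments, plus a reversed sort for indexes where larger values are better.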