Choosing the optimal number of clusters is one of the main concerns in cluster analysis. Several cluster validity indexes have been introduced to address this problem. In some situations, however, more than one number of clusters is a reasonable final choice, an aspect that most existing work in this area has overlooked. In this study, we introduce a correlation-based fuzzy cluster validity index, the Wiroonsri-Preedasawakul (WP) index. The index is defined through the correlation between the actual distance between a pair of data points and the distance between the adjusted centroids associated with that pair. We evaluate the index and compare its performance with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. The evaluation uses the fuzzy c-means algorithm on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets. Overall, the WP index outperforms most, if not all, of these indexes in accurately detecting the optimal number of clusters and in providing accurate secondary options. Moreover, the index remains effective even when the fuzziness parameter $m$ is set to a large value. The R package UniversalCVI used in this work is available at https://CRAN.R-project.org/package=UniversalCVI.
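The correlation idea behind the index can be illustrated with a minimal Python sketch. It is not the WP index as defined in the paper, nor the UniversalCVI interface: the function names are hypothetical, and the form of the adjusted centroid (a membership-weighted average of the cluster centroids for each point) is an assumption made here for illustration only. The sketch computes the Pearson correlation between pairwise point distances and the distances between the corresponding adjusted centroids.

```python
import math
from itertools import combinations


def adjusted_centroids(memberships, centroids):
    """Assumed form: each point's adjusted centroid is the
    membership-weighted average of the cluster centroids."""
    dim = len(centroids[0])
    adjusted = []
    for row in memberships:
        total = sum(row)  # normalise in case the row does not sum to 1
        adjusted.append([
            sum(u * c[d] for u, c in zip(row, centroids)) / total
            for d in range(dim)
        ])
    return adjusted


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_score(points, memberships, centroids):
    """Correlation between pairwise point distances and the distances
    between the points' adjusted centroids (hypothetical helper)."""
    adjusted = adjusted_centroids(memberships, centroids)
    pairs = list(combinations(range(len(points)), 2))
    point_dists = [math.dist(points[i], points[j]) for i, j in pairs]
    centroid_dists = [math.dist(adjusted[i], adjusted[j]) for i, j in pairs]
    return pearson(point_dists, centroid_dists)


# Toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centroids = [(0.0, 0.5), (10.0, 0.5)]

# Memberships that match the visible grouping.
good_memberships = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
# Memberships that carry no grouping information at all.
flat_memberships = [[0.5, 0.5]] * 4

good_score = correlation_score(points, good_memberships, centroids)
flat_score = correlation_score(points, flat_memberships, centroids)
print(good_score, flat_score)
```

In this toy case the memberships that match the grouping give a score close to 1, while the uninformative memberships collapse every adjusted centroid to the same location and the score falls back to 0. Comparing such a score across candidate numbers of clusters is the general pattern a correlation-based validity index follows; the actual WP index should be taken from the paper or computed with the UniversalCVI package.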