Selecting the appropriate number of clusters is a critical step in applying clustering algorithms. To assist in this process, various cluster validity indices (CVIs) have been developed. These indices are designed to identify the optimal number of clusters within a dataset. However, users may not always seek the absolute optimal number of clusters but rather a secondary option that better aligns with their specific applications. This realization has led us to introduce a Bayesian cluster validity index (BCVI), which builds upon existing indices. The BCVI utilizes either Dirichlet or generalized Dirichlet priors, resulting in the same posterior distribution. We evaluate our BCVI using the Wiroonsri index for hard clustering and the Wiroonsri-Preedasawakul index for soft clustering as underlying indices. We compare the performance of our proposed BCVI with that of the original underlying indices and several other existing CVIs, including Davies-Bouldin, Starczewski, Xie-Beni, and KWON2 indices. Our BCVI offers clear advantages in situations where user expertise is valuable, allowing users to specify their desired range for the final number of clusters. To illustrate this, we conduct experiments classified into three different scenarios. Additionally, we showcase the practical applicability of our approach through real-world datasets, such as MRI brain tumor images. These tools will be published as a new R package 'BayesCVI'.
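The abstract does not state the BCVI formula, so the following is only a minimal sketch of the general idea under stated assumptions: values of an underlying index over the candidate numbers of clusters are normalized into proportions, combined with a Dirichlet prior that encodes the user's preferred range, and the candidates are ranked by posterior mean. The function name `dirichlet_posterior_mean`, the min-shift normalization, and the pseudo-sample size `n` are hypothetical choices made for illustration, not the definitions used in the paper or in the 'BayesCVI' package.

```python
import numpy as np

def dirichlet_posterior_mean(index_values, alpha, n=10.0, larger_is_better=True):
    """Illustrative Dirichlet update over candidate cluster counts.

    index_values: underlying CVI value for each candidate k.
    alpha: Dirichlet prior parameters, one per candidate k.
    n: assumed pseudo-sample size weighting the index against the prior.
    """
    v = np.asarray(index_values, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    if not larger_is_better:
        v = -v  # flip so that larger always means better
    shifted = v - v.min()
    total = shifted.sum()
    # Assumed normalization: proportions summing to 1 (uniform if all values tie).
    r = shifted / total if total > 0 else np.full_like(v, 1.0 / v.size)
    posterior = alpha + n * r  # conjugate Dirichlet update with pseudo-counts n * r
    return posterior / posterior.sum()  # posterior mean for each candidate k

# Made-up index values for k = 2..6; the raw index peaks at k = 3.
ks = [2, 3, 4, 5, 6]
index_values = [0.2, 0.9, 0.7, 0.3, 0.1]

# A flat prior versus a prior expressing a preference for k in {4, 5, 6}.
flat = dirichlet_posterior_mean(index_values, alpha=[1, 1, 1, 1, 1])
informed = dirichlet_posterior_mean(index_values, alpha=[1, 1, 10, 10, 10])

best_flat = ks[int(np.argmax(flat))]
best_informed = ks[int(np.argmax(informed))]
```

With the flat prior the ranking follows the underlying index, while the informed prior shifts the top choice into the user's preferred range whenever the index gives a reasonably strong value there. This is the "secondary option" behaviour the abstract describes, shown here on invented numbers only.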