Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. The central aspects of separability for density-based clustering are between-class separation and within-class connectedness, and neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate them. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not correspond to meaningful density-based clusters.
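The abstract measures DBSCAN performance by the adjusted Rand index (ARI) between class labels and cluster assignments. As a minimal sketch of that evaluation step only, the following pure-Python function computes the ARI from two label sequences using the standard pair-counting formula. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions, not part of the original work, and DCSI itself is not implemented here because its definition is not given in this section.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same points.

    Hypothetical helper for illustration. Every distinct label value is
    treated as its own group, so a DBSCAN noise label such as -1 would
    count as one additional cluster (an assumption of this sketch).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two points: the labelings trivially agree.
        return 1.0

    # Contingency counts n_ij and the marginals a_i (classes), b_j (clusters).
    contingency = Counter(zip(labels_true, labels_pred))
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)

    # Number of point pairs that share a cell, a class, or a cluster.
    pairs_same_cell = sum(comb(c, 2) for c in contingency.values())
    pairs_same_class = sum(comb(c, 2) for c in class_sizes.values())
    pairs_same_cluster = sum(comb(c, 2) for c in cluster_sizes.values())

    # Expected agreement under random labelings with fixed marginals.
    expected = pairs_same_class * pairs_same_cluster / total_pairs
    maximum = (pairs_same_class + pairs_same_cluster) / 2

    if maximum == expected:
        # Degenerate case, e.g. both labelings put all points in one group.
        return 1.0
    return (pairs_same_cell - expected) / (maximum - expected)


if __name__ == "__main__":
    classes = [0, 0, 1, 1]
    print(adjusted_rand_index(classes, [1, 1, 0, 0]))  # same partition, renamed labels
    print(adjusted_rand_index(classes, [0, 0, 1, 2]))  # one class split in two
    print(adjusted_rand_index(classes, [0, 1, 0, 1]))  # clusters cut across classes
```

The ARI is invariant to how clusters are named, equals 1 for identical partitions, and is close to 0 (or negative) when agreement is no better than chance, which is why it is suited to comparing class labels with clustering output.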