Whether the class labels in a given data set correspond to meaningful clusters is crucial when clustering algorithms are evaluated on real-world data sets. This property can be quantified by separability measures. A review of the existing literature shows that neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate the two central aspects of separability for density-based clustering: between-class separation and within-class connectedness. A newly developed measure, the density cluster separability index (DCSI), aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN as measured by the adjusted Rand index (ARI), but that it lacks robustness on multi-class data sets with overlapping classes, which are ill-suited for density-based hard clustering. A detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not form meaningful clusters.
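For readers unfamiliar with the ARI, the following is a minimal sketch of how the agreement between class labels and a clustering result can be computed from the pair-counting contingency table. It illustrates the standard ARI formula only and is not the evaluation code used in the experiments; the function name adjusted_rand_index and the toy label vectors are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Illustrative ARI between two labelings (hypothetical helper)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must have the same length")

    n_pairs = comb(len(labels_true), 2)
    if n_pairs == 0:
        return 1.0  # fewer than two points: labelings agree trivially

    # Contingency counts: cells n_ij, row sums a_i, column sums b_j.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of point pairs that share a cell, a class, or a cluster.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / n_pairs   # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)      # largest attainable agreement
    if maximum == expected:
        # Only happens when both labelings are one single group
        # or both consist of singletons, i.e. they are identical.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Toy example: class labels versus two hypothetical clustering results.
classes = [0, 0, 1, 1]
print(adjusted_rand_index(classes, [1, 1, 0, 0]))  # same partition, renamed
print(adjusted_rand_index(classes, [0, 1, 0, 1]))  # partition cuts across classes
```

The ARI is invariant to how cluster labels are named, so only the induced partition matters. In this sketch any noise label produced by a density-based method is treated as one additional group, which is a simplifying assumption.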