Clustering algorithms are an essential part of the unsupervised data science ecosystem, and extrinsic evaluation of a clustering algorithm requires a method for comparing the detected clustering to a ground truth clustering. In a general setting, the detected and ground truth clusterings may have outliers (objects belonging to no cluster), overlapping clusters (objects belonging to more than one cluster), or both, but methods for comparing such clusterings are currently undeveloped. In this note, we define a pragmatic similarity measure for comparing clusterings with overlaps and outliers, show that it has several desirable properties, and experimentally confirm that it is not subject to several common biases afflicting other clustering comparison measures.