ABCDE is a sophisticated technique for evaluating differences between very large clusterings. Its main metric for the magnitude of the difference between two clusterings is the JaccardDistance, which is a true distance metric in the space of all clusterings of a fixed set of (weighted) items. The JaccardIndex is the complementary metric that characterizes the similarity of two clusterings, and the two are related by a simple equation: JaccardDistance + JaccardIndex = 1. This paper decomposes the JaccardDistance and the JaccardIndex further. In each case, the decomposition yields Impact and Quality metrics. The Impact metrics measure aspects of the magnitude of the clustering diff, while the Quality metrics use human judgements to measure how much the clustering diff improves the quality of the clustering. The decompositions offer deeper insight into a clustering change and unlock new techniques for debugging and exploring the nature of the clustering diff. The new metrics are mathematically well-behaved and are interrelated via simple equations. While the work can be seen as an alternative formal framework for ABCDE, we prefer to view it as complementary: it offers a different perspective on the magnitude and the quality of a clustering change, and users can draw on whichever parts of each approach give them more insight into a change.
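As an illustration of the relationship JaccardDistance + JaccardIndex = 1, the following is a minimal Python sketch. It assumes the two metrics are weighted averages, over all items, of the per-item weighted Jaccard index and distance between the item's cluster in the base clustering and its cluster in the experiment clustering. That definition, the function names, and the toy data are assumptions made for illustration and are not taken from the abstract itself.

```python
def cluster_of(clustering):
    """Map every item to the frozenset of members of its cluster."""
    return {item: frozenset(cluster) for cluster in clustering for item in cluster}


def jaccard_metrics(base, exp, weight):
    """Return (jaccard_index, jaccard_distance) for two clusterings.

    Assumed definition: each item contributes the weighted Jaccard
    index/distance between its base cluster and its experiment cluster,
    and the contributions are averaged using the item weights.
    Weights are assumed to be strictly positive.
    """
    base_of, exp_of = cluster_of(base), cluster_of(exp)
    if base_of.keys() != exp_of.keys():
        raise ValueError("both clusterings must cover the same items")

    def w(items):
        return sum(weight[j] for j in items)

    total = w(base_of)  # total weight of all items
    index = distance = 0.0
    for i in base_of:
        b, e = base_of[i], exp_of[i]
        union = w(b | e)
        index += weight[i] * w(b & e) / union      # shared members
        distance += weight[i] * w(b ^ e) / union   # split-off and merged-in members
    return index / total, distance / total


# Toy example with unit weights (hypothetical data).
base = [{"a", "b"}, {"c", "d"}]
exp = [{"a", "b", "c"}, {"d"}]
weight = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}

index, distance = jaccard_metrics(base, exp, weight)
print(index, distance, index + distance)
```

The index and the distance are computed independently here, from the intersection and the symmetric difference respectively, so the fact that they sum to 1 is a check on the identity and is not built in by construction.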