This paper describes a new process and software system, the Case Count Metric System (CCMS), for systematically comparing and analyzing the outcomes of two different entity resolution (ER) clustering processes acting on the same dataset when the true linking (labeling) is not known. The CCMS produces a set of counts that describe how the clusters produced by the first process are transformed by the second process, based on four possible transformation scenarios: a cluster formed in the first process either remains unchanged, merges into a larger cluster, is partitioned into smaller clusters, or otherwise overlaps with multiple clusters formed in the second process. The CCMS produces a count for each of these cases, accounting for every cluster formed in the first process. In addition, when run in analysis mode, the CCMS program can assist the user in evaluating these changes by displaying the details for all changes or only for certain types of changes. The paper includes a detailed description of the CCMS process and program, and examples of how the CCMS has been applied in university and industry research.
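The four transformation cases can be illustrated with a minimal sketch. This is not the CCMS program itself: the function name `case_counts`, the input format (a mapping from record ID to cluster label for each process), and the exact boundary between cases are assumptions made for illustration, based only on the description above. In particular, "merged" is read here as the first-process cluster being a proper subset of a single second-process cluster, and "partitioned" as the cluster being split exactly into two or more second-process clusters with no outside records.

```python
def group_by_label(assignment):
    """Turn {record_id: cluster_label} into {cluster_label: set of record_ids}."""
    clusters = {}
    for record_id, label in assignment.items():
        clusters.setdefault(label, set()).add(record_id)
    return clusters


def case_counts(first, second):
    """Count how each first-process cluster is transformed by the second process.

    `first` and `second` are hypothetical inputs: dicts mapping every record ID
    in the same dataset to the cluster label assigned by each process.
    """
    if first.keys() != second.keys():
        raise ValueError("both processes must cluster the same set of records")

    first_clusters = group_by_label(first)
    second_clusters = group_by_label(second)
    counts = {"unchanged": 0, "merged": 0, "partitioned": 0, "overlapping": 0}

    for members in first_clusters.values():
        # Labels of all second-process clusters that contain a member record.
        touched = {second[record_id] for record_id in members}
        if len(touched) == 1:
            target = second_clusters[next(iter(touched))]
            case = "unchanged" if target == members else "merged"
        elif all(second_clusters[label] <= members for label in touched):
            # Every touched cluster lies entirely inside this one: a clean split.
            case = "partitioned"
        else:
            case = "overlapping"
        counts[case] += 1

    return counts


# Small made-up example: six first-process clusters (A-F) over nine records.
first = {"r1": "A", "r2": "A", "r3": "B", "r4": "C", "r5": "D",
         "r6": "D", "r7": "E", "r8": "E", "r9": "F"}
second = {"r1": "X", "r2": "X", "r3": "Y", "r4": "Y", "r5": "Z",
          "r6": "W", "r7": "V", "r8": "U", "r9": "U"}

result = case_counts(first, second)
print(result)
```

In this made-up example, cluster A is unchanged; B, C, and F each merge into a larger cluster; D is partitioned; and E overlaps, because one of the clusters it touches also contains a record from F. The four counts sum to six, the number of first-process clusters, which matches the property that every first-process cluster is accounted for.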