Probabilistic record linkage is often used to match records from two files, particularly when the variables common to both files are imperfectly measured identifiers such as names and demographic variables. We consider bipartite record linkage settings in which each entity appears at most once within a file, i.e., there are no duplicates within the files, but some entities appear in both files. In this setting, the analyst desires a point estimate of the linkage structure that matches each record to at most one record from the other file. We propose an approach for obtaining this point estimate by maximizing the expected $F$-score for the linkage structure. The approach is designed for record linkage methods that produce either an (approximate) posterior distribution of the unknown linkage structure or match probabilities for record pairs. Using simulations and applications with genuine data, we illustrate that the $F$-score estimators can lead to sensible estimates of the linkage structure.
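To make the idea of maximizing the expected $F$-score concrete, the following is a minimal sketch and not the estimation procedure proposed in the paper. It assumes that posterior draws of the linkage structure are available as sets of (record in file 1, record in file 2) index pairs, and it restricts the search to a small list of candidate linkages (here, the draws themselves). The function names `f_score`, `expected_f_score`, and `best_candidate`, the toy draws, and the convention that two empty linkages have an $F$-score of 1 are all assumptions made for illustration.

```python
from typing import FrozenSet, Sequence, Tuple

Link = Tuple[int, int]        # (index in file 1, index in file 2)
Linkage = FrozenSet[Link]     # a bipartite linkage as a set of links


def f_score(estimate: Linkage, truth: Linkage) -> float:
    """F-score (harmonic mean of precision and recall) of an estimated linkage."""
    if not estimate and not truth:
        return 1.0  # assumed convention: two empty linkages agree perfectly
    return 2.0 * len(estimate & truth) / (len(estimate) + len(truth))


def expected_f_score(estimate: Linkage, samples: Sequence[Linkage]) -> float:
    """Monte Carlo approximation of the posterior expected F-score."""
    return sum(f_score(estimate, s) for s in samples) / len(samples)


def best_candidate(candidates: Sequence[Linkage],
                   samples: Sequence[Linkage]) -> Linkage:
    """Return the candidate linkage with the largest expected F-score."""
    return max(candidates, key=lambda c: expected_f_score(c, samples))


# Hypothetical posterior draws for two small files (toy data, not from the paper).
samples = [
    frozenset({(0, 0), (1, 1)}),
    frozenset({(0, 0)}),
    frozenset({(0, 0), (1, 2)}),
]

# Search only over the sampled linkages; each one already satisfies the
# one-to-one constraint, so the selected estimate does as well.
estimate = best_candidate(samples, samples)
print(sorted(estimate), round(expected_f_score(estimate, samples), 4))
```

In this toy example the link (0, 0) appears in every draw while the second record's match is uncertain, so the linkage containing only (0, 0) attains the highest expected $F$-score among the candidates. Searching over the sampled linkages is a simplification chosen here for brevity; it need not find the global maximizer over all one-to-one linkages.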