Probabilistic record linkage is often used to match records from two files, particularly when the variables common to both files comprise imperfectly measured identifiers such as names and demographic variables. We consider bipartite record linkage settings in which each entity appears at most once within a file, i.e., there are no duplicates within the files, but some entities appear in both files. In this setting, the analyst desires a point estimate of the linkage structure that matches each record to at most one record from the other file. We propose an approach for obtaining this point estimate by maximizing the expected $F$-score for the linkage structure. The approach is designed for record linkage methods that produce either an (approximate) posterior distribution of the unknown linkage structure or probabilities of matches for record pairs. Using simulations and applications with genuine data, we illustrate that the $F$-score estimators can lead to sensible estimates of the linkage structure.
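To make the idea of an expected $F$-score concrete, the following Python sketch estimates it by Monte Carlo averaging over posterior draws of the linkage structure. It is an illustration under stated assumptions, not the estimator developed in the paper: the function names (`f_score`, `expected_f_score`, `best_sampled_estimate`) are hypothetical, each linkage structure is encoded as a vector whose entry `i` is the index of the file-2 record linked to file-1 record `i` (or `-1` for no link), the draws are assumed to already be valid bipartite matchings, the $F$-score is taken to be the balanced $F_1$ with the convention that two empty structures score 1, and the search is restricted to the sampled structures themselves rather than all possible matchings.

```python
import numpy as np


def f_score(est, truth):
    """F1 agreement between two linkage vectors (entry i = linked file-2 index, or -1)."""
    est = np.asarray(est)
    truth = np.asarray(truth)
    n_est = int(np.sum(est >= 0))      # number of links declared by the estimate
    n_true = int(np.sum(truth >= 0))   # number of links in the reference structure
    if n_est + n_true == 0:
        return 1.0                     # assumed convention: two empty structures agree
    tp = int(np.sum((est >= 0) & (est == truth)))  # links present in both
    # F1 = 2 * precision * recall / (precision + recall) = 2 * TP / (|est| + |truth|)
    return 2.0 * tp / (n_est + n_true)


def expected_f_score(est, samples):
    """Monte Carlo estimate of the posterior expected F1 of `est`."""
    return float(np.mean([f_score(est, s) for s in samples]))


def best_sampled_estimate(samples):
    """Pick the sampled structure with the largest estimated expected F1.

    Simplification for illustration: only structures that occur among the
    draws are considered as candidates.
    """
    candidates = sorted({tuple(int(v) for v in s) for s in samples})
    return max(candidates, key=lambda c: expected_f_score(c, samples))


if __name__ == "__main__":
    # Four hypothetical posterior draws for a file with three records.
    draws = [[0, 1, -1], [0, 1, -1], [0, -1, -1], [0, 1, 2]]
    estimate = best_sampled_estimate(draws)
    print(estimate, expected_f_score(estimate, draws))
```

Restricting the candidates to sampled structures keeps the sketch short; a search over all matchings, or one driven by pairwise match probabilities, would need a dedicated optimization step that this example does not attempt.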