In many applications, researchers seek to identify overlapping entities across multiple data files. In the absence of unique identifiers, record linkage algorithms facilitate this task. Because these algorithms rely on semi-identifying information, they may fail to link records that represent the same entity, or incorrectly link records that represent different entities. Analyses of linked files commonly ignore such linkage errors, resulting in biased or overly precise estimates of the associations of interest. We view record linkage as a missing data problem and delineate the linkage mechanisms that underpin analysis methods for linked files. Following the missing data literature, we group these methods into three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. We summarize the assumptions and limitations of these methods and evaluate their performance in a wide range of simulation scenarios.
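The claim that ignored linkage errors bias association estimates can be illustrated with a minimal simulation. This sketch is not from the paper; it assumes a simple linear association between a variable in file A and a variable in file B, and models linkage error as a random permutation of a fraction of the links. Mislinked pairs carry no information about the true association, so the naive slope estimate is attenuated toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# File A holds x; file B holds y = 2*x + noise for the same entities.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def ols_slope(x, y):
    # Ordinary least-squares slope of y on x.
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Perfect linkage recovers the true association (slope near 2).
b_true = ols_slope(x, y)

# Hypothetical 20% linkage error rate: a random subset of records in B
# is linked to the wrong record in A (modeled here as a permutation).
err_idx = np.where(rng.random(n) < 0.2)[0]
y_linked = y.copy()
y_linked[err_idx] = y[rng.permutation(err_idx)]

# Mislinked pairs behave like noise, attenuating the estimate toward zero.
b_linked = ols_slope(x, y_linked)
print(f"true slope ~ {b_true:.2f}, slope with linkage errors ~ {b_linked:.2f}")
```

With roughly 20% of links broken, the expected attenuation factor is about 0.8, so the naive analysis understates the association even though each correctly linked pair is measured without error.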