In many applications, researchers seek to identify overlapping entities across multiple data files. In the absence of unique identifiers, record linkage algorithms facilitate this task. Because these algorithms rely on semi-identifying information, they may miss records that represent the same entity or incorrectly link records that represent different entities. Analyses of linked files commonly ignore such linkage errors, yielding biased or overly precise estimates of the associations of interest. We view record linkage as a missing data problem and delineate the linkage mechanisms that underpin analysis methods for linked files. Following the missing data literature, we group these methods into three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. We summarize the assumptions and limitations of each approach and evaluate their performance in a wide range of simulation scenarios.