Data sets obtained from linking multiple files are frequently affected by mismatch error, as a result of non-unique or noisy identifiers used during record linkage. Accounting for such mismatch error in downstream analysis performed on the linked file is critical to ensure valid statistical inference. In this paper, we present a general framework to enable valid post-linkage inference in the challenging secondary analysis setting in which only the linked file is given. The proposed framework covers a wide selection of statistical models and can flexibly incorporate additional information about the underlying record linkage process. Specifically, we propose a mixture model for pairs of linked records whose two components reflect distributions conditional on match status, i.e., correct match or mismatch. Regarding inference, we develop a method based on composite likelihood and the EM algorithm as well as an extension towards a fully Bayesian approach. Extensive simulations and several case studies involving contemporary record linkage applications corroborate the effectiveness of our framework.
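To make the mixture formulation concrete, the following minimal sketch fits a two-component model by EM on simulated data. It is an illustration under stated assumptions, not the method as implemented in the paper: correctly matched pairs are assumed to follow a Gaussian linear regression, mismatched pairs are assumed to have a response that is independent of the covariates (so their density is the marginal of the response, approximated here by a fixed Gaussian), and the mismatch rate is assumed constant. The function names, the simulated data, and the use of a plain likelihood instead of a more general composite likelihood are all hypothetical choices.

```python
import numpy as np


def norm_pdf(y, mean, sd):
    """Gaussian density, written out to avoid a SciPy dependency."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def em_mismatch_regression(X, y, n_iter=300, alpha0=0.1):
    """EM for y_i ~ (1 - alpha) * N(x_i' beta, sigma^2) + alpha * g(y_i).

    Assumption: g is the marginal density of y, fixed in advance as a
    Gaussian fitted to all responses (mismatch makes y independent of x).
    """
    g = norm_pdf(y, y.mean(), y.std())
    # Initialise with the naive least-squares fit that ignores mismatches.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma = np.std(y - X @ beta)
    alpha = alpha0
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a correct match.
        f = norm_pdf(y, X @ beta, sigma)
        w = (1.0 - alpha) * f / ((1.0 - alpha) * f + alpha * g)
        # M-step: weighted least squares, weighted variance, mismatch rate.
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        resid = y - X @ beta
        sigma = np.sqrt(np.sum(w * resid ** 2) / np.sum(w))
        alpha = 1.0 - w.mean()
    return beta, sigma, alpha, w


# Simulated linked file (hypothetical): true intercept 1, true slope 2,
# and 20% of the responses shuffled among themselves to mimic mismatches.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)
idx = rng.choice(n, size=n // 5, replace=False)
y[idx] = y[rng.permutation(idx)]
X = np.column_stack([np.ones(n), x])

beta_naive = np.linalg.lstsq(X, y, rcond=None)[0]
beta_em, sigma_em, alpha_em, match_prob = em_mismatch_regression(X, y)
print("naive slope:", beta_naive[1])
print("EM slope:", beta_em[1], "estimated mismatch rate:", alpha_em)
```

In this sketch the naive slope is attenuated toward zero because shuffled responses carry no information about the covariate, whereas the E-step down-weights pairs that look like mismatches before the regression is refitted. Additional information about the linkage process could enter through a pair-specific `alpha`, which is left out here for brevity.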