Entity resolution (also known as probabilistic record linkage or deduplication) is a key step in scientific analysis and data science pipelines that involve multiple data sources. Its objective is to link records that refer to the same entity (e.g., a person or a company) when no common unique identifier is available. Without such identifiers, comparing every pair of records is computationally expensive, so researchers must specify which records to compare before calculating matching probabilities. One solution is to block records deterministically on common variables, such as names, dates of birth or sex, or on codes produced by phonetic algorithms. However, this approach assumes that these variables are free of errors and completely observed, which is often not the case. To address this challenge, we have developed a Python package, BlockingPy, which performs blocking with modern approximate nearest neighbour search and graph algorithms to reduce the number of comparisons. The package supports both CPU and GPU execution. In this paper, we present the design of the package, its functionalities and two case studies related to official statistics. The software will be useful for researchers interested in linking data from various sources.
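The idea of blocking by nearest neighbour search followed by a graph step can be illustrated with a minimal sketch. This is not BlockingPy's API: the function names (`shingles`, `cosine`, `nearest_neighbour_blocks`) and the toy records are hypothetical, and an exact brute-force search stands in for the approximate nearest neighbour search that the package uses. Each record is encoded as a bag of character bigrams, linked to its most similar record, and the connected components of the resulting graph are taken as blocks.

```python
from collections import Counter
from math import sqrt


def shingles(text, n=2):
    """Encode a record string as a bag of character n-grams."""
    s = text.lower().replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram bags."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_neighbour_blocks(records, n=2):
    """Link each record to its nearest neighbour, then return one block id
    per record, where a block is a connected component of that graph.

    Assumption: exact brute-force search is used here for clarity; an
    approximate index would replace the inner search on real data.
    """
    vecs = [shingles(r, n) for r in records]
    parent = list(range(len(records)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, v in enumerate(vecs):
        others = (j for j in range(len(vecs)) if j != i)
        best = max(others, key=lambda j: cosine(v, vecs[j]), default=None)
        if best is not None:
            parent[find(i)] = find(best)  # merge the two components

    block_ids = {}
    return [block_ids.setdefault(find(i), len(block_ids)) for i in range(len(records))]


# Toy records with typos in the names (made up for illustration).
records = [
    "john smith 1980",
    "jon smith 1980",
    "maria kowalska 1975",
    "marya kowalska 1975",
]
blocks = nearest_neighbour_blocks(records)
print(blocks)
```

In this toy example, deterministic blocking on the exact first name would separate "john" from "jon" and "maria" from "marya", whereas the similarity-based blocks keep each misspelled pair together, so only 2 of the 6 possible record pairs need to be compared.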