BlockingPy: approximate nearest neighbours for blocking of records for entity resolution

Entity resolution (probabilistic record linkage, deduplication) is a key step in scientific analysis and data science pipelines involving multiple data sources. Its objective is to link records that refer to the same entity (e.g., a person or a company) but lack common unique identifiers. Without such identifiers, researchers need to specify which records to compare, both to calculate matching probabilities and to reduce computational complexity. One solution is to block records deterministically on common variables such as names, dates of birth or sex, or on codes produced by phonetic algorithms. However, this approach assumes that these variables are free of errors and completely observed, which is often not the case. To address this challenge, we have developed a Python package, BlockingPy, which performs blocking with modern approximate nearest neighbour search and graph algorithms to reduce the number of comparisons. The package supports both CPU and GPU execution. In this paper, we present the design of the package, its functionalities and two case studies related to official statistics. The software will be useful for researchers interested in linking data from various sources.
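
The blocking idea described in the abstract can be illustrated with a minimal, standard-library sketch. It does not use BlockingPy's API and is not the authors' implementation: the function names (`shingles`, `cosine`, `nearest_neighbour_blocks`), the character bigram representation and the toy records are all assumptions made for illustration, and an exact brute-force search stands in for the approximate nearest neighbour index a real package would use. Each record is linked to its most similar neighbour, and the connected components of the resulting graph are taken as blocks.

```python
from collections import Counter
from math import sqrt


def shingles(text, n=2):
    """Count character n-grams of a lower-cased string with spaces removed."""
    s = text.lower().replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_neighbour_blocks(records, n=2):
    """Link each record to its most similar other record, then return the
    connected components of that graph as blocks (lists of record indices).
    Assumes at least two records."""
    vecs = [shingles(r, n) for r in records]
    parent = list(range(len(records)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, v in enumerate(vecs):
        # Assumption: exact brute-force search. An approximate nearest
        # neighbour index would replace this step on large data.
        j = max((k for k in range(len(vecs)) if k != i),
                key=lambda k: cosine(v, vecs[k]))
        parent[find(i)] = find(j)  # add edge i -- j to the graph

    blocks = {}
    for i in range(len(records)):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


# Hypothetical toy records with typos in the names.
records = [
    "john smith 1985-03-12",
    "jon smith 1985-03-12",
    "maria kowalska 1990-07-01",
    "maria kowalsk 1990-07-01",
]

# Deterministic blocking on the exact string: every record gets its own key,
# so the typo variants never meet.
exact_keys = Counter(records)

blocks = nearest_neighbour_blocks(records)
print(len(exact_keys), blocks)
```

On these four records, exact-key blocking yields four singleton blocks, whereas the nearest-neighbour graph groups each typo variant with its counterpart. Only the two within-block pairs then need to be compared instead of all six possible pairs.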

