The goal of entity resolution is to identify records in multiple datasets that represent the same real-world entity. However, comparing all records across datasets is computationally intensive and leads to long runtimes. To reduce these runtimes, entity resolution pipelines consist of two parts: a blocker that applies a computationally cheap method to select candidate record pairs, and a matcher that subsequently identifies the matching pairs within this candidate set using more expensive methods. This paper presents SC-Block, a blocking method that uses supervised contrastive learning to position records in the embedding space and nearest neighbour search to build the candidate set. We benchmark SC-Block against eight state-of-the-art blocking methods. To relate the training time of SC-Block to the reduction in the overall runtime of the entity resolution pipeline, we combine SC-Block with four matching methods into complete pipelines. To measure the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher. The results show that SC-Block creates smaller candidate sets and that pipelines with SC-Block execute 1.5 to 2 times faster than pipelines with other blockers, without sacrificing F1 score. Blockers are often evaluated on relatively small datasets, which can cause runtime effects resulting from a large vocabulary size to be overlooked. To measure runtimes in a more challenging setting, we introduce a new benchmark dataset that requires large numbers of product offers to be blocked. On this large-scale benchmark dataset, pipelines that combine SC-Block with the best-performing matcher execute 8 times faster than pipelines that combine another blocker with the same matcher, reducing the runtime from 2.5 hours to 18 minutes and clearly compensating for the 5 minutes required to train SC-Block.
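To make the blocking step and the pair completeness measure concrete, the following is a minimal sketch and not the SC-Block implementation. It assumes that record embeddings are already available (here replaced by random toy vectors), uses a brute-force cosine search in NumPy in place of a dedicated nearest neighbour index, and takes pair completeness to be the fraction of true matching pairs contained in the candidate set. All function names are hypothetical.

```python
import numpy as np


def top_k_candidates(query_emb, index_emb, k):
    """Brute-force cosine top-k search; returns a set of (query_idx, index_idx) pairs."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    x = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    sims = q @ x.T                              # cosine similarity matrix
    k = min(k, x.shape[0])
    nearest = np.argsort(-sims, axis=1)[:, :k]  # k most similar index records per query
    return {(i, int(j)) for i in range(q.shape[0]) for j in nearest[i]}


def pair_completeness(candidates, true_matches):
    """Fraction of true matching pairs that survive blocking (assumed definition)."""
    if not true_matches:
        return 1.0
    return len(candidates & true_matches) / len(true_matches)


def smallest_k_for_target(query_emb, index_emb, true_matches, target=0.995):
    """Grow k until the candidate set reaches the target pair completeness."""
    n = index_emb.shape[0]
    for k in range(1, n + 1):
        candidates = top_k_candidates(query_emb, index_emb, k)
        pc = pair_completeness(candidates, true_matches)
        if pc >= target:
            return k, candidates, pc
    # Not reached when all matches refer to valid records: k = n keeps every pair.
    return n, candidates, pc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for learned embeddings: 50 indexed records, 10 noisy query copies.
    index_emb = rng.normal(size=(50, 16))
    query_emb = index_emb[:10] + 0.01 * rng.normal(size=(10, 16))
    true_matches = {(i, i) for i in range(10)}

    k, candidates, pc = smallest_k_for_target(query_emb, index_emb, true_matches)
    print(f"k={k}, candidate pairs={len(candidates)}, pair completeness={pc:.3f}")
```

The candidate set returned for the smallest sufficient `k` is what would be handed to the matcher. A blocker that places matching records closer together reaches the target with a smaller `k`, which is the source of the smaller candidate sets described above.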