Entity Matching (EM), which aims to identify all entity pairs in relational tables that refer to the same real-world entity, is one of the most important tasks in real-world data management systems. Because labeling data for EM is extremely labor-intensive, unsupervised EM is more applicable than supervised EM in practical scenarios. Traditional unsupervised EM assumes that all entities come from two tables; in practical applications, however, it is more common to match entities across multiple tables, a setting known as multi-table entity matching (multi-table EM). Unfortunately, effective and efficient unsupervised multi-table EM remains under-explored. To fill this gap, this paper formally studies the problem of unsupervised multi-table entity matching and proposes an effective and efficient solution, termed MultiEM. MultiEM is a parallelizable pipeline of enhanced entity representation, table-wise hierarchical merging, and density-based pruning. Extensive experimental results on six real-world benchmark datasets demonstrate the superiority of MultiEM in terms of effectiveness and efficiency.
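The abstract names three pipeline stages without describing them, so the Python below is only a toy illustration of how such a pipeline could be wired together; it is not the MultiEM implementation. Everything in it is an assumption made for illustration: character-trigram vectors stand in for the enhanced entity representation, mutual-nearest-neighbour matching on cluster centroids stands in for table-wise hierarchical merging, and a simple neighbour-count filter stands in for density-based pruning. All function names and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy representation (assumption): L2-normalised character-trigram counts."""
    s = f"  {text.lower()} "
    counts = Counter(s[i:i + 3] for i in range(len(s) - 2))
    norm = sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(k, 0.0) for k, w in u.items())

def centroid(cluster):
    """Unit-length mean direction of a cluster of (name, vector) pairs."""
    total = Counter()
    for _, vec in cluster:
        for k, w in vec.items():
            total[k] += w
    norm = sqrt(sum(w * w for w in total.values()))
    return {k: w / norm for k, w in total.items()}

def merge_two(left, right, threshold):
    """Merge two tables of clusters; mutual nearest neighbours above the threshold are unified."""
    if not left or not right:
        return left + right
    lc = [centroid(c) for c in left]
    rc = [centroid(c) for c in right]
    sims = [[cosine(a, b) for b in rc] for a in lc]
    merged, used = [], set()
    for i, row in enumerate(sims):
        j = max(range(len(right)), key=row.__getitem__)          # best right match for i
        back = max(range(len(left)), key=lambda k: sims[k][j])   # best left match for j
        if row[j] >= threshold and back == i:
            merged.append(left[i] + right[j])
            used.add(j)
        else:
            merged.append(left[i])
    merged.extend(c for j, c in enumerate(right) if j not in used)
    return merged

def hierarchical_merge(tables, threshold=0.5):
    """Merge tables pairwise, round by round, until a single table of clusters remains."""
    while len(tables) > 1:
        # The pairs within one round do not depend on each other.
        nxt = [merge_two(tables[k], tables[k + 1], threshold)
               for k in range(0, len(tables) - 1, 2)]
        if len(tables) % 2:
            nxt.append(tables[-1])
        tables = nxt
    return tables[0] if tables else []

def prune(cluster, threshold=0.5, min_neighbours=1):
    """Drop members that have too few similar neighbours inside their own cluster."""
    if len(cluster) < 2:
        return cluster
    kept = []
    for i, (name, vec) in enumerate(cluster):
        close = sum(1 for j, (_, other) in enumerate(cluster)
                    if j != i and cosine(vec, other) >= threshold)
        if close >= min_neighbours:
            kept.append((name, vec))
    return kept

def match(raw_tables, threshold=0.5):
    """Run the three toy stages and return groups of matched entity strings."""
    tables = [[[(e, embed(e))] for e in table] for table in raw_tables]
    clusters = hierarchical_merge(tables, threshold)
    pruned = [prune(c, threshold) for c in clusters]
    return [sorted(name for name, _ in c) for c in pruned if len(c) > 1]

example_tables = [
    ["Apple iPhone 12", "Dell XPS 13"],
    ["apple iphone 12", "Lenovo ThinkPad X1"],
    ["APPLE IPHONE 12", "dell xps 13"],
]
groups = match(example_tables)
```

In this sketch each round of `hierarchical_merge` handles disjoint pairs of tables, so the calls to `merge_two` within a round could be dispatched to separate workers; whether MultiEM parallelizes in this way is not stated in the abstract.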