Entity resolution (ER) is the process of identifying records that refer to the same entities within one or across multiple databases. Numerous techniques have been developed to tackle ER challenges over the years, with recent emphasis placed on machine and deep learning methods for the matching phase. However, the quality of the benchmark datasets typically used in the experimental evaluations of learning-based matching algorithms has not been examined in the literature. To fill this gap, we propose four different approaches to assessing the difficulty and appropriateness of 13 established datasets: two theoretical approaches, which involve new measures of linearity and existing measures of complexity, and two practical approaches, namely the difference between the best non-linear and linear matchers and the difference between the best learning-based matcher and the perfect oracle. Our analysis demonstrates that most of the popular datasets pose rather easy classification tasks. As a result, they are not suitable for properly evaluating learning-based matching algorithms. To address this issue, we propose a new methodology for constructing benchmark datasets. We put it into practice by creating four new matching tasks, and we verify that these new benchmarks are more challenging and therefore more suitable for further advancements in the field.
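As a rough illustration of the two practical approaches, the sketch below computes both differences from per-matcher scores. It rests on several assumptions that the abstract does not state: effectiveness is measured with F1, the perfect oracle scores 1.0, and the best learning-based matcher is taken as the best over both the linear and the non-linear groups. The function names, dictionary keys, and matcher labels are hypothetical, and the input scores are placeholders rather than reported results.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts; returns 0.0 when no positives are involved."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def practical_gaps(linear_f1, nonlinear_f1, oracle_f1=1.0):
    """Compute the two practical differences for one dataset.

    linear_f1 / nonlinear_f1: dicts mapping a matcher label to its F1.
    oracle_f1: score of the perfect oracle (assumed to be 1.0 here).
    """
    best_linear = max(linear_f1.values())
    best_nonlinear = max(nonlinear_f1.values())
    # Assumption: the best learning-based matcher is the best of both groups.
    best_overall = max(best_linear, best_nonlinear)
    return {
        # Gain of the best non-linear matcher over the best linear one.
        "nonlinear_minus_linear": best_nonlinear - best_linear,
        # Room left between the best matcher and the oracle.
        "oracle_minus_best": oracle_f1 - best_overall,
    }


if __name__ == "__main__":
    # Placeholder scores for a single hypothetical dataset.
    gaps = practical_gaps(
        linear_f1={"linear_a": 0.90, "linear_b": 0.85},
        nonlinear_f1={"nonlinear_a": 0.93},
    )
    for name, value in gaps.items():
        print(f"{name}: {value:.3f}")
```

Under this reading, a dataset where both differences are close to zero leaves little room to separate matchers, which is the sense in which the abstract calls such tasks easy.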