Entity resolution (ER) is a fundamental task in data integration that enables insights to be drawn from heterogeneous data sources. The primary challenge of ER lies in classifying record pairs as matches or non-matches, which in multi-source ER (MS-ER) scenarios is complicated by data source heterogeneity and scalability issues. Existing methods for MS-ER generally require labeled record pairs, and they fail to reuse models effectively across multiple ER tasks. We propose MoRER (Model Repositories for Entity Resolution), a novel method for building a model repository of classification models that solve ER problems. By leveraging feature distribution analysis, MoRER clusters similar ER tasks, thereby enabling the effective initialization of a model repository with moderate labeling effort. Experimental results on three multi-source datasets demonstrate that MoRER achieves results comparable to or better than those of methods designed for limited labeling budgets, such as active learning and transfer learning approaches, while outperforming self-supervised approaches that utilize large pre-trained language models. Compared with supervised transformer-based methods, MoRER achieves comparable or better results, depending on the size of the training dataset used.
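The idea of clustering ER tasks by their feature distributions can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distribution test or clustering algorithm MoRER uses, so the two-sample Kolmogorov-Smirnov statistic, the threshold of 0.3, the connected-components grouping, and all task and feature names below are assumptions chosen for illustration only. Each ER task is represented as a mapping from a similarity feature to the values observed over its candidate record pairs.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence

# Hypothetical representation: task name -> feature name -> similarity values
Task = Dict[str, Sequence[float]]


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical CDFs of a and b (0.0 means identical distributions)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def task_distance(t1: Task, t2: Task) -> float:
    """Mean KS statistic over the features both tasks share (assumed measure)."""
    shared = sorted(set(t1) & set(t2))
    return sum(ks_statistic(t1[f], t2[f]) for f in shared) / len(shared)


def cluster_tasks(tasks: Dict[str, Task], threshold: float = 0.3) -> List[List[str]]:
    """Group tasks into connected components of the graph that links two
    tasks whenever their distance is at most the (assumed) threshold."""
    names = list(tasks)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        # Union-find lookup with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if task_distance(tasks[u], tasks[v]) <= threshold:
                parent[find(u)] = find(v)

    clusters: Dict[str, List[str]] = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: three ER tasks between hypothetical sources A, B, C, and D
tasks = {
    "A-B": {"title_sim": [0.1, 0.2, 0.8, 0.9], "year_sim": [0.0, 1.0, 1.0, 1.0]},
    "A-C": {"title_sim": [0.15, 0.25, 0.85, 0.95], "year_sim": [0.0, 1.0, 1.0, 1.0]},
    "B-D": {"title_sim": [0.4, 0.45, 0.5, 0.55], "year_sim": [0.0, 0.0, 0.0, 1.0]},
}
clusters = cluster_tasks(tasks)
print(clusters)  # [['A-B', 'A-C'], ['B-D']]
```

In a repository of this kind, each resulting cluster would share one classification model, so labels would only need to be collected once per cluster rather than once per source pair. How MoRER actually selects the pairs to label is not described in the abstract and is left out of this sketch.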