Entity resolution (ER) is a fundamental task in data integration that enables insights from heterogeneous data sources. The primary challenge of ER lies in classifying record pairs as matches or non-matches, which in multi-source ER (MS-ER) scenarios is complicated by data source heterogeneity and scalability issues. Existing methods for MS-ER generally require labeled record pairs and fail to reuse models effectively across multiple ER tasks. We propose MoRER (Model Repositories for Entity Resolution), a novel method for building a model repository consisting of classification models that solve ER problems. By leveraging feature distribution analysis, MoRER clusters similar ER tasks, thereby enabling the effective initialization of a model repository with a moderate labeling effort. Experimental results on three multi-source datasets demonstrate that MoRER achieves results comparable to or better than methods that operate under limited labeling budgets, such as active learning and transfer learning approaches, while outperforming self-supervised approaches that utilize large pre-trained language models. Compared to supervised transformer-based methods, MoRER achieves comparable or better results, depending on the training data size. Importantly, MoRER is the first method for building a model repository for ER problems, facilitating the continuous integration of new data sources by reducing the need to generate new training data.
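The idea of clustering similar ER tasks by their feature distributions can be illustrated with a minimal sketch. The abstract does not specify which distribution test or clustering algorithm MoRER uses, so everything below is an assumption made for illustration: each ER task is represented as a dictionary of similarity-feature samples, tasks are compared with a two-sample Kolmogorov-Smirnov statistic, and tasks closer than a hypothetical `threshold` are grouped via connected components. The task names, feature names, and values are invented toy data.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def task_distance(task_a, task_b):
    """Mean KS statistic over the similarity features two ER tasks share (assumed measure)."""
    features = task_a.keys() & task_b.keys()
    return sum(ks_statistic(task_a[f], task_b[f]) for f in features) / len(features)


def cluster_tasks(tasks, threshold=0.3):
    """Group tasks whose pairwise distance is below threshold (connected components)."""
    parent = {name: name for name in tasks}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(tasks, 2):
        if task_distance(tasks[a], tasks[b]) < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in tasks:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: each ER task (a pair of sources) holds similarity scores of its record pairs.
tasks = {
    "shopA-shopB": {"title_sim": [0.9, 0.8, 0.2, 0.1], "price_sim": [1.0, 0.9, 0.3, 0.2]},
    "shopA-shopC": {"title_sim": [0.85, 0.8, 0.25, 0.1], "price_sim": [0.95, 0.9, 0.3, 0.25]},
    "shopB-shopD": {"title_sim": [0.5, 0.45, 0.4, 0.35], "price_sim": [0.6, 0.55, 0.5, 0.5]},
}
clusters = cluster_tasks(tasks)
print(clusters)
```

In a repository built along these lines, one classifier would be trained per cluster, so labels would be needed only for a sample of record pairs from each cluster rather than for every source pair. A new data source would then be matched to the cluster with the closest feature distributions and would reuse that cluster's model.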