Entity resolution is essential for data integration, enabling analytics and insights from complex systems. Multi-source and incremental entity resolution address the challenges of integrating diverse and dynamic data, which are common in real-world scenarios. A critical question is how to classify record pairs from new and existing data sources as matches or non-matches. Traditional threshold-based methods often yield lower quality than machine learning (ML) approaches, while the results of incremental methods can be unstable because they depend on the order in which new data is integrated. Additionally, how to reuse training data and existing models for new data sources remains unresolved for multi-source entity resolution. Even transfer learning does not address the question of which source domain should supply the model and training data for a given target domain. Naively training a new model for each new linkage problem is inefficient. This work addresses these challenges, focusing on creating and managing models with little labeling effort and on selecting suitable models for new data sources based on feature distributions. The results show that our method, StoRe, achieves comparable result quality. In terms of efficiency, StoRe outperforms both a multi-source active learning approach and a transfer learning approach: it is up to 48 times faster than the active learning approach and faster by a factor of 163 than the transfer learning method.
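The idea of selecting an existing model for a new data source based on feature distributions can be illustrated with a minimal sketch. The abstract does not state which distribution comparison StoRe uses, so everything below is an assumption made for illustration: the two-sample Kolmogorov-Smirnov statistic as the distance, the averaging over features, the `max_distance` cutoff, and all function and variable names are hypothetical and not taken from the paper. Each linkage problem is represented as a matrix of similarity feature vectors, one row per record pair.

```python
from bisect import bisect_right
from typing import Dict, Optional, Sequence

# Rows are record pairs, columns are similarity features (e.g. title, author).
FeatureMatrix = Sequence[Sequence[float]]


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def distribution_distance(p: FeatureMatrix, q: FeatureMatrix) -> float:
    """Average the per-feature KS statistics of two linkage problems (assumed metric)."""
    cols_p, cols_q = list(zip(*p)), list(zip(*q))
    return sum(ks_statistic(cp, cq) for cp, cq in zip(cols_p, cols_q)) / len(cols_p)


def select_model(
    new_problem: FeatureMatrix,
    repository: Dict[str, FeatureMatrix],
    max_distance: float = 0.3,  # hypothetical cutoff, not from the paper
) -> Optional[str]:
    """Return the key of the stored model whose training features are closest.

    Returns None when no stored problem is close enough, which in this sketch
    signals that a new model would have to be trained with fresh labels.
    """
    best_name, best_dist = None, float("inf")
    for name, features in repository.items():
        dist = distribution_distance(new_problem, features)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


# Toy data: two already solved linkage problems and one new problem.
repository = {
    "books": [[0.9, 0.8], [0.85, 0.9], [0.2, 0.1], [0.95, 0.7]],
    "products": [[0.4, 0.5], [0.5, 0.45], [0.55, 0.6], [0.45, 0.5]],
}
new_problem = [[0.88, 0.82], [0.25, 0.15], [0.92, 0.75], [0.9, 0.85]]

print(select_model(new_problem, repository))  # -> books
```

In this toy setup the new problem reuses the model stored under `books`, so no additional labels are needed; a stricter `max_distance` would instead return `None` and trigger training of a new model. The efficiency gain in such a design comes from comparing distributions, which is cheap, instead of labeling and training for every new source.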