Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format, schema, and semantics, creating substantial challenges for EM. We refer to this setting as Heterogeneous EM (HEM). This survey offers a unified perspective on HEM by introducing a taxonomy, grounded in prior work, that distinguishes two primary categories -- representation and semantic heterogeneity -- and their subtypes. The taxonomy provides a systematic lens for understanding how variations in data form and meaning shape the complexity of matching tasks. We then connect this framework to the FAIR principles -- Findability, Accessibility, Interoperability, and Reusability -- demonstrating how they both reveal the challenges of HEM and suggest strategies for mitigating them. Building on this foundation, we critically review recent EM methods, examining their ability to address different heterogeneity types, and conduct targeted experiments on state-of-the-art models to evaluate their robustness and adaptability under semantic heterogeneity. Our analysis uncovers persistent limitations in current approaches and points to promising directions for future research, including multimodal matching, human-in-the-loop workflows, deeper integration with large language models and knowledge graphs, and fairness-aware evaluation in heterogeneous settings.