Entity matching is a central problem in data integration and cleaning, underlying tasks such as fuzzy joins and deduplication. Traditional approaches address fuzzy term representations with string measures such as edit distance and Jaccard similarity and, more recently, with embeddings and deep neural networks, including large language models (LLMs) such as GPT. However, the core difficulty of entity matching goes beyond term fuzziness: what counts as a "match" is itself ambiguous, especially when integrating with external databases. This ambiguity arises because entities are described at varying levels of detail and granularity, which complicates exact matching. We propose a novel approach that shifts the focus from identifying semantic similarity alone to understanding and defining the "relations" between entities, which we regard as crucial for resolving ambiguity in matching. By predefining a set of relations relevant to the task at hand, our method lets analysts navigate the spectrum of similarity more effectively, from exact matches to conceptually related entities.
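As a rough illustration of why a similarity score alone can leave the notion of a "match" ambiguous, the following minimal Python sketch compares token-level Jaccard similarity with explicit relation labels. The relation names (`EXACT_MATCH`, `NARROWER_THAN`, and so on), the `select_matches` helper, and the toy pairs are all hypothetical assumptions made for this example; the abstract does not enumerate the actual relation set or describe how relations are assigned.

```python
from enum import Enum


class Relation(Enum):
    """Hypothetical relation labels; the source does not list its relation set."""
    EXACT_MATCH = "exact_match"
    BROADER_THAN = "broader_than"    # left entity is more general than right
    NARROWER_THAN = "narrower_than"  # left entity is more specific than right
    RELATED = "related"
    UNRELATED = "unrelated"


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def select_matches(labeled_pairs, accepted):
    """Keep the pairs whose relation label is in the analyst's accepted set."""
    return [(left, right) for left, right, rel in labeled_pairs if rel in accepted]


# Hand-labeled toy pairs (illustrative data, not taken from the source).
pairs = [
    ("apple juice", "apple juice", Relation.EXACT_MATCH),
    ("apple juice", "fruit juice", Relation.NARROWER_THAN),
    ("apple juice", "apple pie", Relation.RELATED),
]

# The last two pairs share one token out of three, so Jaccard scores them
# identically even though their relations differ.
score_juice = jaccard("apple juice", "fruit juice")
score_pie = jaccard("apple juice", "apple pie")

# The analyst chooses which relations count as a match for the task.
strict = select_matches(pairs, {Relation.EXACT_MATCH})
loose = select_matches(pairs, {Relation.EXACT_MATCH, Relation.NARROWER_THAN})
```

In this sketch, "fruit juice" and "apple pie" receive the same Jaccard score against "apple juice", so no threshold can separate them, whereas the accepted relation set lets a strict task keep only the exact pair and a looser task also keep the narrower-than pair.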