Entity resolution, the task of identifying and merging records that refer to the same real-world entity, is crucial in areas such as Web data integration. Its importance is underscored by the large number of duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. Large Language Models (LLMs) such as GPT-4 have demonstrated advanced linguistic capabilities, which suggests a new paradigm for this task. In this paper, we propose a demonstration system named BoostER that examines the possibility of leveraging LLMs in the entity resolution process and reveals advantages in both ease of deployment and low cost. Our approach optimally selects a set of matching questions, poses them to LLMs for verification, and then refines the distribution of entity resolution results using the LLMs' responses. This offers a promising path to high-quality entity resolution for real-world applications, especially for individuals or small companies, without the need for extensive model training or significant financial investment.
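The select-verify-refine loop described above can be illustrated with a minimal sketch. The abstract does not state the selection criterion or the update rule, so everything here is an assumption made for illustration: candidate results are modeled as weighted partitions of record ids, the question chosen is the record pair whose match status is most uncertain (highest binary entropy), and the distribution is reweighted with a Bayesian update under an assumed LLM accuracy of 0.9. The function names and the hard-coded LLM answer are hypothetical and are not taken from BoostER.

```python
import math

def same_cluster(partition, a, b):
    """True if records a and b share a cluster in this candidate result."""
    return any(a in cluster and b in cluster for cluster in partition)

def match_probability(dist, pair):
    """Probability mass of the candidate results in which the pair matches."""
    a, b = pair
    return sum(p for partition, p in dist if same_cluster(partition, a, b))

def binary_entropy(p):
    """Uncertainty (in bits) of a yes/no question answered 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_question(dist, pairs):
    """Assumed criterion: ask about the pair whose match status is most uncertain."""
    return max(pairs, key=lambda pair: binary_entropy(match_probability(dist, pair)))

def refine(dist, pair, llm_says_match, accuracy=0.9):
    """Bayesian reweighting, assuming the LLM answers correctly with prob. `accuracy`."""
    a, b = pair
    weighted = []
    for partition, p in dist:
        agrees = same_cluster(partition, a, b) == llm_says_match
        weighted.append((partition, p * (accuracy if agrees else 1 - accuracy)))
    total = sum(w for _, w in weighted)
    return [(partition, w / total) for partition, w in weighted]

# Toy distribution over three candidate results for records r1, r2, r3.
dist = [
    ((frozenset({"r1", "r2", "r3"}),), 0.5),
    ((frozenset({"r1", "r2"}), frozenset({"r3"})), 0.3),
    ((frozenset({"r1"}), frozenset({"r2"}), frozenset({"r3"})), 0.2),
]
pairs = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]

question = select_question(dist, pairs)
llm_says_match = False  # stand-in for a real LLM reply to "do these records match?"
refined = refine(dist, question, llm_says_match)
```

In this toy run, a "no match" answer shifts probability mass away from the single-cluster result and toward the two results that keep the questioned pair apart. A real system would replace the hard-coded answer with an actual LLM call and would likely weigh the cost of each question as well.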