Entity resolution, the task of identifying and consolidating records that refer to the same real-world entity, plays a pivotal role in sectors such as e-commerce, healthcare, and law enforcement. The emergence of Large Language Models (LLMs) like GPT-4 has introduced a new dimension to this task by bringing advanced linguistic capabilities to bear on record matching. This paper explores the potential of LLMs in the entity resolution process, examining both their advantages and the computational complexity of large-scale matching. We introduce strategies for using LLMs efficiently, including the selection of an optimal set of matching questions, a problem we call MQsSP and prove to be NP-hard. Our approach chooses the most effective matching questions while keeping consumption within a given budget. Additionally, we propose a method to adjust the distribution of possible partitions after receiving responses from LLMs, with the goal of reducing the uncertainty of entity resolution. We evaluate our approach using entropy as a metric, and our experimental results demonstrate the efficiency and effectiveness of the proposed methods, offering promising prospects for real-world applications.
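To make the entropy metric and the distribution adjustment concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes uncertainty is measured as the Shannon entropy of a probability distribution over candidate partitions, and that an LLM answer to a pairwise matching question is folded in with a Bayes-style update. The names `entropy`, `same_cluster`, `update`, and the `accuracy` parameter are hypothetical, as are the three example partitions and their probabilities.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping partition -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def same_cluster(partition, a, b):
    """True if records a and b share a block in the given partition."""
    return any(a in block and b in block for block in partition)


def update(dist, a, b, answer, accuracy=1.0):
    """Reweight partitions after an answer to 'do a and b match?'.

    `accuracy` is an assumed probability that the answer is correct;
    1.0 treats the answer as fully reliable.
    """
    weighted = {}
    for partition, p in dist.items():
        agrees = same_cluster(partition, a, b) == answer
        weighted[partition] = p * (accuracy if agrees else 1.0 - accuracy)
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("answer is inconsistent with every candidate partition")
    # Renormalize and drop partitions that have been ruled out.
    return {k: v / total for k, v in weighted.items() if v > 0}


def make_partition(*blocks):
    """Build a hashable partition from iterables of record ids."""
    return frozenset(frozenset(block) for block in blocks)


# Hypothetical prior over three candidate partitions of records r1, r2, r3.
prior = {
    make_partition(["r1", "r2"], ["r3"]): 0.5,
    make_partition(["r1"], ["r2"], ["r3"]): 0.25,
    make_partition(["r1", "r2", "r3"]): 0.25,
}

# Suppose the LLM answers "no" to the question "do r2 and r3 match?".
posterior = update(prior, "r2", "r3", answer=False)

print(f"entropy before: {entropy(prior):.3f} bits")
print(f"entropy after:  {entropy(posterior):.3f} bits")
```

In this toy case the answer rules out the partition that merges all three records, so the remaining probability mass is renormalized over two partitions and the entropy drops. With `accuracy` below 1.0, inconsistent partitions are down-weighted instead of eliminated.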