Recent advances in Entity Resolution (ER) have leveraged Large Language Models (LLMs), achieving strong performance but at the cost of substantial computational resources or high financial overhead. Existing LLM-based ER approaches either operate in unsupervised settings, relying on very large and costly models, or in supervised settings, requiring ground-truth annotations, leaving a critical gap between time efficiency and effectiveness. To make LLM-powered ER more practical, we investigate Knowledge Distillation (KD) as a means of transferring knowledge from large, effective models (Teachers) to smaller, more efficient models (Students) without requiring gold labels. We introduce DistillER, the first framework that systematically bridges this gap across three dimensions: (i) Data Selection, where we study strategies for identifying informative subsets of data; (ii) Knowledge Elicitation, where we compare single- and multi-teacher settings across LLMs and smaller language models (SLMs); and (iii) Distillation Algorithms, where we evaluate supervised fine-tuning and reinforcement learning approaches. Our experiments reveal that supervised fine-tuning of Students on noisy labels generated by LLM Teachers consistently outperforms alternative KD strategies, while also enabling high-quality explanation generation. Finally, we benchmark DistillER against established supervised and unsupervised ER methods based on LLMs and SLMs, demonstrating significant improvements in both effectiveness and efficiency.
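The winning recipe above, supervised fine-tuning of a Student on noisy labels elicited from an LLM Teacher, can be illustrated with a minimal sketch. Everything in the snippet is an illustrative assumption rather than the paper's actual configuration: the Teacher and Student model names, the prompt wording, and the toy record pairs are placeholders, since the abstract does not specify them.

```python
# Minimal sketch of KD for ER: an LLM Teacher pseudo-labels record pairs,
# then a small Student is supervised-fine-tuned on those noisy labels.
# Model names, prompt, and data below are assumptions for illustration only.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments, pipeline)

# --- Step 1: Knowledge Elicitation from the Teacher (no gold labels) -------
# Unlabeled candidate record pairs, e.g. the output of a blocking step.
pairs = [
    ("iPhone 14 Pro 128GB black", "Apple iPhone 14 Pro (128 GB, Black)"),
    ("iPhone 14 Pro 128GB black", "Samsung Galaxy S23 256GB"),
]

# Hypothetical Teacher choice; any instruction-tuned LLM could stand in here.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def teacher_label(a: str, b: str) -> int:
    """Ask the Teacher whether two records refer to the same entity."""
    prompt = (f"Do these two product records refer to the same entity?\n"
              f"Record A: {a}\nRecord B: {b}\nAnswer Yes or No.")
    out = teacher(prompt, max_new_tokens=5)[0]["generated_text"]
    return int("yes" in out[len(prompt):].lower())  # noisy pseudo-label

noisy_labels = [teacher_label(a, b) for a, b in pairs]

# --- Step 2: supervised fine-tuning of the Student on the noisy labels -----
student_name = "roberta-base"  # an SLM, orders of magnitude cheaper to serve
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForSequenceClassification.from_pretrained(
    student_name, num_labels=2)

class PairDataset(Dataset):
    """Encodes record pairs as a binary match / non-match task."""
    def __init__(self, pairs, labels):
        self.enc = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                             truncation=True, padding=True,
                             return_tensors="pt")
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="distilled-student", num_train_epochs=1),
    train_dataset=PairDataset(pairs, noisy_labels),
)
trainer.train()  # the Student now imitates the Teacher's matching decisions
```

The same loop extends naturally to the settings the abstract compares: multiple Teachers can vote on each pseudo-label, and the Teacher prompt can additionally request a short justification to support explanation generation.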