Recent advances in Entity Resolution (ER) have leveraged Large Language Models (LLMs), achieving strong performance but at the cost of substantial computational resources or high financial overhead. Existing LLM-based ER approaches either operate in unsupervised settings, relying on very large and costly models, or in supervised settings, requiring ground-truth annotations, leaving a critical gap between time efficiency and effectiveness. To make LLM-powered ER more practical, we investigate Knowledge Distillation (KD) as a means of transferring knowledge from large, effective models (Teachers) to smaller, more efficient models (Students) without requiring gold labels. We introduce DistillER, the first framework that systematically bridges this gap across three dimensions: (i) Data Selection, where we study strategies for identifying informative subsets of data; (ii) Knowledge Elicitation, where we compare single- and multi-teacher settings across LLMs and smaller language models (SLMs); and (iii) Distillation Algorithms, where we evaluate supervised fine-tuning and reinforcement learning approaches. Our experiments reveal that supervised fine-tuning of Students on noisy labels generated by LLM Teachers consistently outperforms alternative KD strategies, while also enabling high-quality explanation generation. Finally, we benchmark DistillER against established supervised and unsupervised ER methods based on LLMs and SLMs, demonstrating significant improvements in both effectiveness and efficiency.
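The winning recipe above, supervised fine-tuning of a Student on noisy labels elicited from an LLM Teacher, can be illustrated with a minimal sketch. Everything in the snippet is an illustrative assumption rather than the paper's actual configuration: the Teacher and Student model names, the prompt wording, and the toy record pairs are placeholders, since the abstract does not specify them.

```python
# Minimal sketch of KD for ER: an LLM Teacher pseudo-labels record pairs,
# then a small Student is supervised-fine-tuned on those noisy labels.
# Model names, prompt, and data below are assumptions for illustration only.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments, pipeline)

# --- Step 1: Knowledge Elicitation from the Teacher (no gold labels) -------
# Unlabeled candidate record pairs, e.g. the output of a blocking step.
pairs = [
    ("iPhone 14 Pro 128GB black", "Apple iPhone 14 Pro (128 GB, Black)"),
    ("iPhone 14 Pro 128GB black", "Samsung Galaxy S23 256GB"),
]

# Hypothetical Teacher choice; any instruction-tuned LLM could stand in here.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def teacher_label(a: str, b: str) -> int:
    """Ask the Teacher whether two records refer to the same entity."""
    prompt = (f"Do these two product records refer to the same entity?\n"
              f"Record A: {a}\nRecord B: {b}\nAnswer Yes or No.")
    out = teacher(prompt, max_new_tokens=5)[0]["generated_text"]
    return int("yes" in out[len(prompt):].lower())  # noisy pseudo-label

noisy_labels = [teacher_label(a, b) for a, b in pairs]

# --- Step 2: supervised fine-tuning of the Student on the noisy labels -----
student_name = "roberta-base"  # an SLM, orders of magnitude cheaper to serve
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForSequenceClassification.from_pretrained(
    student_name, num_labels=2)

class PairDataset(Dataset):
    """Encodes record pairs as a binary match / non-match task."""
    def __init__(self, pairs, labels):
        self.enc = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                             truncation=True, padding=True,
                             return_tensors="pt")
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="distilled-student", num_train_epochs=1),
    train_dataset=PairDataset(pairs, noisy_labels),
)
trainer.train()  # the Student now imitates the Teacher's matching decisions
```

The same loop extends naturally to the settings the abstract compares: multiple Teachers can vote on each pseudo-label, and the Teacher prompt can additionally request a short justification to support explanation generation.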