UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition

Large language models (LLMs) have demonstrated remarkable generalizability, such as understanding arbitrary entities and relations. Instruction tuning has proven effective for distilling LLMs into more cost-efficient models such as Alpaca and Vicuna. Yet such student models still trail the original LLMs by large margins in downstream applications. In this paper, we explore targeted distillation with mission-focused instruction tuning to train student models that can excel in a broad application class such as open information extraction. Using named entity recognition (NER) as a case study, we show how ChatGPT can be distilled into much smaller UniversalNER models for open NER. For evaluation, we assemble the largest NER benchmark to date, comprising 43 datasets across 9 diverse domains such as biomedicine, programming, social media, law, and finance. Without using any direct supervision, UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 absolute F1 points on average. With a tiny fraction of ChatGPT's parameters, UniversalNER not only acquires ChatGPT's capability to recognize arbitrary entity types, but also exceeds its NER accuracy by 7-9 absolute F1 points on average. Remarkably, UniversalNER even outperforms, by a large margin, state-of-the-art multi-task instruction-tuned systems such as InstructUIE, which uses supervised NER examples. We also conduct thorough ablation studies to assess the impact of various components in our distillation approach. We release the distillation recipe, data, and UniversalNER models to facilitate future research on targeted distillation.
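The gains above are reported in absolute F1 points. The abstract does not say how F1 is computed, so the following is only a minimal sketch under one common NER convention, strict entity-level micro-F1, in which a predicted entity counts as correct only if its span and type both match a gold entity. The function name `entity_f1` and the `(start, end, type)` span encoding are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span encoding: (start_token, end_token, entity_type).
Span = Tuple[int, int, str]


def entity_f1(gold: Iterable[Set[Span]], pred: Iterable[Set[Span]]) -> float:
    """Strict entity-level micro-F1 over a corpus of sentences.

    A prediction is a true positive only if its boundaries and its type
    exactly match a gold entity in the same sentence.
    """
    tp = n_gold = n_pred = 0
    for gold_spans, pred_spans in zip(gold, pred):
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        # Covers empty inputs and the case of no correct predictions.
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# One sentence, two gold entities; the second prediction has the wrong type.
gold = [{(0, 2, "PER"), (5, 6, "LOC")}]
pred = [{(0, 2, "PER"), (5, 6, "ORG")}]
score = entity_f1(gold, pred)
```

Under this convention, a gap of "30 absolute F1 points" would mean a difference of 0.30 between two such scores, averaged over datasets.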
