Named entity recognition (NER) is evolving from a sequence labeling task into a generative paradigm with the rise of large language models (LLMs). We conduct a systematic evaluation of open-source LLMs on both flat and nested NER tasks. We investigate several research questions, including the performance gap between generative NER and traditional NER models, the impact of output formats, whether LLMs rely on memorization, and the preservation of general capabilities after fine-tuning. Through experiments across eight LLMs of varying scales and four standard NER datasets, we find that: (1) With parameter-efficient fine-tuning and structured output formats such as inline bracketing or XML tags, open-source LLMs achieve performance competitive with traditional encoder-based models and surpass closed-source LLMs like GPT-3; (2) The NER capability of LLMs stems from their instruction-following and generative power, not mere memorization of entity-label pairs; and (3) NER instruction tuning has minimal impact on the general capabilities of LLMs, and even improves performance on datasets like DROP thanks to enhanced entity understanding. These findings demonstrate that generative NER with LLMs is a promising, user-friendly alternative to traditional methods. We release the data and code at https://github.com/szu-tera/LLMs4NER.
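To make the two structured output formats concrete, here is a minimal sketch of how a generative model's NER output might look under an inline bracketed template versus an XML-style template. The `[span|LABEL]` convention and the helper functions are hypothetical illustrations, not the paper's exact prompt templates.

```python
# Hypothetical illustration of the two structured NER output formats
# mentioned in the abstract; the exact templates used in the paper may differ.

sentence = "Barack Obama visited Paris."
entities = [("Barack Obama", "PER"), ("Paris", "LOC")]

def inline_bracketed(text, ents):
    # Wrap each entity span as [span|LABEL], e.g. "[Barack Obama|PER] ..."
    for span, label in ents:
        text = text.replace(span, f"[{span}|{label}]")
    return text

def xml_tagged(text, ents):
    # Wrap each entity span in XML-style tags, e.g. "<PER>Barack Obama</PER> ..."
    for span, label in ents:
        text = text.replace(span, f"<{label}>{span}</{label}>")
    return text

print(inline_bracketed(sentence, entities))
# [Barack Obama|PER] visited [Paris|LOC].
print(xml_tagged(sentence, entities))
# <PER>Barack Obama</PER> visited <LOC>Paris</LOC>.
```

Both formats keep the entity spans in their original sentence context, which is what makes them easy for a generative model to emit and easy to parse back into span-label pairs for evaluation.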