Named entity recognition (NER) is a fundamental task that underpins numerous downstream applications. Recently, researchers have employed pre-trained language models (PLMs) and large language models (LLMs) to address this task. However, fully leveraging the capabilities of PLMs and LLMs with minimal human effort remains challenging. In this paper, we propose GPT4NER, a method that prompts LLMs to solve the few-shot NER task. GPT4NER constructs effective prompts from three key components: entity definitions, few-shot examples, and chain-of-thought reasoning. By prompting LLMs with these prompts, GPT4NER recasts few-shot NER, traditionally treated as a sequence-labeling problem, as a sequence-generation problem. We conduct experiments on two benchmark datasets, CoNLL2003 and OntoNotes5.0, and compare the performance of GPT4NER to representative state-of-the-art models in both few-shot and fully supervised settings. Experimental results demonstrate that GPT4NER achieves an $F_1$ of 83.15\% on CoNLL2003 and 70.37\% on OntoNotes5.0, significantly outperforming few-shot baselines by an average margin of 7 points. Compared to fully supervised baselines, GPT4NER reaches 87.9\% of their best performance on CoNLL2003 and 76.4\% of their best performance on OntoNotes5.0. We also evaluate with a relaxed-match metric and report performance on the sub-task of named entity extraction (NEE); experiments demonstrate that both help to better understand model behavior on the NER task.
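To make the three prompt components concrete, the following is a minimal sketch of how such a generation-style NER prompt might be assembled. The entity types are the standard CoNLL2003 tag set, but the prompt wording, the rationale text, and the `Entities:` output format are illustrative assumptions, not the paper's verbatim templates.

```python
# Illustrative sketch only: the exact prompt wording, chain-of-thought text,
# and output format below are assumptions, not GPT4NER's actual templates.

# Component 1: entity definitions (CoNLL2003 types; definitions paraphrased).
ENTITY_DEFINITIONS = {
    "PER": "names of people",
    "ORG": "companies, institutions, and other organizations",
    "LOC": "countries, cities, and other locations",
    "MISC": "other named entities such as nationalities and events",
}

# Component 2 + 3: few-shot demonstrations, each paired with a short
# chain-of-thought rationale emitted before the final answer.
FEW_SHOT_EXAMPLES = [
    {
        "sentence": "U.N. official Ekeus heads for Baghdad.",
        "rationale": ("'U.N.' is an organization, so it is ORG; "
                      "'Ekeus' is a person's name, so it is PER; "
                      "'Baghdad' is a city, so it is LOC."),
        "answer": "U.N.: ORG | Ekeus: PER | Baghdad: LOC",
    },
]

def build_prompt(sentence: str) -> str:
    """Assemble entity definitions, few-shot examples with rationales,
    and the query sentence into a single sequence-generation prompt."""
    lines = ["Extract named entities from the sentence.", "Entity types:"]
    lines += [f"- {t}: {d}" for t, d in ENTITY_DEFINITIONS.items()]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"\nSentence: {ex['sentence']}")
        lines.append(f"Reasoning: {ex['rationale']}")
        lines.append(f"Entities: {ex['answer']}")
    lines.append(f"\nSentence: {sentence}")
    lines.append("Reasoning:")
    return "\n".join(lines)

if __name__ == "__main__":
    # The completed prompt would be sent to an LLM; the model then generates
    # the rationale followed by the entity list as plain text, which is
    # parsed back into (span, type) pairs for evaluation.
    print(build_prompt("Japan began the defence of their Asian Cup title."))
```

Because the model emits entities as generated text rather than per-token labels, NER here reduces to prompting plus a lightweight parse of the `Entities:` line, which is what allows the task to be framed as sequence generation.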