Applying powerful large language models (LLMs) to the fundamental task of named entity recognition (NER) has drawn much attention recently. This work investigates how far zero-shot NER with LLMs can be pushed via a training-free self-improving strategy. We propose a self-improving framework that uses an unlabeled corpus to stimulate the self-learning ability of LLMs on NER. First, we use an LLM to make predictions on the unlabeled corpus and obtain self-annotated data. Second, we explore various strategies for selecting reliable samples from the self-annotated dataset as demonstrations, considering the similarity, diversity, and reliability of the demonstrations. Finally, we conduct inference on the test query via in-context learning with the selected self-annotated demonstrations. Through comprehensive experimental analysis, our study yields the following findings: (1) the self-improving framework further pushes the boundary of zero-shot NER with LLMs and achieves a clear performance improvement; (2) iterative self-improving or naively increasing the size of the unlabeled corpus does not guarantee improvements; (3) there might still be room for improvement via more advanced strategies for reliable entity selection.
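The three steps of the framework (self-annotation, demonstration selection, in-context inference) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the `LLM` callable interface, the prompt format, the use of repeated sampling with majority voting as a reliability score, token-level Jaccard overlap as the similarity measure, and the `toy_llm` stand-in that replaces a real model. Diversity-based selection is omitted for brevity.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Entities = Dict[str, str]                 # mention -> entity type
LLM = Callable[[str], Entities]           # hypothetical interface: prompt -> entities
Annotated = Tuple[str, Entities, float]   # (sentence, entities, reliability score)


def self_annotate(llm: LLM, corpus: List[str], n_samples: int = 5) -> List[Annotated]:
    """Step 1: label unlabeled sentences; score entities by vote share (assumed)."""
    annotated = []
    for sentence in corpus:
        votes: Counter = Counter()
        for _ in range(n_samples):
            for pair in llm(f"Text: {sentence}\nEntities:").items():
                votes[pair] += 1
        # Keep (mention, type) pairs predicted in more than half of the samples.
        kept = {pair: n / n_samples for pair, n in votes.items() if n / n_samples > 0.5}
        if kept:
            score = sum(kept.values()) / len(kept)
            annotated.append((sentence, dict(kept.keys()), score))
    return annotated


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a simple stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def select_demos(query: str, annotated: List[Annotated],
                 k: int = 2, min_score: float = 0.8) -> List[Annotated]:
    """Step 2: filter by reliability, then take the k most similar samples."""
    reliable = [ex for ex in annotated if ex[2] >= min_score]
    reliable.sort(key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return reliable[:k]


def predict(llm: LLM, query: str, demos: List[Annotated]) -> Entities:
    """Step 3: in-context inference with the self-annotated demonstrations."""
    shots = "".join(f"Text: {s}\nEntities: {e}\n\n" for s, e, _ in demos)
    return llm(shots + f"Text: {query}\nEntities:")


# Toy stand-in for a real LLM: tags gazetteer words in the last "Text:" line.
GAZETTEER = {"Alice": "PER", "Paris": "LOC"}


def toy_llm(prompt: str) -> Entities:
    sentence = prompt.rsplit("Text: ", 1)[1].split("\n", 1)[0]
    tokens = sentence.replace(".", "").split()
    return {tok: GAZETTEER[tok] for tok in tokens if tok in GAZETTEER}


if __name__ == "__main__":
    corpus = ["Alice moved to Paris.", "The weather is nice.", "Paris is crowded."]
    pool = self_annotate(toy_llm, corpus, n_samples=3)
    demos = select_demos("Alice visited Paris.", pool, k=1)
    print(predict(toy_llm, "Alice visited Paris.", demos))
```

With a real model, `toy_llm` would be replaced by a function that sends the prompt to the LLM and parses its response into a mention-to-type mapping; no parameters are updated at any point, which is what makes the strategy training-free.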