Although recent pre-trained transformer-based models perform named entity recognition (NER) with high accuracy, their limited input range remains an issue when they are applied to long documents such as whole novels. One way to alleviate this issue is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one usually has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of each of 40 books.
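To make the idea of document-level context retrieval concrete, the following is a minimal sketch, not the method described above. It assumes the document is already split into sentences and uses a simple word-overlap scorer as a hypothetical stand-in for the neural retriever; the function names (`overlap_score`, `retrieve_context`, `build_ner_input`), the ` [SEP] ` separator, and the choice to place retrieved sentences before the input sentence are all illustrative assumptions.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased word set; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap_score(query: str, candidate: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap between word sets.

    A trained neural retriever would replace this function.
    """
    q, c = tokenize(query), tokenize(candidate)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


def retrieve_context(
    sentences: List[str],
    index: int,
    k: int = 2,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Return the k sentences of the document most relevant to sentences[index]."""
    query = sentences[index]
    scored = [
        (score(query, s), i) for i, s in enumerate(sentences) if i != index
    ]
    # Highest score first; ties broken by earliest position in the document.
    top = sorted(scored, key=lambda p: (-p[0], p[1]))[:k]
    # Restore document order so the retrieved context reads naturally.
    return [sentences[i] for _, i in sorted(top, key=lambda p: p[1])]


def build_ner_input(
    sentences: List[str], index: int, k: int = 2, sep: str = " [SEP] "
) -> str:
    """Join retrieved context and the target sentence (assumed input layout)."""
    context = retrieve_context(sentences, index, k)
    return sep.join(context + [sentences[index]])
```

The resulting string would then be passed to an NER model, with predictions kept only for the tokens of the target sentence. Because the scorer is a parameter, any function with the same signature can be swapped in.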