While recent pre-trained transformer-based models can perform named entity recognition (NER) with high accuracy, their limited context range remains an issue when they are applied to long documents such as whole novels. One way to alleviate this issue is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a BERT-based neural context retriever that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of each of 40 books.
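To make the idea of document-level context retrieval concrete, here is a minimal sketch of the retrieve-then-tag setup: every other sentence in the document is scored against the sentence to be tagged, and the top-k sentences are kept as extra context. The names `overlap_score` and `retrieve_context` are hypothetical, and the scoring function is a toy lexical stand-in, not the BERT-based retriever described above. In the actual method a trained neural scorer would presumably fill the `score` argument.

```python
from typing import Callable, List


def overlap_score(query: str, candidate: str) -> float:
    """Toy relevance score (an assumption, not the paper's retriever):
    number of capitalized tokens shared by the two sentences."""
    def capitalized(sentence: str) -> set:
        tokens = (t.strip(".,;:!?\"'") for t in sentence.split())
        return {t for t in tokens if t[:1].isupper()}
    return float(len(capitalized(query) & capitalized(candidate)))


def retrieve_context(
    sentences: List[str],
    index: int,
    score: Callable[[str, str], float],
    k: int = 1,
) -> List[str]:
    """Return the k sentences of the document that score highest against
    sentences[index], restored to document order."""
    query = sentences[index]
    candidates = [i for i in range(len(sentences)) if i != index]
    # Rank every other sentence by relevance to the query sentence.
    ranked = sorted(candidates, key=lambda i: score(query, sentences[i]), reverse=True)
    # Keep the top k and put them back in their original order.
    return [sentences[i] for i in sorted(ranked[:k])]


document = [
    "Elizabeth Bennet lived at Longbourn.",
    "The weather was fine that day.",
    "Elizabeth walked to Meryton.",
]
context = retrieve_context(document, index=2, score=overlap_score, k=1)
```

The retrieved sentences would then be concatenated with the target sentence before being passed to the NER model. Because `score` is a plain function argument, the toy scorer can be swapped for a learned one without changing the rest of the sketch.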