Named Entity Recognition of Historical Texts via Large Language Model

Large language models (LLMs) have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying named entities in text, such as people, organizations, locations, and dates. NER plays a crucial role in extracting information from unstructured text and supports downstream applications such as information retrieval. Traditionally, NER has been addressed with supervised machine learning approaches, which require large amounts of annotated training data. Historical texts present a particular challenge: annotated datasets are often scarce or nonexistent because manual labeling is costly and requires domain expertise. In addition, the variability and noise inherent in historical language, such as inconsistent spelling and archaic vocabulary, further complicate the development of reliable NER systems for these sources. In this study, we explore the feasibility of applying LLMs to NER in historical documents using zero-shot and few-shot prompting strategies, which require little to no task-specific training data. Our experiments on the HIPE-2022 (Identifying Historical People, Places and other Entities) dataset show that LLMs can achieve reasonably strong NER performance in this setting. Although they fall short of fully supervised models trained on domain-specific annotations, the results are promising. These findings suggest that LLMs offer a viable and efficient alternative for information extraction from low-resource or historically significant corpora, where traditional supervised methods are infeasible.
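
As a minimal illustration of the two prompting strategies, the Python sketch below builds a zero-shot prompt (instructions only) or a few-shot prompt (instructions plus a handful of labeled examples) and filters a model's reply. It is not the prompt or pipeline used in this study: the label set, the JSON answer format, and the function names `build_prompt` and `parse_entities` are assumptions made for illustration, and the call to the model itself is deliberately left out.

```python
import json

# Hypothetical label set for illustration; the HIPE-2022 guidelines
# define the tag inventory actually used in the dataset.
LABELS = ["PER", "ORG", "LOC", "DATE"]


def build_prompt(text, examples=()):
    """Return a zero-shot prompt (no examples) or a few-shot prompt.

    `examples` is a sequence of (text, entities) pairs, where `entities`
    is a list of {"text": ..., "label": ...} dictionaries.
    """
    lines = [
        "Extract the named entities from the historical text below.",
        "Allowed labels: " + ", ".join(LABELS) + ".",
        'Answer with a JSON list of objects like {"text": "...", "label": "..."}.',
        "",
    ]
    # Few-shot: each labeled example is shown in the same layout as the query.
    for ex_text, ex_entities in examples:
        lines.append("Text: " + ex_text)
        lines.append("Entities: " + json.dumps(ex_entities, ensure_ascii=False))
        lines.append("")
    lines.append("Text: " + text)
    lines.append("Entities:")
    return "\n".join(lines)


def parse_entities(response, source_text):
    """Keep well-formed entities whose surface form occurs in the source text."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []  # the reply was not valid JSON
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        surface, label = item.get("text"), item.get("label")
        # Drop unknown labels and spans that do not appear in the input.
        if (isinstance(surface, str) and surface
                and label in LABELS and surface in source_text):
            entities.append((surface, label))
    return entities
```

Calling `build_prompt(text)` with no examples gives the zero-shot variant, and passing a few labeled pairs gives the few-shot variant. Checking each returned span against the source text is one simple guard against entities the model invents, which matters for noisy historical input.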
