Named entity recognition and classification (NER) plays a foremost role in capturing the semantics of data, anchoring translation, and supporting downstream historical studies. However, NER on historical text faces challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions that differ greatly from those assumed by contemporary language models. This paper introduces a Korean historical corpus (the Diary of the Royal Secretariat, named SeungJeongWon) that was recorded over several centuries and has recently been augmented with named entity information and phrase markers carefully annotated by historians. We fine-tuned a language model on the historical corpus and conducted extensive comparative experiments using our language model and pretrained multilingual models. We formulated a hypothesis about the combination of time period and annotation information and tested it with a statistical t-test. Our findings show that phrase markers clearly improve the performance of the NER model when predicting unseen entities in documents written in a far different time period. They also show that neither phrase markers nor the corpus-specific trained model, taken individually, improves performance. We discuss future research directions and practical strategies for deciphering historical documents.