This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, spanning from 1817, and thus covering most of the 19th century, to 1990, in the late 20th century. Eight native Romanian speakers annotated the dataset with five named entity types. Each sample belongs to one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, and Wallachia. We used the dataset to perform several NER experiments with Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. In addition, by reducing the discrepancies between regions through a novel domain adaptation technique, we improved the performance on this corpus to a strict F1-score of 66.80%, an absolute gain of more than 10 percentage points.
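As an illustration of the metric reported above, the following is a minimal sketch of a strict, span-level F1 computation. It assumes the common definition in which a predicted entity counts as correct only when its boundaries and its type both match a gold entity exactly; the abstract does not spell out the definition, so this is an assumption. The `Entity` tuple layout, the `strict_f1` helper, and the labels `PER`, `LOC`, and `ORG` are hypothetical and are not taken from the corpus.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (start_token, end_token_exclusive, label).
Entity = Tuple[int, int, str]


def strict_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> float:
    """Span-level F1 where a match requires identical boundaries and label."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up spans and labels (illustrative only).
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
# The second prediction has the right label but a wrong end boundary,
# so under strict matching it counts as an error.
pred = [(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "ORG")]
score = strict_f1(gold, pred)

# Absolute gain between the two strict F1-scores reported in the abstract,
# expressed in percentage points.
absolute_gain = round(66.80 - 55.69, 2)
```

In the toy example, two of the three predictions match exactly, so precision and recall are both 2/3. Under a more lenient matching scheme the boundary error might earn partial credit, which is why strict scores tend to be lower.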