Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention has been given to classifying entities into a hierarchy of fine-grained subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood were manually annotated with the LDC's ACE subtypes. We refer to this extended version of Wojood as WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, obtaining 0.9861 and 0.9889, respectively. To compute the baselines of WojoodFine, we fine-tuned three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER, and nested NER with subtypes, achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.
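The abstract reports IAA with both Cohen's Kappa and F1 but does not say how either was computed. The following is a minimal sketch of one common way to compute them, not the authors' procedure. It assumes Kappa is computed over aligned per-token labels from two annotators, and F1 over sets of (start, end, label) mentions with one annotator treated as the reference. The function names and the toy tag strings are hypothetical and are not taken from the WojoodFine tag set.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mention_f1(reference, candidate):
    """F1 between two annotators' mention sets, treating one as the reference.

    Each mention is a (start, end, label) tuple; a match requires all three
    fields to be identical (an assumption, stricter than partial matching).
    """
    ref, cand = set(reference), set(candidate)
    if not ref and not cand:
        return 1.0
    true_positives = len(ref & cand)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy per-token subtype labels (placeholder tags, not the real tag set).
    annotator_1 = ["SUBTYPE_A", "SUBTYPE_A", "SUBTYPE_B", "SUBTYPE_B"]
    annotator_2 = ["SUBTYPE_A", "SUBTYPE_A", "SUBTYPE_B", "SUBTYPE_A"]
    print("kappa:", cohen_kappa(annotator_1, annotator_2))

    # Toy mention sets: same spans, one disagreement on the subtype.
    mentions_1 = [(0, 1, "SUBTYPE_A"), (2, 3, "SUBTYPE_B")]
    mentions_2 = [(0, 1, "SUBTYPE_A"), (2, 3, "SUBTYPE_C")]
    print("f1:", mention_f1(mentions_1, mentions_2))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when a few subtypes dominate the corpus. The mention-level F1 also penalizes span disagreements, which is relevant when mentions are nested.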