Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained, lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source nested Arabic named entity corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood were manually annotated with the LDC's ACE subtypes. We refer to this extended version of Wojood as WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, obtaining 0.9861 and 0.9889, respectively. To compute baselines for WojoodFine, we fine-tuned three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER, and nested NER with subtypes, achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.
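The two agreement measures mentioned above can be illustrated with a minimal sketch. It assumes each annotator produced one label per token, which may differ from the unit the authors actually used (the abstract does not say whether agreement was computed over tokens or mentions). The function names and the toy labels (`GPE_A`, `ORG_B`, `O`) are hypothetical and are not the WojoodFine tag names.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def micro_f1(labels_a, labels_b, outside="O"):
    """Micro F1 treating annotator A as reference and B as prediction.

    Items labeled `outside` (assumed to mean "no entity") are not counted
    as entities on either side.
    """
    true_pos = sum(
        1 for x, y in zip(labels_a, labels_b) if x == y and x != outside
    )
    predicted = sum(1 for y in labels_b if y != outside)
    reference = sum(1 for x in labels_a if x != outside)
    if true_pos == 0:
        return 0.0
    precision = true_pos / predicted
    recall = true_pos / reference
    return 2 * precision * recall / (precision + recall)


# Toy data: hypothetical labels, not taken from the corpus.
annotator_a = ["GPE_A", "O", "ORG_B", "O", "GPE_A", "O"]
annotator_b = ["GPE_A", "O", "ORG_B", "O", "O", "O"]

print(f"kappa = {cohen_kappa(annotator_a, annotator_b):.4f}")
print(f"F1    = {micro_f1(annotator_a, annotator_b):.4f}")
```

Kappa corrects the raw agreement rate for the agreement expected by chance, while F1 ignores the (usually dominant) outside label, which is why the two are often reported together for NER annotation.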