SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens, all of which are sense-annotated. The corpus is annotated using two different sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how tokens and senses are associated: instead of linking a token to only one intended sense, SALMA links a token to multiple senses and assigns a score to each sense. A smart web-based annotation tool was developed to support scoring multiple senses against a given word. In addition to sense annotations, we also annotated the corpus with six types of named entities. The quality of our annotations was assessed using various metrics (Kappa, Linear Weighted Kappa, Quadratic Weighted Kappa, Mean Absolute Error, and Root Mean Square Error), which show very high inter-annotator agreement. To establish a Word Sense Disambiguation baseline using our SALMA corpus, we developed an end-to-end Word Sense Disambiguation system using Target Sense Verification. We used this system to evaluate three Target Sense Verification models available in the literature. Our best model achieved an accuracy of 84.2% using Modern and 78.7% using Ghani. The full corpus and the annotation tool are open-source and publicly available at https://sina.birzeit.edu/salma/.
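The agreement metrics named in the abstract can be illustrated with a minimal sketch. It assumes two annotators scoring the same items on a small ordinal scale; the scale `[0, 1, 2]`, the sample scores, and the function names `weighted_kappa`, `mae`, and `rmse` are hypothetical and are not taken from the SALMA corpus or its tooling.

```python
from math import sqrt


def weighted_kappa(a, b, categories, weighting="none"):
    """Cohen's kappa for two annotators over ordered categories.

    weighting is "none" (plain kappa), "linear", or "quadratic".
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed joint distribution of (annotator A label, annotator B label).
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n

    # Marginal distributions, used for the chance-agreement term.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weighting == "none":
            return 0.0 if i == j else 1.0
        d = abs(i - j) / max(k - 1, 1)
        return d if weighting == "linear" else d * d

    pairs = [(i, j) for i in range(k) for j in range(k)]
    obs = sum(disagreement(i, j) * observed[i][j] for i, j in pairs)
    exp = sum(disagreement(i, j) * row[i] * col[j] for i, j in pairs)
    # If no disagreement is possible by chance, treat agreement as perfect.
    return 1.0 if exp == 0 else 1.0 - obs / exp


def mae(a, b):
    """Mean absolute error between two lists of numeric scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def rmse(a, b):
    """Root mean square error between two lists of numeric scores."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical scores from two annotators for four (token, sense) pairs.
annotator_a = [0, 1, 2, 2]
annotator_b = [0, 1, 2, 1]
scale = [0, 1, 2]

for w in ("none", "linear", "quadratic"):
    print(w, round(weighted_kappa(annotator_a, annotator_b, scale, w), 4))
print("MAE", mae(annotator_a, annotator_b))
print("RMSE", rmse(annotator_a, annotator_b))
```

The weighted variants penalize a disagreement by its distance on the scale, so a near miss (2 versus 1) costs less than a far miss (2 versus 0). This makes them a natural fit when senses receive graded scores instead of a single yes/no label.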