This paper presents Nabra, a corpus of Syrian Arabic dialects with morphological annotations. To build Nabra, a team of native Syrian speakers collected more than 6K sentences (about 60K words) from several sources, including social media posts, movie and series scripts, song lyrics, and local proverbs. Nabra covers several local Syrian dialects, including those of Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and Suwayda. A team of nine annotators annotated all 60K tokens in their sentence contexts with full morphological annotations. We trained the annotators to follow methodological annotation guidelines to ensure unique morpheme annotations, and we normalized the annotations. F1 and kappa agreement scores ranged between 74% and 98% across features, indicating the high quality of the Nabra annotations. The corpus is open-source and publicly available as part of the Currasat portal (https://sina.birzeit.edu/currasat).
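To make the agreement figures concrete, the following is a minimal sketch of how kappa and F1 agreement between two annotators could be computed for a single morphological feature. It assumes Cohen's kappa and a macro-averaged F1; the abstract does not state which averaging was used, and the function names and the toy part-of-speech labels are hypothetical, not taken from Nabra.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def macro_f1(reference, hypothesis):
    """Macro-averaged F1, treating one annotator as the reference."""
    labels = set(reference) | set(hypothesis)
    scores = []
    for label in labels:
        pairs = list(zip(reference, hypothesis))
        tp = sum(r == label and h == label for r, h in pairs)
        fp = sum(r != label and h == label for r, h in pairs)
        fn = sum(r == label and h != label for r, h in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels for one feature, e.g. part of speech).
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.3f}")
print(f"macro F1 = {macro_f1(annotator_1, annotator_2):.3f}")
```

Running such a computation once per feature (for example part of speech, gender, number) would yield a per-feature range of scores of the kind reported above.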