In this paper, we analyze several key factors that influence performance on the Nuanced Arabic Dialect Identification shared task (NADI 2023), focusing on the first subtask, country-level dialect identification. We examine the effects of surface preprocessing, morphological preprocessing, a FastText vector model, and the weighted concatenation of TF-IDF features. For classification, we use a Linear Support Vector Classification (LSVC) model. In the evaluation phase, our system achieves an F1 score of 62.51%. For comparison, the average F1 score of the systems submitted to the first subtask is 72.91%.
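To make the weighted concatenation of TF-IDF features concrete, the following is a minimal sketch using scikit-learn, not the configuration used in the paper. The choice of word-level and character-level TF-IDF blocks, the n-gram ranges, the weights `word_weight` and `char_weight`, and the toy sentences and labels are all assumptions made for illustration; the preprocessing steps and the FastText vectors are omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_classifier(word_weight=1.0, char_weight=0.5):
    """Return a pipeline: weighted TF-IDF concatenation followed by a linear SVM.

    The weights and n-gram ranges are hypothetical defaults, not values
    reported in the paper.
    """
    features = FeatureUnion(
        transformer_list=[
            # Word-level unigrams and bigrams.
            ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
            # Character n-grams restricted to word boundaries.
            ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        ],
        # Each block is multiplied by its weight before the blocks are
        # concatenated column-wise into a single sparse matrix.
        transformer_weights={"word_tfidf": word_weight, "char_tfidf": char_weight},
    )
    return Pipeline([("features", features), ("svm", LinearSVC())])


# Toy, made-up examples with country-level labels (illustration only).
texts = [
    "إزيك عامل إيه",
    "عامل إيه النهارده",
    "واش راك خويا",
    "كيفاش راك اليوم",
]
labels = ["Egypt", "Egypt", "Algeria", "Algeria"]

model = build_classifier().fit(texts, labels)
predictions = model.predict(texts)
```

Changing `word_weight` and `char_weight` rescales each feature block relative to the other, which is one simple way to study how the weighting affects the classifier.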