In this paper, we present our approach to the Nuanced Arabic Dialect Identification (NADI) Shared Task 2023. We focus on our methodology for Subtask 1, which deals with country-level dialect identification. Recognizing dialects plays an instrumental role in improving the performance of downstream NLP tasks such as speech recognition and translation. The task uses a Twitter dataset (TWT-2023) that covers 18 dialects and is framed as a multi-class classification problem. We fine-tune several state-of-the-art transformer-based models, each pre-trained on Arabic text, on the provided dataset, and combine them in an ensemble to improve system performance. Our system achieved an F1-score of 76.65 on the test set, ranking 11th on the leaderboard.
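The abstract does not state which ensembling scheme is used, so the following is only a minimal sketch of one common option, soft voting, in which the class probabilities produced by several fine-tuned models are averaged and the highest-scoring class is selected. The function name `soft_vote`, the three toy "models", and their probability values are hypothetical and serve illustration only; they are not results from the paper.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average class probabilities across models and pick the argmax per example.

    prob_sets[m][i][c] is the probability that (hypothetical) model m assigns
    to class c for example i. All models must score the same examples and
    classes in the same order.
    """
    n_models = len(prob_sets)
    n_examples = len(prob_sets[0])
    predictions = []
    for i in range(n_examples):
        n_classes = len(prob_sets[0][i])
        # Mean probability of each class over all models.
        avg = [
            sum(prob_sets[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        # Index of the class with the highest averaged probability.
        predictions.append(max(range(n_classes), key=avg.__getitem__))
    return predictions


# Toy data (assumed, not from the paper): 3 models, 2 tweets, 3 dialect classes.
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]
model_c = [[0.5, 0.2, 0.3], [0.2, 0.2, 0.6]]

preds = soft_vote([model_a, model_b, model_c])
print(preds)  # [0, 2]
```

In this toy case, `model_a` alone would label the second tweet as class 1, but the averaged probabilities favor class 2, which illustrates how an ensemble can override a single model's error. In the task setting, the inner lists would have 18 entries, one per country-level dialect.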