This paper presents the contribution of our dzNLP team to the NADI 2024 shared task, specifically in Subtask 1 - Multi-label Country-level Dialect Identification (MLDID) (Closed Track). We explored various configurations to address the challenge: in Experiment 1, we utilized a union of n-gram analyzers (word, character, character with word boundaries) with different n-gram values; in Experiment 2, we combined a weighted union of Term Frequency-Inverse Document Frequency (TF-IDF) features with various weights; and in Experiment 3, we implemented a weighted majority voting scheme using three classifiers: Linear Support Vector Classifier (LSVC), Random Forest (RF), and K-Nearest Neighbors (KNN). Our approach, despite its simplicity and reliance on traditional machine learning techniques, demonstrated competitive performance in terms of F1-score and precision. Notably, we achieved the highest precision score of 63.22% among the participating teams. However, our overall F1-score was approximately 21%, significantly impacted by a low recall rate of 12.87%. This indicates that while our models were highly precise, they struggled to recall a broad range of dialect labels, highlighting a critical area for improvement in handling diverse dialectal variations.
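The pipeline described above can be sketched with scikit-learn: a union of word- and character-level TF-IDF vectorizers feeding a weighted majority-voting ensemble of LSVC, RF, and KNN. This is a minimal illustrative sketch, not the team's tuned system; the n-gram ranges, feature weights, voting weights, and the toy dialect snippets and labels below are all placeholder assumptions.

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Weighted union of word, char, and char-with-word-boundary TF-IDF
# features (Experiments 1-2); ranges and weights are illustrative.
features = FeatureUnion(
    transformer_list=[
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("char_wb", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ],
    transformer_weights={"word": 1.0, "char": 0.5, "char_wb": 0.5},
)

# Weighted majority (hard) voting over the three classifiers
# (Experiment 3); the voting weights are again placeholders.
ensemble = VotingClassifier(
    estimators=[
        ("lsvc", LinearSVC()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=1)),
    ],
    voting="hard",
    weights=[2, 1, 1],
)

model = Pipeline([("tfidf", features), ("clf", ensemble)])

# Toy single-label data for illustration; the shared task itself is
# multi-label, which would require a multi-label wrapper on top.
texts = ["wesh rak khoya", "shlonak habibi", "labas alik", "shako mako"]
labels = ["DZ", "IQ", "MA", "IQ"]
model.fit(texts, labels)
pred = model.predict(["rak labas khoya"])
```

For the actual multi-label setting, each base classifier would be wrapped (e.g. one-vs-rest per country) so that a tweet can receive several country labels at once.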