Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, but recent work has argued that it should instead be framed as multi-label classification. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative are in fact acceptable in multiple dialects. To address this, we construct a multi-label dataset by generating automatic multi-label annotations with GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). We then train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.
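To make the ALDi-guided aggregation step concrete, the sketch below shows one plausible way GPT-4o labels and binary acceptability scores could be combined into a multi-label annotation. Every name, threshold, and the aggregation rule itself (union of sources, with near-MSA sentences treated as acceptable in all dialects) is an illustrative assumption, not the paper's actual procedure:

```python
def aggregate_labels(gpt4o_labels, clf_scores, aldi,
                     tau=0.5, low_dialectness=0.1):
    """Hypothetical aggregation of dialect labels for one sentence.

    gpt4o_labels: list of dialect codes proposed by GPT-4o.
    clf_scores: dict mapping dialect code -> acceptability probability
                from a per-dialect binary classifier.
    aldi: Arabic Level of Dialectness score in [0, 1] (0 = pure MSA).
    """
    # Assumption: a sentence with very low ALDi is close to MSA and
    # therefore acceptable in every dialect under consideration.
    if aldi < low_dialectness:
        return sorted(clf_scores)

    # Otherwise, take the union of GPT-4o labels and dialects whose
    # binary acceptability score clears the threshold tau.
    labels = set(gpt4o_labels)
    labels |= {d for d, s in clf_scores.items() if s >= tau}
    return sorted(labels)


# Example: GPT-4o says Egyptian; the Levantine classifier also
# accepts the sentence, so both labels are kept.
print(aggregate_labels(["EGY"], {"EGY": 0.9, "LEV": 0.6, "GLF": 0.2},
                       aldi=0.8))
# Example: a near-MSA sentence receives all candidate labels.
print(aggregate_labels([], {"EGY": 0.1, "LEV": 0.2}, aldi=0.05))
```

In this sketch the union rule favors recall, which matches the abstract's observation that treating multi-dialect sentences as negatives is the main source of error; a real system would tune `tau` and the ALDi cutoff on held-out data.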