In this paper, we describe a spoken Arabic dialect identification (ADI) model that consistently outperforms previously published results on two benchmark datasets: ADI-5 and ADI-17. We explore two architectures, ResNet and ECAPA-TDNN, each coupled with two types of acoustic features: MFCCs and features extracted from the pre-trained self-supervised model UniSpeech-SAT Large. We also evaluate a fusion of all four resulting variants. We find that, individually, the ECAPA-TDNN network outperforms ResNet, and models with UniSpeech-SAT features outperform models with MFCCs by a large margin. Furthermore, the fusion of all four variants consistently outperforms the individual models. Our best models reach accuracies of 84.7% on ADI-5 and 96.9% on ADI-17, exceeding previously reported results on both datasets.
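
The abstract does not state how the four variants are fused. As a minimal sketch, assuming score-level fusion by a weighted average of each model's class posteriors, the idea might look like the following. The function names, the placeholder dialect labels, and the logit values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(per_model_logits, weights=None):
    """Weighted average of per-model class posteriors.

    Assumption: equal weights unless given; the paper may use another scheme.
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(per_model_logits[0])
    fused = [0.0] * n_classes
    for w, logits in zip(weights, per_model_logits):
        for i, p in enumerate(softmax(logits)):
            fused[i] += w * p
    return fused


def predict(per_model_logits, labels, weights=None):
    """Return the label with the highest fused posterior."""
    fused = fuse_scores(per_model_logits, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Hypothetical placeholder labels for a five-class task such as ADI-5.
labels = ["dialect_0", "dialect_1", "dialect_2", "dialect_3", "dialect_4"]

# Made-up logits for one utterance from the four variants:
# ResNet+MFCC, ResNet+UniSpeech-SAT, ECAPA-TDNN+MFCC, ECAPA-TDNN+UniSpeech-SAT.
example_logits = [
    [2.0, 0.5, 0.1, 0.0, -1.0],
    [1.5, 1.0, 0.0, 0.2, -0.5],
    [0.2, 2.5, 0.1, 0.0, 0.0],
    [1.8, 0.3, 0.0, 0.1, 0.2],
]

fused = fuse_scores(example_logits)
prediction = predict(example_logits, labels)
```

Averaging posteriors rather than raw logits keeps each model's contribution on the same scale. Weights tuned on a development set would be a natural extension, though nothing in the abstract confirms that choice.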