In this work, we present AfriHuBERT, an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand coverage to 39 African languages through continued pretraining on over 6,500 hours of speech data aggregated from diverse sources, including 23 newly added languages. We evaluate AfriHuBERT on two key speech tasks, Language Identification (LID) and Automatic Speech Recognition (ASR), using the FLEURS dataset. Our results show a +4% average F1 score improvement for LID and a 1.2% average Word Error Rate (WER) reduction for ASR. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization. The analysis also indicates that FLEURS has data quality limitations that may affect its suitability for evaluating low-resource African languages, suggesting the need for better evaluation benchmarks for these languages.