We present the first self-supervised multilingual speech model trained exclusively on African speech. The model was trained on nearly 60,000 hours of unlabeled speech segments covering 21 languages and dialects spoken in sub-Saharan Africa. On the sub-Saharan African (SSA) subset of the FLEURS-102 dataset, our approach, based on a HuBERT$_{base}$ (0.09B) architecture, achieves competitive results on the downstream automatic speech recognition (ASR) task compared to the w2v-bert-51 (0.6B) pre-trained model proposed in the FLEURS benchmark, while being more efficient, using 7x less pre-training data and 6x fewer parameters. Furthermore, on the downstream language identification (LID) task, our approach outperforms the FLEURS baseline accuracy by over 22\%.
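As a quick sanity check on the model-size figures above, the following sketch (illustrative, not part of the paper) instantiates a randomly initialized HuBERT base encoder with Hugging Face transformers, assuming the backbone matches the library's default HuBERT$_{base}$ configuration, and compares its parameter count to the 0.6B w2v-bert-51 baseline cited in the abstract.

```python
# Sketch: verify the ~0.09B parameter count of a HuBERT base encoder.
# Assumptions (not from the paper): the backbone matches the standard
# HuBERT base configuration, which is the default HubertConfig in
# Hugging Face transformers; the 0.6B figure for w2v-bert-51 comes
# from the abstract itself.
from transformers import HubertConfig, HubertModel

config = HubertConfig()       # defaults: 12 layers, hidden size 768 (base)
model = HubertModel(config)   # randomly initialized, no weights downloaded

n_params = sum(p.numel() for p in model.parameters())
print(f"HuBERT base parameters: {n_params / 1e9:.2f}B")         # ~0.09B
print(f"vs. w2v-bert-51 (0.6B): {0.6e9 / n_params:.1f}x fewer")  # ~6x
```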