Thanks to the rise of self-supervised learning, automatic speech recognition (ASR) systems now achieve near-human performance on a wide variety of datasets. However, they still generalize poorly and are not robust to domain shifts such as accent variation. In this work, we use speech audio representing four different French accents to build fine-tuning datasets that improve the robustness of pre-trained ASR models. By including several accents in the training set, we obtain both in-domain and out-of-domain improvements. Our numerical experiments show that this reduces error rates by up to 25% (relative) on African and Belgian accents compared with single-domain training, while maintaining good performance on standard French.
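Since the 25% figure is a relative reduction rather than an absolute one, a short sketch of the arithmetic may help. The function name `relative_error_reduction` and the numbers below are illustrative assumptions, not results or code from this work.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the reduction of new_error relative to baseline_error.

    Both arguments are error rates in the same unit (e.g. percent).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Illustrative numbers only (hypothetical, not taken from the experiments):
# an error rate falling from 20.0% to 15.0% is a 5-point absolute
# reduction, which corresponds to a 25% relative reduction.
reduction = relative_error_reduction(20.0, 15.0)
print(f"{reduction:.0%}")
```

Under this definition, the same relative reduction corresponds to a smaller absolute gain when the baseline error rate is already low.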