We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion as data augmentation for automatic speech recognition (ASR) systems in low- and medium-resource scenarios. Through extensive experiments, we show that our approach allows speech synthesis and voice conversion to improve ASR systems while using only one target-language speaker during model training. We also close the gap between ASR models trained on synthesized speech and those trained on human speech, relative to other works that use many speakers. Finally, we show that our data augmentation method can yield promising ASR training results using only a single real speaker in the target language.